BlueObelisk / xml-cml.org

xml-cml.org website
http://www.xml-cml.org/

Review of imported repos #2

Open petermr opened 4 years ago

petermr commented 4 years ago

Imported repos 20191230:

euclid

A Java library for 2D and 3D geometric calculations.

This is fundamental to several other PMR repos, but development is now in the monolithic github.com/petermr/ami3

chemicaltagger

ChemicalTagger is a tool for semantic text-mining in chemistry.

standalone and (I think) currently working well thanks to mjw.

oscar4-cli

A set of small programs to run bits of the OSCAR4 software.

I think this is standalone and working but no evidence.

oscar4

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles.

I think this is working well (mjw).

cmlxom

A Java library for processing CML.

XOM is still a widely used Java tool for XML and this should work with any later versions.

xml-cml.org

I am using the issues in this repo to discuss general problems. Please feel free to transfer them elsewhere.

jumbo-converters

Converters for legacy to and from CML

A large set of modules for converting legacy output into CML. Anything called jumbo-converters-foo is likely to be a module.

cmllite-validator-code

CML validation is best done via xpath expressions, not XML schema

cmllite-validator-ws

uses XSLT

jumbo6

A java editor/browser for CML. Almost certainly out of date.

chemicaltagger-webapp

No recent PMR knowledge

svg

A XOM for SVG.

Superseded by svg in http://github.com/petermr/ami3 . In wide use.

html

A XOM for HTML.

Superseded by html in http://github.com/petermr/ami3 . In wide use.

acpgeo

No immediate comment

jumboconverters-compchem

reads semi-formatted (lineprinter) output from about 10 packages

Useful but needs editing if the output formats change.

wwmm-pom

jumbo-testutil

Utilities to support unit tests

euclid-testutil

Utilities to support unit tests

jumboconverters-parent

parent module for jumbo-converters

cifxom

XOM for CIF (crystallography) files. I would touch base with COD - I suspect this is obsolete.

jumboconverters-cli

CLI for running jumboconverters

I suggest changing to picocli.net - a much better CLI.

jumboconverters-molecule

reads a variety of legacy molecule formats into CML

jumboconverters-top

toplevel module

jumbo-inchi

no immediate knowledge.

schtml

one of many attempts to get a normalised version of HTML for scientific articles

oscar4-uima

UIMA is an (IBM) Open source tool for running conformance and evaluation operations

pub-crawler

crawler for scientific articles/data

Probably obsolete

oscar4-chebi

ChEBI is an EBI chemical library

Suspect this is obsolete.

http-crawler

???

crystaleye

CIF crawler and database

Now merged with COD, so obsolete.

jumboconverters-*:

jumboconverters-spectrum

converts legacy spectra to cml-spect

jumboconverters-template

a general reader for semi-structured documents (e.g. FORTRAN output)

jumboconverters-react

jumboconverters-crystal

jumboconverters-composite

chemtreebank

???

oscar4-taverna

OSCAR under the Taverna workflow. Probably obsolete.

quixote-dicts

dictionaries for the Quixote project

cml-dicts

CML dictionaries

cml-specs

CML specifications

cml-dictionary-*:

cml-dictionary-compchem

compchem dictionary

PMR: we now have a new approach to dictionaries in ami3.

cml-dictionary-units-nonsi

cml-dictionary-unit-types

cml-dictionary-compchem-nwchem

cml-dictionary-compchem-gaussian

cml-dictionary-cml-formula

cml-dictionary-cml-name

cml-dictionary-cml

cml-dictionary-units-si

cml-dictionary-cif

ostueker commented 4 years ago

Thank you @petermr for summarizing the content of the imported repos.

Archiving superseeded repos

As far as I can tell, the modules within the jumboconverters-foo repos are superseded by the submodules within the jumbo-converters repo.

Similarly, newer versions of the dictionaries within the cml-dictionary-bar repos are contained within xml-cml.org.

I am thinking of making final commits to the jumboconverters-foo and cml-dictionary-bar repos, just adding a README.md with (in the case of jumboconverters-foo) the following content:

This repository is depricated as it's content has been inegrated into https://github.com/BlueObelisk/jumbo-converters .

This repository will remain in read-only state for reference.

and then set it to a read-only state by archiving (see the very bottom of e.g. https://github.com/BlueObelisk/jumboconverters-parent/settings ).

Updating pom.xml- and .(hg|git)ignore files.

We need to replace the Bitbucket URLs with the new URLs on github.com/BlueObelisk and replace the .hgignore files with .gitignore files.
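A minimal sketch of the .hgignore conversion, in Python, in case it helps with the bulk of the repos. It relies on how Mercurial reads the file: patterns are regular expressions by default, and a "syntax: glob" or "syntax: regexp" line switches the mode for the lines that follow. The function name hgignore_to_gitignore is hypothetical. The sketch assumes that simple glob patterns (such as *.class or target) can be copied as they are, while regexp patterns are only collected for manual rewriting. Per-line prefixes such as "glob:" or "re:" are not handled.

```python
def hgignore_to_gitignore(hgignore_text):
    """Split .hgignore content into lines usable in .gitignore and
    regexp patterns that still need manual translation.

    Hypothetical helper, not part of any existing tool.
    Returns (gitignore_lines, regexp_patterns_needing_review).
    """
    syntax = "regexp"  # Mercurial's default until a "syntax:" line is seen
    gitignore_lines = []
    needs_review = []
    for raw in hgignore_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            # Blank lines and comments mean the same in both formats.
            gitignore_lines.append(line)
        elif line.startswith("syntax:"):
            # Switch mode for the following patterns; do not copy the line.
            syntax = line.split(":", 1)[1].strip()
        elif syntax == "glob":
            # Assumption: simple globs carry over unchanged.
            gitignore_lines.append(line)
        else:
            # Regular expressions have no direct .gitignore equivalent.
            needs_review.append(line)
    return gitignore_lines, needs_review


if __name__ == "__main__":
    sample = "syntax: glob\ntarget\n*.class\n\nsyntax: regexp\n^\\.settings/\n"
    lines, review = hgignore_to_gitignore(sample)
    print("\n".join(lines))
    print("Needs manual review:", review)
```

Patterns containing a slash deserve a second look even in glob mode, because .gitignore anchors such patterns to the directory of the .gitignore file.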

Most parent POMs also set the UCC Repository, which no longer seems to be available:

<repositories>
  <repository>
    <id>ucc-repo</id>
    <name>UCC Repository</name>
    <url>https://maven.ch.cam.ac.uk/m2repo</url>
  </repository>
</repositories>

This brings me to the next point:

CI-CD and publishing to Maven Repos

I believe we need a place to publish SNAPSHOT artifacts so that we can get CI going: we have plenty of Maven dependencies that point to SNAPSHOT versions, which are not available on "Maven Central". Without a Maven-repository from which the (Travis-)CI jobs can pull those artifacts, we would need to resort to building and installing each SNAPSHOT dependency again in downstream projects.
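To get an overview of how many SNAPSHOT dependencies are involved, something like the following could be run over each pom.xml. This is only a sketch in Python using the standard library XML parser; the function name and the sample coordinates (org.example, lib-a, lib-b) are made up for illustration. It looks only at literal version values, so versions set through ${...} properties or inherited from a parent POM are not resolved, and dependencies listed under dependencyManagement or inside plugins are reported as well.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so POMs with or without xmlns both work."""
    return tag.rsplit("}", 1)[-1]


def snapshot_dependencies(pom_xml):
    """Return (groupId, artifactId, version) for every <dependency>
    whose literal version ends in -SNAPSHOT. Hypothetical helper."""
    root = ET.fromstring(pom_xml.strip())
    found = []
    for elem in root.iter():
        if local_name(elem.tag) != "dependency":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in elem}
        version = fields.get("version", "")
        if version.endswith("-SNAPSHOT"):
            found.append((fields.get("groupId", ""), fields.get("artifactId", ""), version))
    return found


# Illustrative POM; the coordinates are placeholders, not real artifacts.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>lib-a</artifactId>
      <version>1.2-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>lib-b</artifactId>
      <version>2.0</version>
    </dependency>
  </dependencies>
</project>
"""

if __name__ == "__main__":
    for group, artifact, version in snapshot_dependencies(SAMPLE_POM):
        print(f"{group}:{artifact}:{version}")
```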

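I'm really not a Maven expert, but as far as I can tell publishing SNAPSHOT versions to "Maven Central" is at least discouraged. Also, according to this (https://help.github.com/en/github/managing-packages-with-github-packages/configuring-apache-maven-for-use-with-github-packages):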

GitHub Packages does not support SNAPSHOT versions of Apache Maven.

Does anyone have a suggestion how to proceed here?

IMHO it would be nice if we could publish SNAPSHOT versions directly from the CI pipeline, however I would like to avoid having to host our own Nexus server.

License

It seems that, with a few exceptions (oscar4, oscar4-cli, oscar4-chebi & ChemicalTagger), none of the repositories has a LICENSE.txt file in its root directory.

Can I assume that all Java projects are implicitly using the Apache 2.0 license, as they all inherit this setting from the WWMM Parent POM?

Some, but not all, also state the Apache license in the file headers. We could fix the missing file headers using the license-maven-plugin.
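Before configuring the plugin it may be useful to see how many files are affected. The following is a rough Python sketch that only reports Java files lacking a header; it does not modify anything and is not a replacement for the license-maven-plugin. It assumes the headers contain the phrase "Licensed under the Apache License" (as in the standard Apache 2.0 header) within the first 2000 characters. The function name and that limit are arbitrary choices made for this illustration.

```python
from pathlib import Path

# Phrase found in the standard Apache 2.0 source header.
APACHE_MARKER = "Licensed under the Apache License"


def java_files_missing_header(root, marker=APACHE_MARKER, head_chars=2000):
    """Return the .java files under `root` whose first `head_chars`
    characters do not contain `marker`. Hypothetical helper."""
    missing = []
    for path in sorted(Path(root).rglob("*.java")):
        head = path.read_text(encoding="utf-8", errors="replace")[:head_chars]
        if marker not in head:
            missing.append(path)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_headers.py path/to/repo
    for path in java_files_missing_header(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```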

Notable exceptions from the Apache license are all the CML dictionaries, schema, conventions, website, etc., which are "CC BY 3.0", and three of the OSCAR repos, which seem to be under the "Artistic-2.0" license.

Updating the xml-cml.org website.

We could probably host this site directly out of the GitHub repo. The HTML code uses Server-Side Includes, which are not supported by GitHub Pages; however, we could rework the pages to use Jekyll (a Ruby-based templating engine).

I can have a shot at this at some point, but I don't think I'll have time until the summer.

It would be interesting to know who holds the xml-cml.org domain.

petermr commented 4 years ago

I'll take the sections separately...

On Mon, Dec 30, 2019 at 7:24 PM Oliver Stueker notifications@github.com wrote:

Archiving superseeded repos

As far as I can tell, the modules within the jumboconverters-foo repos are superseded by the submodules within the jumbo-converters repo.

Sounds right.

Similarly, newer versions of the dictionaries within the cml-dictionary-bar repos are contained within xml-cml.org.

I don't think the dictionaries are in active use (though I hope we can develop them), so pick whichever seems more up to date. The dictionaries will become much more valuable if we can link them to Wikidata. We have done a lot of this recently and it makes the dictionaries more authoritative.

I am thinking of making final commits to the jumboconverters-foo and cml-dictionary-bar repos, just adding a README.md with (in the case of jumboconverters-foo) the following content:

Yes, fix typos below

This repository is (deprecated) as (its) content has been (integrated) into https://github.com/BlueObelisk/jumbo-converters .

This repository will remain in read-only state for reference.

and then set it to a read-only state by archiving (see the very bottom of e.g. https://github.com/BlueObelisk/jumboconverters-parent/settings ).

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr commented 4 years ago

On Mon, Dec 30, 2019 at 7:24 PM Oliver Stueker notifications@github.com wrote:

Thank you @petermr https://github.com/petermr for summarizing the content of the imported repos.

Archiving superseeded repos

superseded

Updating pom.xml- and .(hg|git)ignore files.

We need to replace the Bitbucket URLs with the new URLs on github.com/BlueObelisk and replace the .hgignore files with .gitignore files.

Most parent POMs also set the UCC Repository, which no longer seems to be available:

<repositories>
  <repository>
    <id>ucc-repo</id>
    <name>UCC Repository</name>
    <url>https://maven.ch.cam.ac.uk/m2repo</url>
  </repository>
</repositories>

I think this could be replaced by Maven Central

This brings me to the next point:

CI-CD and publishing to Maven Repos

I believe we need a place to publish SNAPSHOT artifacts so that we can get CI going: we have plenty of Maven dependencies that point to SNAPSHOT versions, which are not available on "Maven Central".

AFAICR Maven Central requires numbered versions (which is a good thing as SNAPSHOT can refer to many versions).

Without a Maven-repository from which the (Travis-)CI jobs can pull those artifacts, we would need to resort to building and installing each SNAPSHOT dependency again in downstream projects.

I'm really not a Maven expert, but as far as I can tell publishing SNAPSHOT versions to "Maven Central" is at least discouraged. Also, according to this https://help.github.com/en/github/managing-packages-with-github-packages/configuring-apache-maven-for-use-with-github-packages

GitHub Packages does not support SNAPSHOT versions of Apache Maven.

Does anyone have a suggestion how to proceed here?

We should create proper versions. This is only a problem if there are many interdependencies between repos, e.g. A>B>C, so that if A changes then B and C must be verified. I ran into this problem with AMI, which had a stack of nearly 10, and so I bundled them all together.
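The A>B>C case can be made concrete with a small sketch (Python, standard library only). Given a map from each project to the projects it depends on, it lists everything downstream of a changed project, which is the set that would need to be rebuilt and verified. The project names are the placeholders used above and the function name is hypothetical.

```python
from collections import deque


def downstream_of(changed, depends_on):
    """Return all projects that directly or transitively depend on
    `changed`, in breadth-first order. Hypothetical helper.

    `depends_on` maps a project name to the list of projects it uses.
    """
    # Invert the edges: for each project, who depends on it?
    dependents = {}
    for project, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(project)

    seen = set()
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in sorted(dependents.get(current, ())):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    # A is used by B, and B is used by C (the A>B>C stack).
    stack = {"A": [], "B": ["A"], "C": ["B"]}
    print(downstream_of("A", stack))
```

With a stack of nearly 10 repos the list for the bottom project contains almost everything, which is the argument for bundling; for a project near the top the list is short, which is where separate releases stay cheap.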

IMHO it would be nice if we could publish SNAPSHOT versions directly from the CI pipeline, however I would like to avoid having to host our own Nexus server.

Agree with sentiment


petermr commented 4 years ago

On Mon, Dec 30, 2019 at 7:24 PM Oliver Stueker notifications@github.com wrote:

Thank you @petermr https://github.com/petermr for summarizing the content of the imported repos.

License

It seems that, with a few exceptions (oscar4, oscar4-cli, oscar4-chebi & ChemicalTagger), none of the repositories has a LICENSE.txt file in its root directory.

Probably true.

Can I assume that all Java projects are implicitly using the Apache 2.0 license, as they all inherit this setting from the WWMM Parent POM?

Yes. I think all authors came from PMR group or close associates.

Some, but not all, also state the Apache license in the file headers. We could fix the missing file headers using the license-maven-plugin.

Agreed

Notable exceptions from the Apache license are all the CML dictionaries, schema, conventions, website, etc., which are "CC BY 3.0", and three of the OSCAR repos, which seem to be under the "Artistic-2.0" license.

We picked Artistic before Apache became common. I doubt there are many authors who would object to a change to Apache. Suggest we post this suggestion on Blue Obelisk and give a deadline after which we convert.

Updating the xml-cml.org website.

We could probably host this site directly out of the GitHub repo. The HTML code uses Server-Side Includes, which are not supported by GitHub Pages; however, we could rework the pages to use Jekyll (a Ruby-based templating engine).

I can have a shot at this at some point, but I don't think I'll have time until the summer.

It would be interesting to know who holds the xml-cml.org domain.

I think Henry does.


ostueker commented 4 years ago

Archiving superseded repos

I've added README.md files with a deprecation message to those jumboconverters-* and cml-dictionary-* repos where I was very confident that more recent versions of their content are present in the jumbo-converters and xml-cml.org repos, and then archived them.

Similarly newer versions of the dictionaries within the cml-dictionary-bar repos are contained within xml-cml.org.

I don't think the dictionaries are in active use (though I hope we can develop them), so pick whichever seems more up to date. The dictionaries will become much more valuable if we can link them to Wikidata. We have done a lot of this recently and it makes the dictionaries more authoritative.

Back in 2015/2016 our group did some work on the CML dictionaries, which is now waiting to be merged in xml-cml.org/#1 (https://github.com/BlueObelisk/xml-cml.org/pull/1).

CI-CD and publishing to Maven Repos

[...] AFAICR Maven Central requires numbered versions (which is a good thing as SNAPSHOT can refer to many versions).

[...] We should create proper versions. This is only a problem if there are many interdependencies of repos, e.g A>B>C so that if A changes then B and C must be verified. I ran into this problem with AMI which had a stack of nearly 10 and so I bundled them all together.

Yes, bundling dependent packages is one solution to this problem. Another would be to release more frequently: whenever project A implements a new feature or fixes a bug that project B needs, a new release of project A is created. As long as Semantic Versioning (https://semver.org/) is used, the impact on downstream projects should not be dramatic. And if CI pipelines exist, testing happens frequently and problems won't stay hidden for very long.
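To illustrate what Semantic Versioning promises to a downstream project, here is a small Python sketch. It treats an upgrade as safe when the major version is unchanged and the new version is not older, and it treats 0.y.z versions as never safe, since the specification reserves major version zero for initial development where anything may change. The function names are hypothetical, and only plain MAJOR.MINOR.PATCH strings are accepted: pre-release suffixes (including Maven's -SNAPSHOT) are rejected rather than interpreted.

```python
import re

# Plain MAJOR.MINOR.PATCH only; suffixes are deliberately not supported here.
_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints. Hypothetical helper."""
    match = _SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_compatible_upgrade(current, candidate):
    """True if moving from `current` to `candidate` should not break
    a downstream project, under the assumptions stated above."""
    cur = parse_version(current)
    cand = parse_version(candidate)
    if cur[0] == 0:
        # Major version zero: no compatibility promise at all.
        return False
    return cand[0] == cur[0] and cand >= cur


if __name__ == "__main__":
    for old, new in [("1.2.3", "1.3.0"), ("1.2.3", "2.0.0"), ("0.4.0", "0.5.0")]:
        print(old, "->", new, is_compatible_upgrade(old, new))
```

So if project B is built against A 1.2.3, a new A 1.3.0 should be a drop-in upgrade, whereas A 2.0.0 signals that B needs to be checked.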

IMHO it would be nice if we could publish SNAPSHOT versions directly from the CI pipeline, however I would like to avoid having to host our own Nexus server.

Agree with sentiment

I'll probably experiment with GitHub Actions and GitHub Packages soon, using wwmm-parent, euclid and cmlxom. Staying within the GitHub platform should make it easy to use the necessary GH_TOKENS for authentication against the repo.

License

Can I assume that all Java projects are implicitly using the Apache 2.0 license, as they all inherit this setting from the WWMM Parent POM?

Yes. I think all authors came from PMR group or close associates.

Good. Whenever I start working on a repo that is still lacking a LICENSE file, I will create a pull request adding the Apache license.

Peter, if you don't mind I'll assign those PRs to you so that you can merge them. There won't be merge-conflicts so it will be a simple click.

We picked Artistic before Apache became common. I doubt there are many authors who would object to a change to Apache. Suggest we post this suggestion on Blue Obelisk and give a deadline after which we convert.

As far as I could see earlier, the Artistic-2.0 license is only used by some OSCAR repos. To me, one of the licenses is as good as the other. I would leave it up to @petermr and @mjw99 to decide whether to change the license or not.

ostueker commented 4 years ago

Updating the xml-cml.org website.

We could probably host this site directly out of the GitHub repo. The HTML code uses Server-Side-Includes, which are not supported by GitHub-Pages, however we could rework the page into using Jekyll (Ruby based templating engine).

I can have a shot at this at some point, but don't think I'll have time until the summer.

Interesting to know who holds the xml-cml.org domain.

I think Henry does.

Pinging @hrzepa .

petermr commented 4 years ago

Much of our software is analogous to mines: its value varies according to what the world is interested in. For example, if people are interested in extracting data from Gaussian log files, jumbo-converters can do this. There's a cyclic gotcha: people won't mine logfiles unless there is working software, and it's a labour of love to write software in advance of demand. What you (Oliver) have done is very valuable: preserving the reserves and making them more accessible. My hope is that if they were displayed again then people might pick them up, use them and start again.

I think the next action is probably to create a spreadsheet/webpage prospectus of what there is, what it does, hopefully an example or two, and also to preserve the history and authorship.
