This PR revises how unique molecules are determined for ANI-2x and updates the dataset YAML files.

Previously, for ANI-2x, the array of atomic species was assumed to uniquely identify a molecule, as the dataset provides no other means of grouping conformers. Every entry in the datafile with the same species array was grouped together, regardless of where it appeared. This resulted in 15992 molecules being identified. The paper does not seem to list the exact number of unique molecules, but states:

"We began by adding molecules with two heavy atoms and iterated at this number until the number of molecules added per iteration began to decrease; then, the process was repeated by adding one heavy atom at a time. In the end, more than 50 active learning cycles were carried out, yielding a data set of 4,695,707 molecular conformations from 13,405 chemical isomers. Combined with the original ANI-1x data set and torsion refinement data set from the ANI-1ccx work, the final ANI-2x data set consists of 8.9 million molecular conformations. "

When I parse ANI-1x I get 3114 molecules, which means there should be 16519 in total for ANI-2x (3114 + 13405). In this PR, we now consider the order in which a conformer appears in the dataset, in addition to the species array. Only consecutive entries with the same atomic species array are considered to be the same molecule for the purposes of grouping conformers. That is, if the same species array is encountered again later in the dataset, it is treated as a new molecule. Molecule names are now built from the string form of the species array with the molecule number appended, to ensure unique names.

This revised approach gives 16514 unique molecules. This is off by 5 (assuming the numbers quoted above are correct), which is a big improvement compared to being off by 527 (16519 - 15992). I tried adding a third filter that considered energy differences between conformers, but was not able to figure out where the remaining 5 come from (there could be a few very similar isomers, or possibly some molecules were filtered out of the ANI-1x subset when ANI-2x was created). I'll note that the number of conformers was correct in both cases, and I do not expect this minor difference to play a substantial role in training.
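The order-aware grouping described above can be sketched as follows. This is a minimal illustration, not the curation code in this PR: the function name `group_consecutive_conformers`, the toy species arrays, and the exact name format (`<species joined by _>_m<molecule number>`) are all assumptions made for the example.

```python
from itertools import groupby


def group_consecutive_conformers(species_per_conformer):
    """Group conformers into molecules by runs of identical species arrays.

    Hypothetical helper: a new molecule starts whenever the species array
    differs from that of the previous conformer, so a species array that
    reappears later in the dataset yields a separate molecule.
    Returns a dict mapping a unique molecule name to its conformer indices.
    """
    molecules = {}
    indexed = enumerate(species_per_conformer)
    runs = groupby(indexed, key=lambda item: tuple(item[1]))
    for mol_number, (species, run) in enumerate(runs):
        # Name format is an assumption: species string plus molecule number.
        name = "_".join(str(z) for z in species) + f"_m{mol_number}"
        molecules[name] = [index for index, _ in run]
    return molecules


# Toy data: atomic numbers per conformer, in dataset order.
species = [[8, 1, 1], [8, 1, 1], [6, 1, 1, 1, 1], [8, 1, 1]]
groups = group_consecutive_conformers(species)

# Grouping on the species array alone (the previous behavior) would merge
# the first two and the last conformer into a single molecule.
n_species_only = len({tuple(s) for s in species})
```

On the toy data, grouping by species array alone finds 2 molecules, while the order-aware grouping finds 3, because the final `[8, 1, 1]` entry is not adjacent to the first two.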

A note: large datasets time out when uploading to Zenodo on my internet connection. A simple workaround is to upload (via the API) from lilac, which has a much faster connection. I will fully document this process and provide sample scripts.

Description

1) modify ANI-2x curation
2) update YAML files so datasets are downloaded from Zenodo

TODO:

[x] upload spice 1 to zenodo
[x] upload spice 2 to zenodo
[x] upload qm9 to zenodo
[ ] upload spice 1 openff to zenodo
[x] upload ANI1x to zenodo
[x] upload ANI2x to zenodo
[x] upload PhAlkEthOH to zenodo

Status

[ ] Ready to go
