This resolves #40 by removing a duplicate molecule from the database. To prevent related issues in the future, functionality is added to allow easy checking of duplicates.
(It turns out that, when the primary data is SMILES strings, checking for duplicates reliably requires canonicalizing every SMILES the same way: go SMILES -> OEMol -> canonical isomeric SMILES, then cross-check the canonical isomeric SMILES in all cases. This procedure hadn't been followed in exactly this way before, which allowed this one duplicate to sneak through.)
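As a rough illustration of the check described above (not necessarily the code added in this PR), here is a minimal sketch. The function names `openeye_canonical_smiles` and `find_duplicates`, and the idea of passing in a dict of compound IDs to SMILES, are assumptions made for this example. It assumes the OpenEye toolkits are installed and licensed; the canonicalizer is passed as an argument so a different one can be swapped in.

```python
from collections import defaultdict


def openeye_canonical_smiles(smiles):
    """SMILES -> OEMol -> canonical isomeric SMILES (needs the OpenEye toolkits)."""
    # Imported lazily so find_duplicates can be used with another canonicalizer.
    from openeye import oechem

    mol = oechem.OEMol()
    if not oechem.OESmilesToMol(mol, smiles):
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # OEMolToSmiles writes canonical isomeric SMILES.
    return oechem.OEMolToSmiles(mol)


def find_duplicates(smiles_by_id, canonicalize=openeye_canonical_smiles):
    """Group compound IDs by canonical SMILES; keep groups with more than one ID.

    smiles_by_id is a hypothetical {compound_id: smiles} mapping.
    """
    groups = defaultdict(list)
    for compound_id, smiles in smiles_by_id.items():
        groups[canonicalize(smiles)].append(compound_id)
    return {can: ids for can, ids in groups.items() if len(ids) > 1}
```

Calling `find_duplicates` on the full ID-to-SMILES mapping should return an empty dict for a clean database; any non-empty result lists the IDs that collapse to the same canonical isomeric SMILES.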
This also makes some other minor changes:
Updates README.md to reflect this, and adds a reference to the latest paper
(I'll also add a separate issue: I think it's time we set up automated testing, and one of the things it should do -- aside from checking that the database can be extracted, etc. -- is check for duplicates.)