Smiles string convention - formalize

ipendlet commented 5 years ago

Request by @jschrier to ensure that the SMILES string representations in the chemicals dataframe are correctly (consistently) implemented.

Why: We're calculating large swaths of descriptors incorrectly because of inconsistent protonation conventions, which in turn affects ML performance.

How: Adopt a consistent data representation (e.g. CC[NH3+].[I-]) by modifying the data in the Chemical Inventory Google sheet. A brief discussion with @Michael Tynes seemed to indicate that ESCALATE does not rely on SMILES strings, with the exception of their use in calculating the features in ChemAxon, which is exactly what we want to change.
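A rough way to screen the inventory for entries that break the convention, before handing them to ChemAxon, might look like the sketch below. It assumes the convention is "salts are written as charge-separated ions that sum to zero, everything else is a single neutral species", which is my reading of the example above rather than an agreed rule. The function names are hypothetical, and the regex only reads charges on bracket atoms; it is not a SMILES parser and does not validate chemistry.

```python
import re

# Trailing charge inside a bracket atom: "+", "-", "++", "--", "+2", "-3", ...
CHARGE = re.compile(r"(\++|-+)(\d*)$")


def formal_charge(bracket_body):
    """Charge of one bracket atom body, e.g. 'NH3+' -> 1, 'O--' -> -2."""
    match = CHARGE.search(bracket_body)
    if match is None:
        return 0
    signs, digits = match.groups()
    sign = 1 if signs[0] == "+" else -1
    magnitude = int(digits) if digits else len(signs)
    return sign * magnitude


def component_charges(smiles):
    """Net charge of each dot-separated component of a SMILES string."""
    return [
        sum(formal_charge(body) for body in re.findall(r"\[([^\]]+)\]", part))
        for part in smiles.split(".")
    ]


def follows_salt_convention(smiles):
    """Assumed rule: one neutral species, or several charged ions summing to zero."""
    charges = component_charges(smiles)
    if len(charges) == 1:
        return charges[0] == 0
    return sum(charges) == 0 and all(charge != 0 for charge in charges)


# Illustrative inputs only; the real entries live in the Chemical Inventory sheet.
for entry in ["CC[NH3+].[I-]", "CCN.I", "[Pb+2].[I-].[I-]"]:
    print(entry, follows_salt_convention(entry))
```

Entries that come back False (such as the neutral amine plus HI form `CCN.I`) would be candidates for manual review, not automatic rewriting.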

Test: We expect _feat to change (and be less wrong). _rxn outputs should remain the same.
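One possible shape for that test, as a minimal sketch: diff the old and new output tables and assert that every column that changed starts with `_feat`. The key column `name`, the column names `_rxn_temperature` and `_feat_example`, and all values are made up for illustration; the real files and their layout may differ.

```python
import csv
import io


def changed_columns(old_csv, new_csv, key):
    """Return the set of columns whose value differs for any row present in both tables."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}
    changed = set()
    for row_id in old.keys() & new.keys():
        for column, value in old[row_id].items():
            if new[row_id].get(column) != value:
                changed.add(column)
    return changed


def only_features_changed(old_csv, new_csv, key):
    """True when every changed column is a _feat column, i.e. _rxn outputs are untouched."""
    return all(col.startswith("_feat") for col in changed_columns(old_csv, new_csv, key))


# Made-up example tables: one feature value differs, the reaction column does not.
OLD = "name,_rxn_temperature,_feat_example\nrun1,80,1.5\nrun2,95,2.0\n"
NEW = "name,_rxn_temperature,_feat_example\nrun1,80,1.7\nrun2,95,2.0\n"

print(changed_columns(OLD, NEW, key="name"))
print(only_features_changed(OLD, NEW, key="name"))
```

The comparison is on raw strings, so a formatting-only difference (for example `80` versus `80.0`) would be reported as a change; a real test would probably parse numbers first.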

Impact: We're doing it inconsistently. We already use these representations for the majority of organics, so there are no data-handling problems. The only question is whether there are other links or keys that use SMILES. For the most part, InChIKeys are used, and those remain unchanged.

Estimated dev time: Very low. This is a data change plus test development. Nothing should break; if something does, that's a bigger problem that we'll want to know about ahead of the summit.

Priority: High. I'd rather that this doesn't wait until the end of the month.

ipendlet commented 5 years ago

Resolved the issue flagged by @jschrier. @gcatabr1 will generate a new perov_desc.csv (which contains the calculated features) in response. The file will be committed to master in both capture and report.

gcatabr1 commented 5 years ago

Generated new perov_desc.csv(5) file based on updated SMILES. Added to master.

darkreactions / ESCALATE_Capture

Smiles string convention - formalize #51