Closed ipendlet closed 5 years ago
Resolved the issue flagged by @jschrier. @gcatabr1 will generate a new perov_desc.csv (which contains the calculated features) in response. The file will be committed to master in both capture and report.
Generated new perov_desc.csv(5) file based on updated SMILES. Added to master.
Request by @jschrier to ensure that the SMILES string representations in the chemicals dataframe are implemented correctly (consistently).
Why: We're calculating large swaths of descriptors wrong by using inconsistent protonation conventions, with subsequent impacts on ML performance.
How: Adopt a consistent data representation (e.g. CC[CH3+].[I-]) by modifying the data in the Chemical Inventory Google Sheet. A brief discussion with @Michael Tynes seemed to indicate that ESCALATE does not rely on SMILES strings (with the exception of their use in calculating the features in ChemAxon, which is exactly what we want this to change).
Test: We expect _feat to change (and be less wrong). _rxn outputs should remain the same.
Impact: We're doing it inconsistently. We already use these representations for the majority of organics, so there are no data handling problems. It's just a question of whether there are other links or keys that use SMILES. For the most part, InChIKeys are used, and those remain unchanged.
Estimate dev time: Very low. Data change. Test development. There shouldn't be anything that this breaks. If there is, that's a bigger problem that we'll want to know about ahead of the summit.
Priority: High. I'd rather that this doesn't wait until the end of the month.
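The check described under Test (_feat changes, _rxn stays the same) could be sketched roughly as below. This is only an illustration, not the project's actual test: the column names `_feat_LogP` and `_rxn_M_organic`, the `RunID` key, and the sample values are all made up for the example, and values are compared as raw strings rather than as numbers.

```python
import csv
import io


def changed_columns(old_rows, new_rows, key):
    """Return the set of column names whose value differs for any row sharing `key`."""
    old_by_key = {row[key]: row for row in old_rows}
    changed = set()
    for row in new_rows:
        old = old_by_key.get(row[key])
        if old is None:
            continue  # row only exists in the new file; nothing to compare
        for col, val in row.items():
            # Plain string comparison: "1.0" and "1.00" would count as different.
            if col in old and old[col] != val:
                changed.add(col)
    return changed


def split_by_marker(columns):
    """Split changed column names into those containing _feat and those containing _rxn."""
    feat = sorted(c for c in columns if "_feat" in c)
    rxn = sorted(c for c in columns if "_rxn" in c)
    return feat, rxn


# Hypothetical before/after exports (made-up column names and values).
OLD = "RunID,_rxn_M_organic,_feat_LogP\nrun1,1.5,0.20\nrun2,0.8,0.20\n"
NEW = "RunID,_rxn_M_organic,_feat_LogP\nrun1,1.5,-1.10\nrun2,0.8,-1.10\n"

old_rows = list(csv.DictReader(io.StringIO(OLD)))
new_rows = list(csv.DictReader(io.StringIO(NEW)))

feat_changed, rxn_changed = split_by_marker(
    changed_columns(old_rows, new_rows, key="RunID")
)
print("changed _feat columns:", feat_changed)
print("changed _rxn columns:", rxn_changed)
```

If any _rxn column shows up in the second list after the SMILES update, that would point to a link or key that depends on SMILES and is worth tracking down before the summit.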