Open jchodera opened 3 years ago
with the risk of me not fully comprehending the structure - perhaps it would be useful to supply SMILES for each transformation besides just the compound ID? Might save people the extra step of having to write queries to find the SMILES.
I would suggest adding some sort of documentation (potentially as docstrings in schema.py?) to create some clarity on the contents (e.g. metadata vs microdata, gens vs clones, etc.)
> I would suggest adding some sort of documentation (potentially as docstrings in schema.py?) to create some clarity on the contents (e.g. metadata vs microdata, gens vs clones, etc.)
Agreed, that's critical now.
@glass-w : Could you implement this in a PR? You would modify schema.py to define `Field(default, description=...)` objects with the `description` kwarg specified?
For example, we would change this:

```python
class CompoundMetadata(Model):
    compound_id: str
    smiles: str
    experimental_data: Dict[str, float]
```
to
```python
from pydantic import Field

class CompoundMetadata(Model):
    compound_id: str = Field(None, description='The unique compound identifier (PostEra or enumerated ID)')
    smiles: str = Field(None, description='The SMILES string defining the compound in a canonical protonation state. Stereochemistry will be ambiguous for racemates.')
    experimental_data: Dict[str, float] = Field(default_factory=dict, description='Optional experimental data fields, such as "pIC50"')
```
More info here.
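One payoff of adding `description` kwargs is that they propagate into the generated JSON schema, so the documentation ships with the spec itself. A minimal runnable sketch (the model is re-declared here on `BaseModel` for illustration; pydantic v1-style `.schema()` is assumed):

```python
from typing import Dict
from pydantic import BaseModel, Field

class CompoundMetadata(BaseModel):
    compound_id: str = Field(None, description='The unique compound identifier (PostEra or enumerated ID)')
    smiles: str = Field(None, description='The SMILES string defining the compound in a canonical protonation state')
    experimental_data: Dict[str, float] = Field(default_factory=dict, description='Optional experimental data fields, such as "pIC50"')

# Field descriptions appear under "properties" in the generated JSON schema
schema = CompoundMetadata.schema()
print(schema["properties"]["smiles"]["description"])
```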
> with the risk of me not fully comprehending the structure - perhaps it would be useful to supply SMILES for each transformation besides just the compound ID? Might save people the extra step of having to write queries to find the SMILES.
@JenkeScheen: This sounds like a reasonable tradeoff for convenience!
What other info would you like for each transformation?
I think the data looks complete; if you could point me to your methods I might be able to pinpoint more points of interest, but at least for my purposes this would do. On a side note, do you have a recommended API for parsing this file? I've been looking for a way to use schema.py with pydantic to load the file, without much luck.
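For reference, this is the kind of loading pattern I had in mind — a sketch that parses a JSON record into a pydantic model. The model is re-declared here on `BaseModel` so the snippet is self-contained; in practice it would be imported from schema.py, and the field values are made up:

```python
import json
from typing import Dict
from pydantic import BaseModel, Field

class CompoundMetadata(BaseModel):
    compound_id: str = Field(None, description='Unique compound identifier')
    smiles: str = Field(None, description='Canonical SMILES string')
    experimental_data: Dict[str, float] = Field(default_factory=dict, description='Optional experimental data')

# Parse one JSON record into a validated model instance
record = json.loads('{"compound_id": "MAT-POS-abc123-1", "smiles": "CCO", "experimental_data": {"pIC50": 4.5}}')
compound = CompoundMetadata(**record)
print(compound.compound_id)
```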
I suggest adding to the metadata a protocol keyword that cross-references a loosely formatted protocol dictionary. The free energies for the same transformation may be protocol-dependent, and it may be useful in the future to analyse this if datasets are processed multiple times with different protocols. The protocol may also be a good place to define how statistical uncertainties are estimated; `stderr` alone doesn't quite define it (e.g. number of replicates, measures for decorrelating samples, ...)
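To sketch what such a protocol dictionary could look like (every key and value below is hypothetical, not part of the current spec):

```python
# Hypothetical protocol dictionary, cross-referenced by a "protocol_id" in each
# transformation's metadata. All field names here are assumptions for illustration.
protocol = {
    "protocol_id": "sprint-4-neq-switching-v1",
    "n_replicates": 3,  # replicate count behind the reported stderr
    "decorrelation": "subsampling by estimated statistical inefficiency",
    "uncertainty_estimate": "standard error over decorrelated replicate samples",
}
print(protocol["protocol_id"])
```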
It would also be useful to cross-link the experimental data to a particular version maintained by the people making the measurements. Experimental data can change over time, particularly for live projects.
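Concretely (field names and values hypothetical), each experimental record could carry a provenance pointer alongside the measurements:

```python
# Hypothetical provenance cross-link for experimental data; all keys are
# assumptions for illustration, not part of the current spec.
experimental_data = {"pIC50": 4.7}
experimental_data_provenance = {
    "source": "PostEra activity data",
    "version": "2021-03-15",  # snapshot identifier maintained by the measuring group
}
print(experimental_data_provenance["version"])
```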
Also, I assume it would be easy to work out where the 3D inputs and parameters for each transformation are from the current JSON specs (given access to the full 3D dataset)?
This issue is for discussing what we might want to add to the JSON spec for analyzed sprint data.
The JSON spec is programmatically defined in schema.py using pydantic. Here's an example of the current sprint 4 JSON:
What additional data should we be storing?
cc: @alphaleegroup @glass-w @jmichel80 @JenkeScheen @ppxasjsm