Open mcwitt opened 4 years ago
Thanks for initiating this!
Some feedback:
Decouple workserver-specific information from compound-series-specific information. In this case, fragment_id and project numbers are compound-series-specific information, while the rest of the configuration, involving paths and settings for the analysis pipeline, is workserver-specific. These should be split into distinct files.
Enable multiple compound series files to be specified: It should be easy for us to either specify a list of compound series JSON files to analyze within the workserver file, or better yet, just drop the compound JSON files in a directory and have them picked up automatically.
Separate series data, compound data, and transformation data into separate sections in the compound series JSON file: We should aim to separate these into three sections rather than repeating everything in the transformation entries. Some example data to start with:
Indexed by name (which may be CID + a suffix to denote stereoisomers or different protonation states):
Indexed by RUN:
RUN0
MAT-POS-f42f3716-1
MAT-POS-f42f3716-2
monomer/Mpro-x10789_0_bound-protein-thiolate.pdb
fragment_id (denotes protein structure and reference ligand)
The protein structure identifier needs to reference the protein structure we used for the transformation, which for now is a string like Mpro-x10789_0_bound-protein-thiolate.pdb. Later, this could be something more codified.
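To make the "drop the JSON files in a directory" idea concrete, here is a minimal sketch of the pickup step. The function name `load_compound_series`, the `*.json` naming convention, and the demo file names are assumptions for illustration only, not part of the existing pipeline.

```python
import json
import tempfile
from pathlib import Path


def load_compound_series(directory):
    """Return {file stem: parsed JSON} for every *.json file in `directory`."""
    series = {}
    # sorted() keeps the load order deterministic across runs
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            series[path.stem] = json.load(f)
    return series


# Demo with a throwaway directory holding two hypothetical series files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sprint-3", "sprint-4"):
        (Path(tmp) / f"{name}.json").write_text(json.dumps({"metadata": {"name": name}}))
    (Path(tmp) / "notes.txt").write_text("ignored: not a JSON file")
    loaded = load_compound_series(tmp)

print(sorted(loaded))
```

With this approach the workserver file would only need to name the directory, and adding a new compound series would not require touching the workserver configuration.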
@hannahbrucemacdonald : Anything I'm missing here? Would this support the more complex transformation maps that you had in mind?
For each compound series, we should also include a scaffold_smarts that would match the part of the common scaffold we hope will stay in place. This could be used to either compute the RMSD for the matching atoms in a post-processing step or to impose restraints during the preparation step.
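As a rough sketch of the post-processing option, the RMSD can be restricted to the scaffold atoms once their indices are known. This assumes the indices come from matching scaffold_smarts against the molecule with a cheminformatics toolkit (not shown here) and that both coordinate sets are already in the same reference frame; the function name `scaffold_rmsd` and the toy coordinates are hypothetical.

```python
import math


def scaffold_rmsd(reference_coords, sampled_coords, scaffold_atoms):
    """RMSD over the atom indices in `scaffold_atoms` only, with no realignment."""
    if not scaffold_atoms:
        raise ValueError("scaffold match is empty")
    total = 0.0
    for i in scaffold_atoms:
        # Squared distance between the same atom in the two coordinate sets
        total += sum((a - b) ** 2 for a, b in zip(reference_coords[i], sampled_coords[i]))
    return math.sqrt(total / len(scaffold_atoms))


# Toy coordinates (angstroms): atoms 0 and 1 are the matched scaffold, atom 2 is a substituent
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sampled = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (9.0, 9.0, 9.0)]
rmsd = scaffold_rmsd(reference, sampled, scaffold_atoms=[0, 1])
```

Because only the matched indices contribute, a substituent that moves a lot (atom 2 above) does not inflate the scaffold RMSD.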
SMARTS syntax is documented here: https://en.wikipedia.org/wiki/SMILES_arbitrary_target_specification This tool is useful for understanding SMARTS expressions.
For example, Sprint 3 (benzotriazoles) uses the following scaffold SMARTS to match the common parts of all molecule designs: c1ccc(NC(=O)[C,N]n2nnc3ccccc32)cc1
How did you draw that??
> @hannahbrucemacdonald : Anything I'm missing here? Would this support the more complex transformation maps that you had in mind?
I think the more complicated transformation maps just need to be handled at the analysis end. I think what you've suggested is very comprehensive
I would possibly add something so that it's easy to identify if something has a stereocenter and/or if there's a calculation of an additional stereocenter
> How did you draw that??
It was this great tool from Matthias Rarey (same folks behind proteins.plus).
> I would possibly add something so that it's easy to identify if something has a stereocenter and/or if there's a calculation of an additional stereocenter
This is a good point. We should separate out submitted compounds (the stuff we put in a test tube, which can exist in multiple protonation/tautomeric states and multiple enantiomers/diastereomers if racemic mixtures) from the specific molecules we compute transformations for (which always have a well-defined protonation/tautomeric/stereochemical state).
That suggests we should have two distinct sections in the JSON:
"compounds": {
  "MAT-POS-f42f3716-1": {
    "CID": "MAT-POS-f42f3716-1",
    "pIC50": 4.324,
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1",
    "racemic_mixture": false,
    "multiple_protonation_states": true,
    "multiple_tautomers": false
  },
  "MAT-POS-f42f3716-2": {
    "CID": "MAT-POS-f42f3716-2",
    "pIC50": 4.617,
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(S(C)(=O)=O)cc2Cl)c1",
    "racemic_mixture": false,
    "multiple_protonation_states": false,
    "multiple_tautomers": false
  }
},
"molecules": {
  "MAT-POS-f42f3716-1-1": {
    "name": "MAT-POS-f42f3716-1-1",
    "CID": "MAT-POS-f42f3716-1",
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
  },
  "MAT-POS-f42f3716-1-2": {
    "name": "MAT-POS-f42f3716-1-2",
    "CID": "MAT-POS-f42f3716-1",
    "smiles": "Cc1cc[nH+]cc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
  }
}
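A small sketch of how the two sections relate: each molecule points back to its parent compound through CID, so the molecules can be grouped per compound and dangling references caught early. The helper name `molecules_by_compound` is hypothetical, and the dictionaries are trimmed to the fields needed here.

```python
from collections import defaultdict

compounds = {
    "MAT-POS-f42f3716-1": {"CID": "MAT-POS-f42f3716-1"},
    "MAT-POS-f42f3716-2": {"CID": "MAT-POS-f42f3716-2"},
}
molecules = {
    "MAT-POS-f42f3716-1-1": {"CID": "MAT-POS-f42f3716-1"},
    "MAT-POS-f42f3716-1-2": {"CID": "MAT-POS-f42f3716-1"},
}


def molecules_by_compound(compounds, molecules):
    """Map each compound ID to the names of the molecules enumerated from it."""
    grouped = defaultdict(list)
    for name, molecule in molecules.items():
        cid = molecule["CID"]
        if cid not in compounds:
            raise KeyError(f"molecule {name!r} refers to unknown compound {cid!r}")
        grouped[cid].append(name)
    # Include compounds with no enumerated molecules so they are easy to spot
    return {cid: grouped.get(cid, []) for cid in compounds}


grouped = molecules_by_compound(compounds, molecules)
```

In the example data above, MAT-POS-f42f3716-2 comes back with an empty list, which is one way to notice a compound that has not had its microstates enumerated yet.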
Thanks! Looks good to me. I think it helps to have this stuff upfront in the json
Putting this all together, I think the new compound JSON should look like
{
  "schema_version": 0,
  "metadata": {
    "name": "2020-08-20-benzotriazoles",
    "description": "Sprint 3: Prioritization of benzotriazole derivatives",
    "creator": "John D. Chodera",
    "creation_date": "Thu Aug 20 03:25:55 UTC 2020",
    "xchem_project": "Mpro",
    "biological_assembly": "monomer",
    "protein_variant": "thiolate",
    "temperature": "300*kelvin",
    "ionic_strength": "70*millimolar",
    "pH": 7.4
  },
  "compounds": {
    "MAT-POS-f42f3716-1": {
      "compound_id": "MAT-POS-f42f3716-1",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1",
      "is_racemic_mixture": false,
      "has_multiple_protonation_states": true,
      "has_multiple_tautomers": false,
      "experimental_data": {
        "pIC50": 4.324
      }
    },
    "MAT-POS-f42f3716-2": {
      "compound_id": "MAT-POS-f42f3716-2",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(S(C)(=O)=O)cc2Cl)c1",
      "is_racemic_mixture": false,
      "has_multiple_protonation_states": false,
      "has_multiple_tautomers": false,
      "experimental_data": {
        "pIC50": 4.617
      }
    }
  },
  "molecules": {
    "MAT-POS-f42f3716-1-1": {
      "molecule_id": "MAT-POS-f42f3716-1-1",
      "CID": "MAT-POS-f42f3716-1",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
    },
    "MAT-POS-f42f3716-1-2": {
      "molecule_id": "MAT-POS-f42f3716-1-2",
      "CID": "MAT-POS-f42f3716-1",
      "smiles": "Cc1cc[nH+]cc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
    }
  },
  "transformations": {
    "RUN0": {
      "run": "RUN0",
      "initial_molecule": "MAT-POS-f42f3716-1-1",
      "final_molecule": "MAT-POS-f42f3716-1-2",
      "xchem_fragment_id": "x10789"
    },
    "RUN1": {
      "run": "RUN1",
      "initial_molecule": "MAT-POS-f42f3716-1-1",
      "final_molecule": "MAT-POS-f42f3716-1-3",
      "xchem_fragment_id": "x10789"
    }
  }
}
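One benefit of keeping "molecules" and "transformations" in separate sections is that cross-references can be checked mechanically. The sketch below assumes only the key names from the example above; the function name `undefined_molecule_references` is hypothetical and the embedded document is trimmed to the fields it needs.

```python
import json

doc = json.loads("""
{
  "molecules": {
    "MAT-POS-f42f3716-1-1": {"CID": "MAT-POS-f42f3716-1"},
    "MAT-POS-f42f3716-1-2": {"CID": "MAT-POS-f42f3716-1"}
  },
  "transformations": {
    "RUN0": {"initial_molecule": "MAT-POS-f42f3716-1-1", "final_molecule": "MAT-POS-f42f3716-1-2"},
    "RUN1": {"initial_molecule": "MAT-POS-f42f3716-1-1", "final_molecule": "MAT-POS-f42f3716-1-3"}
  }
}
""")


def undefined_molecule_references(doc):
    """List (run, field, molecule_id) for every endpoint missing from "molecules"."""
    problems = []
    for run, transformation in doc["transformations"].items():
        for field in ("initial_molecule", "final_molecule"):
            molecule_id = transformation[field]
            if molecule_id not in doc["molecules"]:
                problems.append((run, field, molecule_id))
    return problems


problems = undefined_molecule_references(doc)
```

Run against the example as written, this flags RUN1, whose final molecule MAT-POS-f42f3716-1-3 has no entry in "molecules".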
Just for clarity, a "compound" is something that we can broadly ask Enamine to synthesise, but a "compound" has multiple "molecules" enumerating stereochemistry/charge/tautomers that we need to distinguish explicitly in MM forcefields?
> Just for clarity, a "compound" is something that we can broadly ask Enamine to synthesise, but a "compound" has multiple "molecules" enumerating stereochemistry/charge/tautomers that we need to distinguish explicitly in MM forcefields?
Yes!
> Putting this all together, I think the new compound JSON should look like
This looks great to me!
As we continue to add parameters to the analysis pipeline, the current strategy of passing all configuration as a bunch of unstructured command-line arguments is becoming unwieldy.
The configuration should be structured and modularized in such a way as to cleanly separate parameters into understandable groups (e.g. server configuration vs. analysis parameters). In cases where a sub-configuration is repeated, for example the free-energy analysis configuration for both complex and solvent, the schema should be defined in one place only; this will improve consistency and reduce the maintenance burden.
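To illustrate the "define the schema in one place" point, here is a minimal sketch using standard-library dataclasses. Every class and field name in it (`FreeEnergyConfig`, `ServerConfig`, `AnalysisConfig`, `min_num_work_values`, `data_dir`, and so on) is a placeholder chosen for the example, not a proposal for the actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FreeEnergyConfig:
    # Hypothetical analysis parameters, defined once and reused for each phase
    min_num_work_values: int = 10
    work_precision_decimals: int = 3


@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical workserver-specific settings
    data_dir: str
    output_dir: str


@dataclass(frozen=True)
class AnalysisConfig:
    server: ServerConfig
    complex_phase: FreeEnergyConfig
    solvent_phase: FreeEnergyConfig

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        return cls(
            server=ServerConfig(**raw["server"]),
            # Both phases are parsed by the same class, so they cannot drift apart
            complex_phase=FreeEnergyConfig(**raw.get("complex_phase", {})),
            solvent_phase=FreeEnergyConfig(**raw.get("solvent_phase", {})),
        )


config = AnalysisConfig.from_json("""
{
  "server": {"data_dir": "/path/to/data", "output_dir": "/path/to/results"},
  "complex_phase": {"min_num_work_values": 40}
}
""")
```

Unknown keys raise a TypeError at construction time, which gives a basic form of validation without extra dependencies; a dedicated validation library would be another option.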
I propose removing most arguments from the main entry point and accepting a JSON configuration file instead. As an example of a possible schema with the information that might be included:
Open questions
num_procs, cache_dir to be set on the command line).