Move configuration to JSON

mcwitt commented 4 years ago

As we continue to add parameters to the analysis pipeline, the current strategy of passing all configuration as a bunch of unstructured command-line arguments is becoming unwieldy.

The configuration should be structured and modularized in such a way as to cleanly separate parameters into understandable groups (e.g. separating server configuration from analysis parameters). In cases where a sub-configuration is repeated, for example the free-energy analysis configuration for both complex and solvent, the schema should be defined in one place only; this will improve consistency and reduce the maintenance burden.

I propose removing most arguments from the main entry point and accepting a JSON configuration file instead. Here is an example of a possible schema, with the information that might be included:

{
  "schema_version": 0,
  "run_details_json_file": "/home/server/2020-08-14-nucleophilic-displacement.json",
  "complex_project": {
    "path": "/home/server/server2/projects/13422",
    "data_path": "/home/server/server2/data/SVR314342810/PROJ13422"
  },
  "solvent_project": {
    "path": "/home/server/server2/projects/13422",
    "data_path": "/home/server/server2/data/SVR314342810/PROJ13422"
  },
  "binding_analysis": {
    "complex_phase": {
      "min_num_work_values": 40,
      "work_precision_decimals": 3,
      "filter_work_values": {
        "max_value": 1e4,
        "max_ndevs": 5
      }
    },
    "solvent_phase": {
      "min_num_work_values": 40,
      "work_precision_decimals": 3,
      "filter_work_values": {
        "max_value": 1e4,
        "max_ndevs": 5
      }
    }
  },
  "structure_snapshots": {
    "fragment_id": "x10789",
    "max_binding_delta_f": -999,
    "output_path": "./structures"
  }
}
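One way to keep the per-phase schema defined in a single place is to parse both complex_phase and solvent_phase with the same class. The following is only a minimal sketch using standard-library dataclasses; the class names (FilterWorkValues, PhaseAnalysis, BindingAnalysis) are hypothetical and not part of the existing code, and only the field names are taken from the example above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterWorkValues:  # hypothetical name
    max_value: float
    max_ndevs: float


@dataclass(frozen=True)
class PhaseAnalysis:  # hypothetical name; defined once, used for both phases
    min_num_work_values: int
    work_precision_decimals: int
    filter_work_values: FilterWorkValues

    @classmethod
    def from_dict(cls, d):
        return cls(
            min_num_work_values=d["min_num_work_values"],
            work_precision_decimals=d["work_precision_decimals"],
            filter_work_values=FilterWorkValues(**d["filter_work_values"]),
        )


@dataclass(frozen=True)
class BindingAnalysis:  # hypothetical name
    complex_phase: PhaseAnalysis
    solvent_phase: PhaseAnalysis

    @classmethod
    def from_dict(cls, d):
        return cls(
            complex_phase=PhaseAnalysis.from_dict(d["complex_phase"]),
            solvent_phase=PhaseAnalysis.from_dict(d["solvent_phase"]),
        )


# Trimmed copy of the "binding_analysis" section from the example above.
EXAMPLE = """
{
  "binding_analysis": {
    "complex_phase": {
      "min_num_work_values": 40,
      "work_precision_decimals": 3,
      "filter_work_values": {"max_value": 1e4, "max_ndevs": 5}
    },
    "solvent_phase": {
      "min_num_work_values": 40,
      "work_precision_decimals": 3,
      "filter_work_values": {"max_value": 1e4, "max_ndevs": 5}
    }
  }
}
"""

config = BindingAnalysis.from_dict(json.loads(EXAMPLE)["binding_analysis"])
print(config.complex_phase.filter_work_values.max_value)
```

Because both phases go through PhaseAnalysis.from_dict, adding or renaming a per-phase parameter would only need to happen in one class.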

Open questions

Are we missing essential configuration options? (keeping in mind the code should be flexible to easily iterate on the schema)
What information should be captured in JSON configuration versus CL arguments? I'm leaning toward requiring anything that could affect the output to be configured in the JSON (possibly allowing num_procs, cache_dir to be set on the command line).
Should we separate configuration into multiple JSONs? E.g. @jchodera suggested a server-level configuration (containing things like path prefixes) separate from project-level.
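To make the second question concrete, here is a rough sketch of an entry point that takes everything affecting the output from the JSON file and leaves only num_procs and cache_dir on the command line. The flag spellings (--num-procs, --cache-dir), the positional name config_file, and the defaults are assumptions for illustration, not the current interface.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run the analysis pipeline from a JSON configuration file."
    )
    # Everything that can affect the output lives in this file.
    parser.add_argument("config_file", help="path to the JSON configuration")
    # Execution-only settings that should not change results.
    parser.add_argument("--num-procs", type=int, default=1,
                        help="number of worker processes (assumed default)")
    parser.add_argument("--cache-dir", default=None,
                        help="directory for cached intermediate results")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["analysis.json", "--num-procs", "8"])
print(args.config_file, args.num_procs, args.cache_dir)
```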

jchodera commented 4 years ago

Thanks for initiating this!

Some feedback:

Decouple workserver-specific information from compound-series-specific information. In this case, fragment_id and project numbers are compound-series-specific information, while the rest of the configuration, involving paths and settings for the analysis pipeline, is workserver-specific information. These should be split into distinct files.

Enable multiple compound series files to be specified: It should be easy for us to either specify a list of compound series JSON files to analyze within the workserver file, or better yet, just drop the compound JSON files in a directory and have them picked up automatically.

Separate series data, compound data, and transformation data into separate sections in the compound series JSON file: We should aim to separate these into three sections rather than repeating everything in the transformation entries. Some example data to start with:

Series data

Series name
Series description or annotation
Date or timestamp
Project numbers
Sprint metadata (e.g. "Sprint 3"?)

Compound data

Indexed by name (which may be CID + a suffix to denote stereoisomers or different protonation states):

PostEra compound ID (CID)
Canonical isomeric SMILES
Any metadata that is available, such as experimental pIC50, that could be used to reconstruct free energies

Transformation data

Indexed by RUN:

RUN (e.g. RUN0)
initial compound name (e.g. MAT-POS-f42f3716-1)
final compound name (e.g. MAT-POS-f42f3716-2)
protein structure identifier used in simulation (e.g. monomer/Mpro-x10789_0_bound-protein-thiolate.pdb)
reference fragment_id (denotes protein structure and reference ligand)

The protein structure identifier needs to reference the protein structure we used for the transformation, which for now is a string like Mpro-x10789_0_bound-protein-thiolate.pdb. Later, this could be something more codified.
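As a sketch of the "drop the compound JSON files in a directory" idea above: the function name load_compound_series and the choice to key results by file stem are assumptions, and the demonstration writes throwaway files rather than real series data.

```python
import json
import tempfile
from pathlib import Path


def load_compound_series(directory):
    """Return {file stem: parsed JSON} for every *.json file in directory."""
    series = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            series[path.stem] = json.load(f)
    return series


# Demonstration with a temporary directory holding two placeholder series files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2020-08-14-nucleophilic-displacement", "2020-08-20-benzotriazoles"):
        (Path(tmp) / f"{name}.json").write_text(json.dumps({"name": name}))
    # Files without a .json suffix are not picked up.
    (Path(tmp) / "notes.txt").write_text("not a series file")
    loaded = load_compound_series(tmp)

print(sorted(loaded))
```

With this approach the workserver file would only need to name the directory, not each series file.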

@hannahbrucemacdonald : Anything I'm missing here? Would this support the more complex transformation maps that you had in mind?

jchodera commented 4 years ago

For each compound series, we should also include a scaffold_smarts that would match the part of the common scaffold we hope will stay in place. This could be used to either compute the RMSD for the matching atoms in a post-processing step or to impose restraints during the preparation step.

SMARTS syntax is documented here: https://en.wikipedia.org/wiki/SMILES_arbitrary_target_specification This tool is useful for understanding SMARTS expressions.

For example, Sprint 3 (benzotriazoles) uses the following scaffold SMARTS to match the common parts of all molecule designs: c1ccc(NC(=O)[C,N]n2nnc3ccccc32)cc1

hannahbrucemacdonald commented 4 years ago

How did you draw that??

hannahbrucemacdonald commented 4 years ago

@hannahbrucemacdonald : Anything I'm missing here? Would this support the more complex transformation maps that you had in mind?

I think the more complicated transformation maps just need to be handled at the analysis end. I think what you've suggested is very comprehensive

hannahbrucemacdonald commented 4 years ago

I would possibly add something so that it's easy to identify if something has a stereocenter and/or if there's a calculation of an additional stereocenter

jchodera commented 4 years ago

How did you draw that??

It was this great tool from Matthias Rarey (same folks behind proteins.plus).

I would possibly add something so that it's easy to identify if something has a stereocenter and/or if there's a calculation of an additional stereocenter

This is a good point. We should separate out submitted compounds (the stuff we put in a test tube, which can exist in multiple protonation/tautomeric states and multiple enantiomers/diastereomers if racemic mixtures) from the specific molecules we compute transformations for (which always have a well-defined protonation/tautomeric/stereochemical state).

That suggests we should have two distinct sections in the JSON:

"compounds": {
  "MAT-POS-f42f3716-1": {
    "CID": "MAT-POS-f42f3716-1",
    "pIC50": 4.324,
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1",
    "racemic_mixture": false,
    "multiple_protonation_states": true,
    "multiple_tautomers": false
  },
  "MAT-POS-f42f3716-2": {
    "CID": "MAT-POS-f42f3716-2",
    "pIC50": 4.617,
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(S(C)(=O)=O)cc2Cl)c1",
    "racemic_mixture": false,
    "multiple_protonation_states": false,
    "multiple_tautomers": false
  }
},
"molecules": {
  "MAT-POS-f42f3716-1-1": {
    "name": "MAT-POS-f42f3716-1-1",
    "CID": "MAT-POS-f42f3716-1",
    "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
  },
  "MAT-POS-f42f3716-1-2": {
    "name": "MAT-POS-f42f3716-1-2",
    "CID": "MAT-POS-f42f3716-1",
    "smiles": "Cc1cc[nH+]cc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
  }
}
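To illustrate how the two sections relate, here is a small sketch that groups molecules under their parent compound via the CID field. The helper name molecules_by_compound is hypothetical, and the records are trimmed to the fields needed for the grouping.

```python
from collections import defaultdict

# Trimmed "molecules" section from the example above (SMILES omitted).
molecules = {
    "MAT-POS-f42f3716-1-1": {"name": "MAT-POS-f42f3716-1-1", "CID": "MAT-POS-f42f3716-1"},
    "MAT-POS-f42f3716-1-2": {"name": "MAT-POS-f42f3716-1-2", "CID": "MAT-POS-f42f3716-1"},
}


def molecules_by_compound(molecules):
    """Map each compound CID to the list of molecule names derived from it."""
    grouped = defaultdict(list)
    for molecule_name, record in molecules.items():
        grouped[record["CID"]].append(molecule_name)
    return dict(grouped)


grouped = molecules_by_compound(molecules)
print(grouped)
```

Here the single submitted compound maps to two molecules, one per protonation state.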

hannahbrucemacdonald commented 4 years ago

Thanks! Looks good to me. I think it helps to have this stuff upfront in the json

jchodera commented 4 years ago

Putting this all together, I think the new compound JSON should look like

{
  "schema_version": 0,

  "metadata": {
    "name": "2020-08-20-benzotriazoles",
    "description": "Sprint 3: Prioritization of benzotriazole derivatives",
    "creator": "John D. Chodera",
    "creation_date": "Thu Aug 20 03:25:55 UTC 2020",
    "xchem_project": "Mpro",
    "biological_assembly": "monomer",
    "protein_variant": "thiolate",
    "temperature": "300*kelvin",
    "ionic_strength": "70*millimolar",
    "pH": 7.4
  },

  "compounds": {
    "MAT-POS-f42f3716-1": {
      "compound_id": "MAT-POS-f42f3716-1",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1",
      "is_racemic_mixture": false,
      "has_multiple_protonation_states": true,
      "has_multiple_tautomers": false,
      "experimental_data": {
        "pIC50": 4.324
      }
    },
    "MAT-POS-f42f3716-2": {
      "compound_id": "MAT-POS-f42f3716-2",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(S(C)(=O)=O)cc2Cl)c1",
      "is_racemic_mixture": false,
      "has_multiple_protonation_states": false,
      "has_multiple_tautomers": false,
      "experimental_data": {
        "pIC50": 4.324
      }
    }
  },

  "molecules": {
    "MAT-POS-f42f3716-1-1": {
      "molecule_id": "MAT-POS-f42f3716-1-1",
      "CID": "MAT-POS-f42f3716-1",
      "smiles": "Cc1ccncc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
    },
    "MAT-POS-f42f3716-1-2": {
      "molecule_id": "MAT-POS-f42f3716-1-2",
      "CID": "MAT-POS-f42f3716-1",
      "smiles": "Cc1cc[nH+]cc1NC(=O)Cc1cc(Cl)cc(-c2ccc(C3CC3(F)F)cc2)c1"
    }
  },

  "transformations": {
    "RUN0": {
      "run": "RUN0",
      "initial_molecule": "MAT-POS-f42f3716-1-1",
      "final_molecule": "MAT-POS-f42f3716-1-2",
      "xchem_fragment_id": "x10789"
    },
    "RUN1": {
      "run": "RUN1",
      "initial_molecule": "MAT-POS-f42f3716-1-1",
      "final_molecule": "MAT-POS-f42f3716-1-3",
      "xchem_fragment_id": "x10789"
    }
  }
}
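Since the three sections refer to each other by key, a loader could check those cross-references up front. The sketch below is one possible check, not an agreed design; the function name find_dangling_references is hypothetical, and the data is a trimmed copy of the example above with only the linking fields kept.

```python
def find_dangling_references(series):
    """List references to compounds or molecules that are not defined."""
    problems = []
    compounds = series.get("compounds", {})
    molecules = series.get("molecules", {})
    # Each molecule should point at a known compound.
    for molecule_id, molecule in molecules.items():
        if molecule["CID"] not in compounds:
            problems.append(f"{molecule_id}: unknown compound {molecule['CID']}")
    # Each transformation endpoint should point at a known molecule.
    for run, transformation in series.get("transformations", {}).items():
        for key in ("initial_molecule", "final_molecule"):
            if transformation[key] not in molecules:
                problems.append(f"{run}: unknown {key} {transformation[key]}")
    return problems


series = {
    "compounds": {"MAT-POS-f42f3716-1": {}, "MAT-POS-f42f3716-2": {}},
    "molecules": {
        "MAT-POS-f42f3716-1-1": {"CID": "MAT-POS-f42f3716-1"},
        "MAT-POS-f42f3716-1-2": {"CID": "MAT-POS-f42f3716-1"},
    },
    "transformations": {
        "RUN0": {"initial_molecule": "MAT-POS-f42f3716-1-1",
                 "final_molecule": "MAT-POS-f42f3716-1-2"},
        "RUN1": {"initial_molecule": "MAT-POS-f42f3716-1-1",
                 "final_molecule": "MAT-POS-f42f3716-1-3"},
    },
}

problems = find_dangling_references(series)
print(problems)
```

Applied to the example above, such a check would flag the final_molecule of RUN1, which has no entry under molecules.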

hannahbrucemacdonald commented 4 years ago

Just for clarity, a "compound" is something that we can broadly ask enamine to synthesise, but a "compound" has multiple "molecules" enumerating stereochemistry/charge/tautomers that we need to distinguish explicitly in MM forcefields?

jchodera commented 4 years ago

Just for clarity, a "compound" is something that we can broadly ask enamine to synthesise, but a "compound" has multiple "molecules" enumerating stereochemistry/charge/tautomers that we need to distinguish explicitly in MM forcefields?

Yes!

mcwitt commented 4 years ago

Putting this all together, I think the new compound JSON should look like

This looks great to me!

choderalab / fah-xchem