choderalab / fah-xchem

Tools and infrastructure for automated compound discovery using Folding@home
MIT License

Discussing JSON spec for analyzed sprints #95

Open jchodera opened 3 years ago

jchodera commented 3 years ago

This issue is for discussing what we might want to add to the JSON spec for analyzed sprint data.

The JSON spec is programmatically defined in schema.py using pydantic.

Here's an example of the current sprint 4 JSON:

{
  "as_of": "2020-10-02T01:28:15.276861+00:00",
  "series": {
    "metadata": {
      "name": "2020-09-06-ugi-tBu-x3110-3v3m-2020-04-Jacobs",
      "description": "COVID Moonshot Sprint 4 to prioritize Ugi compounds based on ALP-POS-c59291d4-6 to optimize substituents in P2 pocket",
      "creator": "John Chodera <john.chodera@choderalab.org>",
      "created_at": "2020-09-16",
      "xchem_project": "Mpro",
      "receptor_variant": {
        "catalytic-dyad": "His41(+) Cys145(-)"
      },
      "temperature_kelvin": 300,
      "ionic_strength_millimolar": 70,
      "pH": 7.3,
      "fah_projects": {
        "complex_phase": 13426,
        "solvent_phase": 13427
      }
    },
    "compounds": [
      {
        "metadata": {
          "compound_id": "3v3m-2020-04-Jacobs",
          "smiles": "CC(C)(C)c1ccc(cc1)N([C@H](c2cccnc2)C(=O)NC(C)(C)C)C(=O)c3ccco3",
          "experimental_data": {
            "pIC50": 5.50248779245608
          }
        },
        "microstates": [
          {
            "microstate": {
              "microstate_id": "3v3m-2020-04-Jacobs",
              "free_energy_penalty": {
                "point": 0,
                "stderr": 0
              },
              "smiles": "CC(C)(C)c1ccc(cc1)N([C@H](c2cccnc2)C(=O)NC(C)(C)C)C(=O)c3ccco3"
            },
            "free_energy": {
              "point": -11.865402961234466,
              "stderr": 0.0816827737002715
            },
            "first_pass_free_energy": {
              "point": -4.201355022038555,
              "stderr": 0.006056352520384547
            }
          }
        ],
        "free_energy": {
          "point": -11.865402961234466,
          "stderr": 0.0816827737002715
        }
      },
...
    ],
    "transformations": [
      {
        "transformation": {
          "run_id": 5,
          "xchem_fragment_id": "x3110",
          "initial_microstate": {
            "compound_id": "EN300-7493272",
            "microstate_id": "EN300-7493272_2"
          },
          "final_microstate": {
            "compound_id": "3v3m-2020-04-Jacobs",
            "microstate_id": "3v3m-2020-04-Jacobs"
          }
        },
        "binding_free_energy": {
          "point": 4.440634333978821,
          "stderr": 0.32562849768831825
        },
        "complex_phase": {
          "free_energy": {
            "delta_f": {
              "point": 27.248070651638173,
              "stderr": 0.2702924852312691
            },
            "bar_overlap": 0.11066590018883393,
            "num_work_pairs": 180
          },
          "gens": [
            {
              "gen": 1,
              "works": [
                {
                  "clone": 1,
                  "forward": 34.7782015908238,
                  "reverse": -10.241186412803977
                },
                {
                  "clone": 2,
                  "forward": 44.734220461096854,
                  "reverse": -24.896325524695808
                },
...
              ],
...
            },
            {
              "gen": 2,
              "works": [
                {
                  "clone": 1,
                  "forward": 37.156309587111465,
                  "reverse": -8.767506308081487
                },
                {
                  "clone": 2,
                  "forward": 40.13445953021964,
                  "reverse": -30.576113177923528
                },
...
              ],
              "free_energy": {
                "delta_f": {
                  "point": 31.691335251103947,
                  "stderr": 0.4292940751774469
                },
                "bar_overlap": 0.21842918614395224,
                "num_work_pairs": 30
              }
            }
          ]
        }
      },
...
    ]
  }
}

What additional data should we be storing?

cc: @alphaleegroup @glass-w @jmichel80 @JenkeScheen @ppxasjsm

JenkeScheen commented 3 years ago
jchodera commented 3 years ago

> I would suggest adding some sort of documentation (potentially as docstrings in schema.py?) to clarify the contents (e.g. metadata vs. microdata, gens vs. clones, etc.).

Agreed, that's critical now.

@glass-w: Could you implement this in a PR? You would modify schema.py so that each attribute is assigned a Field(default, description=...) object with the description kwarg specified.

For example, we would change this:

class CompoundMetadata(Model):
    compound_id: str
    smiles: str
    experimental_data: Dict[str, float]

to

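from typing import Dict

from pydantic import Field

class CompoundMetadata(Model):
    # Model is the same base class as in the definition above.
    # Field(...) keeps a field required; description must be passed by keyword.
    compound_id: str = Field(..., description='The unique compound identifier (PostEra or enumerated ID)')
    smiles: str = Field(..., description='The SMILES string defining the compound in a canonical protonation state. Stereochemistry will be ambiguous for racemates.')
    # default_factory makes the field optional, defaulting to an empty dict.
    experimental_data: Dict[str, float] = Field(default_factory=dict, description='Optional experimental data fields, such as "pIC50"')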

More info here.

jchodera commented 3 years ago

> At the risk of not fully comprehending the structure: perhaps it would be useful to supply SMILES for each transformation besides just the compound ID? It might save people the extra step of writing queries to find the SMILES.

@JenkeScheen: This sounds like a reasonable tradeoff for convenience!

What other info would you like for each transformation?
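Until SMILES are written into each transformation, the lookup can be done on the reader's side. The following is only a rough sketch, assuming the field layout of the sprint 4 example above; the helper names `microstate_smiles` and `annotate_transformations` are hypothetical and are not part of fah-xchem. The excerpt contains only one compound, so the initial microstate resolves to `None`.

```python
def microstate_smiles(series: dict) -> dict:
    """Map each microstate_id to its SMILES string."""
    lookup = {}
    for compound in series["compounds"]:
        for entry in compound["microstates"]:
            microstate = entry["microstate"]
            lookup[microstate["microstate_id"]] = microstate["smiles"]
    return lookup


def annotate_transformations(series: dict) -> list:
    """Copy each transformation record, adding a 'smiles' key to both endpoints."""
    lookup = microstate_smiles(series)
    annotated = []
    for item in series["transformations"]:
        record = dict(item["transformation"])
        for key in ("initial_microstate", "final_microstate"):
            endpoint = dict(record[key])
            # None marks a microstate that is absent from the compounds list.
            endpoint["smiles"] = lookup.get(endpoint["microstate_id"])
            record[key] = endpoint
        annotated.append(record)
    return annotated


# Excerpt of the "series" object from the example above (one compound only).
SMILES = "CC(C)(C)c1ccc(cc1)N([C@H](c2cccnc2)C(=O)NC(C)(C)C)C(=O)c3ccco3"
series = {
    "compounds": [
        {
            "metadata": {"compound_id": "3v3m-2020-04-Jacobs", "smiles": SMILES},
            "microstates": [
                {"microstate": {"microstate_id": "3v3m-2020-04-Jacobs", "smiles": SMILES}}
            ],
        }
    ],
    "transformations": [
        {
            "transformation": {
                "run_id": 5,
                "initial_microstate": {
                    "compound_id": "EN300-7493272",
                    "microstate_id": "EN300-7493272_2",
                },
                "final_microstate": {
                    "compound_id": "3v3m-2020-04-Jacobs",
                    "microstate_id": "3v3m-2020-04-Jacobs",
                },
            }
        }
    ],
}

annotated = annotate_transformations(series)
```

Storing the SMILES directly in the JSON would make this join unnecessary, at the cost of some duplication between the compounds and transformations lists.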

JenkeScheen commented 3 years ago

I think the data looks complete. If you could point me to your methods I might be able to pinpoint more points of interest, but at least for my purposes this would do. On a side note, do you have a recommended API for parsing this file? I've been looking for a way to use schema.py with pydantic to load the file, without much luck.
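For readers with the same question, here is a minimal sketch that needs only the standard library, assuming the file follows the sprint 4 example above. The function name `load_binding_free_energies` is hypothetical. The pydantic route is not shown because the name of the top-level model in schema.py does not appear in this thread. In general, pydantic v1 models can be populated with `parse_obj` or `parse_file`, and v2 models with `model_validate_json`.

```python
import json


def load_binding_free_energies(text: str) -> list:
    """Flatten the transformations in an analysis JSON document into plain rows."""
    analysis = json.loads(text)
    rows = []
    for item in analysis["series"]["transformations"]:
        transformation = item["transformation"]
        estimate = item["binding_free_energy"]
        rows.append(
            {
                "run_id": transformation["run_id"],
                "initial": transformation["initial_microstate"]["microstate_id"],
                "final": transformation["final_microstate"]["microstate_id"],
                "point": estimate["point"],
                "stderr": estimate["stderr"],
            }
        )
    return rows


# Excerpt of the example above, trimmed to the keys this sketch reads.
EXAMPLE_JSON = """
{
  "as_of": "2020-10-02T01:28:15.276861+00:00",
  "series": {
    "transformations": [
      {
        "transformation": {
          "run_id": 5,
          "xchem_fragment_id": "x3110",
          "initial_microstate": {
            "compound_id": "EN300-7493272",
            "microstate_id": "EN300-7493272_2"
          },
          "final_microstate": {
            "compound_id": "3v3m-2020-04-Jacobs",
            "microstate_id": "3v3m-2020-04-Jacobs"
          }
        },
        "binding_free_energy": {
          "point": 4.440634333978821,
          "stderr": 0.32562849768831825
        }
      }
    ]
  }
}
"""

rows = load_binding_free_energies(EXAMPLE_JSON)
```

For a file on disk, the same function can be fed the result of reading the file as text.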

jmichel80 commented 3 years ago

I suggest adding to metadata a protocol keyword that cross-references a loosely formatted protocol dictionary. The free energies for the same transformation may be protocol dependent, and it may be useful in the future to analyse this if datasets are processed multiple times with different protocols. The protocol may also be a good place to define how statistical uncertainties are estimated: "stderr" alone doesn't quite define it (e.g. number of replicates, measures for decorrelating samples...).
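To make the suggestion concrete, here is one possible shape for such a cross-reference. Everything in this sketch is hypothetical and not part of the current spec: the "protocol" key in metadata, the top-level "protocols" dictionary, the protocol name, and the fields inside the protocol entry are all placeholders.

```python
def resolve_protocol(analysis: dict) -> dict:
    """Return the protocol entry that the series metadata refers to."""
    name = analysis["series"]["metadata"]["protocol"]
    try:
        return analysis["protocols"][name]
    except KeyError:
        raise KeyError(f"metadata references unknown protocol {name!r}") from None


# Placeholder document: only the cross-reference structure matters here.
analysis = {
    "series": {
        "metadata": {
            "name": "2020-09-06-ugi-tBu-x3110-3v3m-2020-04-Jacobs",
            "protocol": "example-protocol-v1",  # hypothetical keyword
        }
    },
    "protocols": {  # hypothetical, loosely formatted protocol dictionary
        "example-protocol-v1": {
            "uncertainty": {"method": "bootstrap", "num_replicates": 3}
        }
    },
}

protocol = resolve_protocol(analysis)
```

Keeping the protocol entries loosely typed leaves room for new protocols without a schema change, while the lookup still catches a reference that points nowhere.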

It would also be useful to cross-link the experimental data to a particular version maintained by the people making the measurements. Experimental data can change over time, particularly for live projects.
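One way this could look is to carry a source and version next to each measured value. This is a sketch only: the class `ExperimentalValue`, its fields, and the source and version strings are hypothetical placeholders, while the pIC50 value is taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentalValue:
    """A measured value tagged with the dataset release it was taken from."""

    value: float
    source: str   # hypothetical: identifier of the dataset kept by the experimentalists
    version: str  # hypothetical: release tag or date of that dataset


def is_stale(recorded: ExperimentalValue, current_version: str) -> bool:
    """True if the stored value came from a different release than the current one."""
    return recorded.version != current_version


pic50 = ExperimentalValue(
    value=5.50248779245608,
    source="example-assay-dataset",  # placeholder name
    version="v1",                    # placeholder version
)
```

With the version stored alongside the value, an analysis can be flagged for rerunning when the experimental dataset moves on.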

Also, I assume it would be possible to easily work out where the 3D inputs and parameters are for each transformation from the current JSON spec (given access to the full 3D dataset)?