Using OPTIMADE for Biomolecular data

Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases

https://optimade.org/specification

Creative Commons Attribution 4.0 International

77 stars 37 forks source link

Using OPTIMADE for Biomolecular data #389

Open JPBergsma opened 2 years ago

JPBergsma commented 2 years ago

I recently had a discussion with Adam Hospital about making the OPTIMADE standard suitable for biomolecular data.

The most important data to share, that is not standard in Optimade, would be:

The name of the atom. In PDB files, each atom type has a name. These names are not unique, but indicate atoms in similar positions. For example “CA“ for the carbon atom to which the side chains in an amino acid are connected. This for example allows only displaying atoms in the side chains or in the backbone.
Proteins often contain multiple subunits. Each subunit is an individual chain and these chains are in turn build up out of amino acids which in turn are build up of atoms. In the PDB file format, there is a row for each atom which contains: ◦ The kind of amino acid/residue the atom belongs to. ◦ The Residue sequence number/rank of this amino acid in the chain. (the first residue does not need to be residue 1) ◦ The chain it belongs to (Chain ID).

In the PDB files, this information is stored for each atom. There is however some redundancy with this approach. If you know to which residue an atom belongs, you would only need to store the rank of this residue and the chain to which this residue belongs once, instead of storing it for every atom.

Possible implementations:

My idea would be to use the species field for the atoms, just as normal in OPTIMADE.
If we want to include the number of hydrogen atoms attached to an atom, which could be useful to distinguish between single and double bonds, we would need to have more types of species as the number of attached hydrogen atoms is defined on the species level. Instead of writing “CA” for the alpha carbon atom in glycine, we could name the species “CAH2” and for other amino acids it could be “CAH”. In that case we could use the original name subfield to hold the original name: “CA”.

To include the information about the chains, I see two options.

One option would be to add fields with a list of values belonging to each atom. For example, for each atom/site a property called “residue” could specify to what kind of residue the atom belongs, e.g. “Leu” for Leucine or “DA” for the nucleic acid adenosine etc. Similarly, a property called “chainid” can be used to specify to which chain each atom belongs. And finally, a “residue_sequence_number” property to store the number/rank of the residue in the sequence.

This would be very similar to the way it is currently implemented in the PDB files and would be easy to implement because of the similarity with the data structure in the PDB file. The disadvantages would be that there is some redundant data with this approach and that there is no top-down information. You would need to visit the values for all the atoms before you would know the number of chains and the sequence of the residues in a chain. The latter could be mended by adding more top-down information, such as is done with the “SEQRES” field in the PDB file, which gives the sequence of the amino acids/nucleotides in a chain. This would however also increase the amount of data that would need to be stored/sent. In comparison to the total amount of data of a trajectory, it is however not much.

A second option would be to create a tree like structure.

To do this, we could introduce a "groups" property. This would consist of a list of dictionaries. With a separate dictionary for each group. Each group would have an ID field holding a unique ID as a string. (Unique for this particular structure/trajectory, not necessarily for the entire database.)
It would have a “members” field, which holds integers to refer to the atoms/sites and strings to refer to other groups. By giving each group a label field, consisting of a list of strings, it will be easy to select a particular set of groups.

For biomolecular data, we could implement this as follows: On the highest level, we could define a group containing the whole protein complex. Its members would in that case be the individual chains and perhaps also some co-factors. We could give this group the label "protein".

On a lower level we could then define the chains as groups as well. Its members would be the individual residues. In this case the residues must be listed starting from the residue with the free amine group. For DNA and RNA one should start at the 5' end. We could give this group the label "polypeptide" or "DNA" or "RNA".

At the lowest level, we could then define the individual residues. In that case the members would be individual atoms. We could give this group the label "residue". It should furthermore contain information on the residue sequence number/rank and the type of the residue, e.g. which amino acid or nucleic acid. One option to do this would be to add extra fields to these groups to store the type of residue and the residue sequence number/rank.

The advantages of this approach is that a bit less data would need to be sent. And that the data can be approached in a top down way. It would however also be more complex and perhaps less intuitive if you are familiar with the PDB format.

This "groups" feature could also be used to indicate other groups, such as the residues that together form an alpha Helix. And it could also be used outside the biomolecular field to indicate atoms that belong to a certain group. Such as all the atoms belonging to a certain molecule. So even if we do not use it here, it may be worthwhile to create a separate PR for the "groups" feature.

Currently, I am leaning towards the first approach, as this would be the simplest to implement, starting from the PDB file format, and it would also be easy to convert it back to a PDB file. I am however curious what your ideas and feedback on this topic are. In the meantime, I will try to pour these ideas into a form that fits with the current OPTIMADE specification.

CasperWA commented 2 years ago

As a note, I'll mention that the "adapter" implementation in OPTIMADE Python tools implements converting a single OPTIMADE "structures" datum into a PDB file (old style format). There is also the beginnings of creating a modern PDBx (mmCIF), however, this is of course all based on the current information available in the "structures" data model and might miss critical information that cannot currently be expressed by the OPTIMADE data model.

JPBergsma commented 2 years ago

Thank you for pointing this out. I did not know this function existed. If we, however, want to use it for biomolecular data, we will at least need to add the Residue name and the Residue sequence number.

merkys commented 2 years ago

I am starting to think that extending the OPTIMADE Structures representation to adapt it to the whole variety of structures is not attainable (although all my faith lies with the ontology group!). The variety is just too large, specification development is quite slow, and in the end software will have to be developed to ingest the data in OPTIMADE format.

Thus for the time being I would suggest using already existing and well-known file formats. There is PDB and PDBx/mmCIF, and most of the software processing biomolecular data is able to read them. Here I would like to advertise my PR #360 on supporting the Files endpoint, of course :)

JPBergsma commented 2 years ago

Thank you for your comment. Bio-molecules are indeed quite different from the materials that are currently in the OPTIMADE databases. Apart from the chemical formulas, most of the OPTIMADE fields are however also relevant for this research field.
I think that in the future we will have to specify a lot more OPTIMADE fields, so searching materials with a certain set of properties across all the databases becomes a lot easier. In that case, having a few extra properties for the bio-molecular field should not be such a problem.

The PDB and PDBx/mmCIF formats you mention do not support trajectories. In practice, visualization tools often allow you to sequentially read many PDB files. So you can construct a trajectory in that way, but this gives quite some extra overhead as data is duplicated between the files. Some simulation programs (Amber and Gromacs) do have data formats for trajectories, but I do not know how broad the support for these formats is.

Finally, as far as I understood, the project proposal on which I was hired also mentions that we are going to make OPTIMADE suitable for bio-molecules and trajectories. So I think we have to fulfil this promise.

merkys commented 2 years ago

Thank you for your comment. Bio-molecules are indeed quite different from the materials that are currently in the OPTIMADE databases. Apart from the chemical formulas, most of the OPTIMADE fields are however also relevant for this research field. I think that in the future we will have to specify a lot more OPTIMADE fields, so searching materials with a certain set of properties across all the databases becomes a lot easier. In that case, having a few extra properties for the bio-molecular field should not be such a problem.

I agree that a few extra properties will not hurt, and your suggestion for hierarchical groups of sites might as well be reused for other types of structures. I am a bit worried though about the introduction of properties that make sense only for a certain group of structures, although they have quite generic names. But this could probably be solved by the ontology group. Another worry is that OPTIMADE ontology for biopolymers diverges from the PDB. I am already dealing with aligning CIF ontology with OPTIMADE structure properties (see #342 for example), and this is tedious.

The PDB and PDBx/mmCIF formats you mention do not support trajectories. In practice, visualization tools often allow you to sequentially read many PDB files. So you can construct a trajectory in that way, but this gives quite some extra overhead as data is duplicated between the files. Some simulation programs (Amber and Gromacs) do have data formats for trajectories, but I do not know how broad the support for these formats is.

I do not expect the overhead of expressing trajectories in PDB to be that large. Coordinates of both atoms and heteroatoms, cell parameters and symmetry are subject to change in trajectories. Thus the only constant things are the sequences (not even secondary, as these may change) and bibliography/other annotations. Or am I missing something?

Finally, as far as I understood, the project proposal on which I was hired also mentions that we are going to make OPTIMADE suitable for bio-molecules and trajectories. So I think we have to fulfil this promise.

Sure.

JPBergsma commented 2 years ago

I think the first method I describe is very close to how things are done in PDB files. So I do not foresee problems with implementing it in that way. We would have to preprocess the PDB files, so we would not need to extract the data from them every time a request is made, but instead can use a method that fits with the database backend. We could in that case also generate data for the fields that are currently in OPTIMADE. Most of the OPTIMADE fields are on the should level, so if it is really not possible to extract the information, we could exclude that information. The second method would require more preprocessing, but all the necessary information is in the PDB files.

The idea is that this will be used in combination with the new trajectories endpoint. (I just noted that I did not mention this in the first post. So I can imagine you were a bit puzzled by my proposal.) In a PDB file for each atom at each frame/snapshot you include the label (ATOM) and the information about: the atom number, the name of the atom, the residue it belongs to, the chain it belongs to and the position of the residue in the chain. (it is a fixed width format so not including those fields won't help.) Even if you can leave the occupancy, element, bfactor, segment identifier and charge fields empty, and there is no other repeated(meta)data, you would save at least a factor 2 in the amount of data by using the OPTIMADE format.

d-beltran commented 2 years ago

Hi everyone,

My name is Dani and I am the technician from Adam Hospital's group who is going to implement the OPTIMADE specifications in a biomolecules database (BioExcel-CV19). I have read all specifications, the discussion and the two options on how to include biomolecular data. The option I like the most is the second, but there are a few things I would like to discuss.

I like the 'groups' idea because this way we could emulate the two main atom classifiers in biomolecule data: residues and chains. There is no need to create a group to define what is a protein or a nucleic acid since it is rarely defined explicitly in the structure files I know. Many tools just deduct this from residue names. Here is an example of how I would define a protein with the current system:

{
    "group_type":[
        {
            "name": "A",
        },
        {
            "name": "B",
        },
        {
            "name": "PHE",
        },
        {
            "name": "ASP",
        },
        (the rest of aminoacids)
    ],
    "groups":[
        {
            "group_id": "A-PHE-17",
            "type": "PHE",
            "residue_sequence_number": 17,
            "insertion_code": null,
            "sites":[0,1,2,3, ...]
        },
        {
            "group_id": "A-ASP-18",
            "type": "ASP",
            "residue_sequence_number": 18,
            "insertion_code": null,
            "sites":[17,18,19,20, ...]
        },
        {
            "group_id": "A-LEU-18-A",
            "type": "LEU",
            "residue_sequence_number": 18,
            "insertion_code": "A",
            "sites":[29,30,31, ...]
        },
        (the rest of residues),
        {
            "group_id": "Chain A",
            "type": "A",
            "sub_groups": ["A-PHE-17", "A-ASP-18", "A-LEU-18-A", ...],
        },
        {
            "group_id": "Chain B",
            "type": "B",
            "sub_groups": ["B-MET-7", "B-ILE-8", ...],
        }
    ]
}

This could work for us and, as Johan stated, it seems flexible enough to cover further implementations for other molecular fields which may require some kind of grouping. However, if you are willing to set field-specific properties then I have some suggestions. I understand adding properties with generic names when only specific groups of molecules are going to use it is not optimal so I suggest using molecular-field prefixes (e.g. 'biomolecule_').

In our case, the group types would be redundant. Chain/Residue formulas or mass are not usually needed and, if we need them, we can deduct/calculate them from the client. The subgroups property in the group types makes not sense for our case either. At the end, the only thing we need in order to identify chains/residues is a name.

Also in our case, setting groups for the chains could be redundant. I would just add a 'chain' field inside the residue groups. Chain has to be inside the group anyway since we may have two residues with the same name, number and icode but different chain. Note that chain is implicit in the group_id property in my previous example. Then getting the residues for each chain in the client should be comfortable enough using a filter. In addition, if there is no chains group we can also get rid of residue ids.

To sum up, I would set a single field and I would call it 'biomolecule_residues':


biomolecule_residues
~~~~~~

- **Description**: A particular residue
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`chain`: string or null (REQUIRED)
   - :property:`name`: string (REQUIRED)
   - :property:`number`: integer (REQUIRED)
   - :property:`insertion_code`: string or null (REQUIRED)
   - :property:`sites`: list of integers (REQUIRED)
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **chain**: The chain letter. It MUST NOT be longer than 1 character. It MAY be null.
   - **name**: The residue name. It SHOULD NOT be longer than 3 characters. It MUST NOT be longer than 4 characters.
   - **number**: The residue number according to source notation.
   - **insertion_code**: The residue insertion code. It MUST NOT be longer than 1 character. It MAY be null.
   - **sites**: A list of integers referring to the index of :field:`cartesian_site_positions`, that belong to this residue.
     The list SHOULD NOT be empty. The index of the first site is 0.
   - There SHOULD NOT be two or more residues with the same :property:`chain`, :property:`residue` and :property:`insertion_code`.
   - There MUST NOT be two or more residues with the same integer in :property:`sites`.
   - All :property:`chain`, :property:`name` and :property:`insertion_code` values SHOULD be in capital letters.

- **Examples**:

.. code:: jsonc
  {
    "biomolecule_residues":[
      {
        "chain": "A",
        "name": "PHE",
    "number": 17,
    "insertion_code": null,
        "sites":[0,1,2,3, ...]
      },
      {
        "chain": "A",
        "name": "ASP",
    "number": 18,
    "insertion_code": null,
        "sites":[17,18,19,20, ...]
      },
      {
        "chain": "A",
        "name": "LEU",
    "number": 18,
        "insertion_code": "A",
        "sites":[29,30,31, ...]
      },
    ]
  }

What do you think?

I also agree with the sequence implementation. It is not a MUST but it is a SHOULD. It would be useful for queries. I think it should be two fields:

biomolecule_sequences: a list of strings containing sequences. It SHOULD be capital letters.
biomolecule_sequence_types: a list of strings containing the type for each sequence in the field above (i.e. 'aminoacids' or 'nucleotides').

About the trajectories specification I have not a clear answer. There are many things to take in count. PDB format is not efficient because it stores all atoms data for each frame as previously stated. Many trajectory formats rely on the restriction that atoms must be the same along frames. Then they only store coordinates. I have no idea of how trajectories are in the materials field. Maybe we should meet to discuss this part.

Looking forward to read your feedback :)

JPBergsma commented 2 years ago

Dear Dani,

Thank you for your remarks,

Your implementation of the groups and group_type fields(as described in PR#396) in your example looks good to me.

The idea of field/domain specific property names has come up before and I am personally not against this. I can bring it up during the next Optimade meeting to see how others think about this, as it is a design choice that goes beyond just the fields for the biomolecular data. Perhaps we can also use a shorter prefix like bio or biomol.

In our case, the group types would be redundant. Chain/Residue formulas or mass are not usually needed and, if we need them, we can deduct/calculate them from the client. The subgroups property in the group types makes not sense for our case either. At the end, the only thing we need in order to identify chains/residues is a name.

I mostly placed the mass and chemical formula fields within the group_type to show you can store properties that are common to all groups of that type here. That way this information does not need to be repeated for each individual group. The mass and chemical formula can be reproduced from the information in the species and sites and the three letter codes in PDB files also define what kind of group it is, so for you the group_type property may be redundant.

Whether chain has to be inside the residue depends on how you approach the data. With a top-down approach, you would first identify the chains and subsequently go over the residues in each chain, so within the residue a chain field would not be needed. With a bottom up approach, you do have to specify to which chain a residue belongs.

Your proposal seems to be intermediate between the methods I suggested. It is certainly possible to do it in that way. I still have a few questions about it though:

Would you also use the residue field to describe other molecules in the structure? If that is the case, your requirements for the name field seems rather restrictive. There are far more possible molecules than 3 or 4-letter codes. So I think we should allow for longer strings for molecules that do not have a 3 or 4-letter code.
I think we should allow the chain identifier to be more than 1 character as there could be more than 26 molecules and/or chains in a structure, and it would be nice if we could label them as well. (We do not have to reproduce the limitations of the PDB format.)

Originally, I had the idea to store the sequences and what kind of sequence it is in the “group_type” property. And as a group that contains a chain would have a type property which refers to this group_type, it should be relatively easy to find chains with a particular sequence. I was therefore wondering whether the sequences in your proposal are ordered. So it is easier to determine which chain belongs to which sequence.

Our trajectory proposal(PR#377) is very flexible. It allows you to set a property as constant, meaning that it has a constant value throughout the trajectory, so the data does not have to be repeated for each frame. It however also allows redefining properties halfway through a trajectory. So if the number of particles is not constant, you can redefine the species list.

I will ask some other persons involved with OPTIMADE to give their opinion as well.

d-beltran commented 2 years ago

Hi Johan,

Perhaps we can also use a shorter prefix like bio or biomol.

It looks even better to me.

I mostly placed the mass and chemical formula fields within the group_type to show you can store properties that are common to all groups of that type here. That way this information does not need to be repeated for each individual group.

To my knowledge, chains have no common properties. They are just labels. However it is true this may change in the future so it is a good idea having chains defined as independent groups.

Whether chain has to be inside the residue depends on how you approach the data. With a top-down approach, you would first identify the chains and subsequently go over the residues in each chain, so within the residue a chain field would not be needed. With a bottom up approach, you do have to specify to which chain a residue belongs.

I did not have in mind the top-down structure and I agree it may be handy. In that case, I would create another field (e.g. biomol_chains). Every chain would have a list with references to its residues and here I suggest using indices instead of ids. If residues and chains are in separated lists then using indices is more clean. Using ids would result in more data and having to include the chain in every residue id somehow as I stated before, which would result in redundant implicit data. In addition, at the client side, finding elements in an array by their indices is faster than finding elements which match a property value.

Would you also use the residue field to describe other molecules in the structure? If that is the case, your requirements for the name field seems rather restrictive. There are far more possible molecules than 3 or 4-letter codes. So I think we should allow for longer strings for molecules that do not have a 3 or 4-letter code.

I think we should allow the chain identifier to be more than 1 character as there could be more than 26 molecules and/or chains in a structure, and it would be nice if we could label them as well. (We do not have to reproduce the limitations of the PDB format.)

I totally agree. It will be difficult to export to PDB a structure with wild chain/residue names but we will find out how to handle it.

Originally, I had the idea to store the sequences and what kind of sequence it is in the “group_type” property. And as a group that contains a chain would have a type property which refers to this group_type, it should be relatively easy to find chains with a particular sequence. I was therefore wondering whether the sequences in your proposal are ordered. So it is easier to determine which chain belongs to which sequence.

It may be done this way either. However, an author may tag a single protein with several chains or even tag a protein and a nucleic acid with the same chain since there is total freedom when setting chains. This should not happen but it may. In my proposal there was total freedom when setting sequences as well so each server may implement a different solution (e.g. sequences by fragments of bonded atoms).

Our trajectory proposal(PR#377) is very flexible. It allows you to set a property as constant, meaning that it has a constant value throughout the trajectory, so the data does not have to be repeated for each frame. It however also allows redefining properties halfway through a trajectory. So if the number of particles is not constant, you can redefine the species list.

Thanks for pointing me to this discussion. I missed it and now I have read it. The constant/explicit system is brilliant. It seems flexible enough to cover many different trajectory formats. I am just concerned about one thing I missed in the discussion. Have you considered, somehow, the possibility of sending at least coordinates data in binary format? This could widely reduce the size of data we are sending and I think efficiency is important in this point, especially when dealing with big trajectories. I am afraid JSON is not compatible with binary. If this has not been discussed I can copy this last question in PR#377.

So, to sum up previous changes, the field-specific implementation could be like this:

biomol_chains
~~~~~~

- **Description**: A particular chain
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`name`: string (REQUIRED)
   - :property:`residues`: list of integers (REQUIRED)
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **name**: The chain name/letter
   - **residues**: A list of integers referring to the index of :field:`biomol_residues`, that belong to this chain.
     The list SHOULD NOT be empty. The index of the first residue is 0.
   - There SHOULD NOT be two or more chains with the same :property:`name`.
   - Values in :property:`name` SHOULD be in capital letters.

- **Examples**:

.. code:: jsonc
  {
    "biomol_chains":[
      {
        "name": "A",
        "residues":[0,1,2,3, ...]
      },
      {
        "name": "B",
        "residues":[54,55,56,57, ...]
      },
    ]
  }

biomol_residues
~~~~~~

- **Description**: A particular residue
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`name`: string (REQUIRED)
   - :property:`number`: integer (REQUIRED)
   - :property:`insertion_code`: string or null (REQUIRED)
   - :property:`sites`: list of integers (REQUIRED)
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **name**: The residue name
   - **number**: The residue number according to source notation.
   - **insertion_code**: The residue insertion code. It MUST NOT be longer than 1 character. It MAY be null.
   - **sites**: A list of integers referring to the index of :field:`cartesian_site_positions`, that belong to this residue.
     The list SHOULD NOT be empty. The index of the first site is 0.
   - There MUST NOT be two or more residues with the same integer in :property:`sites`.
   - All :property:`name` and :property:`insertion_code` values SHOULD be in capital letters.

- **Examples**:

.. code:: jsonc
  {
    "biomol_residues":[
      {
        "name": "PHE",
    "number": 17,
    "insertion_code": null,
        "sites":[0,1,2,3, ...]
      },
      {
        "name": "ASP",
    "number": 18,
    "insertion_code": null,
        "sites":[17,18,19,20, ...]
      },
      {
        "name": "LEU",
    "number": 18,
        "insertion_code": "A",
        "sites":[29,30,31, ...]
      },
    ]
  }

merkys commented 2 years ago

I think limiting chain name to a single character is too restrictive. However, if the plan is to stay compatible with the PDB format, then it makes sense.

To my knowledge, chains have no common properties. They are just labels.

Not sure if I agree. In PDB, for example, not all chains are protein chains (see 328D for example). Thus at least telling whether a chain is protein/nucleic acid/something else would be beneficial.

JPBergsma commented 2 years ago

Dear Dani,

To my knowledge, chains have no common properties. They are just labels. However, it is true this may change in the future so it is a good idea having chains defined as independent groups.

There are some fields in the PDB format that can contain data about the Chains. The COMPND field can contain data belonging to the Molecules(Incl. chains) such as the sequence of the DNA bases or the name of a chain/molecule. The DBREF field can contain reference to other structures with the same chain. It would also seem logical to me to associate the sequences of the SEQRES field with one of the chains. So I do think chains can have common properties.

I did not have in mind the top-down structure and I agree it may be handy. In that case, I would create another field (e.g. biomol_chains). Every chain would have a list with references to its residues and here I suggest using indices instead of ids. If residues and chains are in separated lists then using indices is more clean. Using ids would result in more data and having to include the chain in every residue id somehow as I stated before, which would result in redundant implicit data. In addition, at the client side, finding elements in an array by their indices is faster than finding elements which match a property value.

We can indeed use the indexes of the residues instead of the IDs. In that case, it should always be clear what kind of thing(atom/chain/residue etc.) you are referring to.

I totally agree. It will be difficult to export to PDB a structure with wild chain/residue names but we will find out how to handle it.

:thumbsup:

It may be done this way either. However, an author may tag a single protein with several chains or even tag a protein and a nucleic acid with the same chain since there is total freedom when setting chains. This should not happen but it may. In my proposal there was total freedom when setting sequences as well so each server may implement a different solution (e.g. sequences by fragments of bonded atoms).

If I understand it correctly, you mean that a single chain can be labelled by two different chain IDs? E.g., the first half of the polymer would have ID “A” and the second half “B” ?

Thanks for pointing me to this discussion. I missed it and now I have read it. The constant/explicit system is brilliant. It seems flexible enough to cover many different trajectory formats. I am just concerned about one thing I missed in the discussion. Have you considered, somehow, the possibility of sending at least coordinates data in binary format? This could widely reduce the size of data we are sending and I think efficiency is important in this point, especially when dealing with big trajectories. I am afraid JSON is not compatible with binary. If this has not been discussed I can copy this last question in PR#377.

I do not think we have discussed this, So you are welcome to add this to the discussion. I have been pondering about this myself as well. Storing the numbers in a binary format could reduce the amount of data by a factor 2 to 3. The Optimade standard does allow you to specify additional formats for the returned data, besides JSON. So you could, for example, support the BSON format, which does support binary data and is otherwise still quite similar to the JSON format, so it should not be too much work to implement it.

Regarding your proposed Biomol_chain field:

It would be nice if you could add a bit more of an explanation of what the biomol_chains field is for.
Do you want to restrict the chain name more, such that if the number of chains is less or equal to the number of letters in the alphabet it has to be one letter.
Do we still want to have the sequence of the chain in here as well to allow easier searching?

Regarding your proposed Biomol_residue field:
”number” sounds rather general to me, I think it would be better to be a bit more specific. Like “residue_number”.

One subject we have not discussed yet is how people would query the data. Which fields would need to be present, so they can find the trajectories they are looking for?

(edited to remove spelling mistakes)

d-beltran commented 2 years ago

Hi Andrius,

I think limiting chain name to a single character is too restrictive. However, if the plan is to stay compatible with the PDB format, then it makes sense.

I agreed with no limiting chain name to a single character.

Not sure if I agree. In PDB, for example, not all chains are protein chains (see 328D for example). Thus at least telling whether a chain is protein/nucleic acid/something else would be beneficial.

Structures in the PDB are usually curated and it may be difficult to find an example of wrong chains but I have dealt with several wild PDBs and a chain can be pretty much anything, even protein, nucleic acid and something else at the same time. Some authors simply do not care about chains so they just tag everything with the same chain.

I agree it would be beneficial having some field to tell what a chain is but if we are going to add it then I think we must be very flexible and this field should be optional.

Johan,

There are some fields in the PDB format that can contain data about the Chains. The COMPND field can contain data belonging to the Molecules(Incl. chains) such as the sequence of the DNA bases or the name of a cahin/molecule. The DBREF field can contain reference to other structures with the same chain. It would also seem logical to me top associate the sequences of the SEQRES field with ine of the chains. So I do think chains can have common properties

To be honest I totally forgot about PDB remarks. Most tools I know just ignore these remarks since they only care about the atoms part. So yes, if we want to be fully compatible with PDB we have to support chain-specific metadata. The 'biomol_chains' field is a must then. And if there is also molecule-specific metadata we may think about a new 'biomol_molecules' field. I must read PDB remarks documentation before further discussion.

If I understand it correctly, you mean that a single chain can be labelled by two different chain IDs? E.g., the first half of the polymer would have ID “A” and the second half “B” ?

I've seen things you people wouldn't believe.

The Optimade standard does allow you to specify additional formats for the returned data, besides JSON. So you could, for example, support the BSON format, which does support binary data and is otherwise still quite similar to the JSON format, so it should not be too much work to implement it.

Perfect! Actually the database software we are using (Mongo) uses BSON internally so if this format is supported we can make a very efficient implementation.

It would be nice if you could add a bit more of an explanation of what the biomol_chains field is for.

Alright

Do you want to restrict the chain name more such that if the number of chains is less or equal to the number of letters in the alphabet it has to be one letter.

Good idea

Do we still want to have the sequence of the chain inhere is well to allow easier searching?

Okey

”number” sounds rather general to me, I think it would be better to be a bit more specific. Like “residue_number”.

In my opinion the residue prefix would be redundant here since we are inside a field called '_residues' already. If you prefer adding the prefix it is okey for me anyway, but maybe then we should rename all properties to be coherent: 'residue_name', 'residue_insertion_code', etc.

One subject we have not discussed yet is how people would query the data. Which fields would need to be present, so they can find the trajectories they are looking for.

I guess the most interesting fields are database-specific fields (e.g. trajectory name/id, description, authors, etc.). Structural interesting fields, in addition to sequences, would be atom count and not much more. Trajectory interesting fields would be length (both in time and snapshots count) or the timestep. All this is already considered in your current implementation if I am not wrong.

So, the biomol implementation would look like this:

biomol_chains
~~~~~~

- **Description**: For each chain in the system there is a dictionary that describes this chain. Chains are groups of related residues (e.g. a polymer).
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`name`: string (REQUIRED)
   - :property:`residues`: list of integers (REQUIRED)
   - :property:`types`: list of strings
   - :property:`sequences`: list of strings
   - :property:`sequence_types`: list of strings
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **name**: The chain name/letter.
   - **residues**: A list of integers referring to the index of :field:`biomol_residues`, that belong to this chain.
     The list SHOULD NOT be empty. The index of the first residue is 0.
   - **types**: A list of tags specifying the type of molecules this chain contains.
   - **sequences**: A list of residue sequences in current chain.
   - **sequence_types**: A list of tags specifying the type of residue for each sequence.
   - There SHOULD NOT be two or more chains with the same :property:`name`.
   - Values in :property:`name` SHOULD be in capital letters.
   - Values in :property:`name` SHOULD NOT be longer than 1 character when the number of chains is not greater than the number of letters in English alphabet (26).
   - Values in :property:`sequences` SHOULD be in capital letters.
   - Number of values in :property:`sequences` and :property:`sequence_types` MUST match.

- **Examples**:

.. code:: jsonc
  {
    "biomol_chains":[
      {
        "name": "A",
        "residues":[0,1,2,3, ...],
        "types": ['protein', 'ions'],
    "sequences": ['MSHHWGYG'],
    "sequence_types": ['aminoacids']
      },
      {
        "name": "B",
        "residues":[54,55,56,57, ...],
        "types": ['nucleic acid'],
    "sequences": ['GATTACA'],
    "sequence_types": ['nucleotides']
      },
    ]
  }

biomol_residues
~~~~~~

- **Description**: For each residue in the system there is a dictionary that describes this residue. Residues are groups of related atoms (e.g. an aminoacid).
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`name`: string (REQUIRED)
   - :property:`number`: integer (REQUIRED)
   - :property:`insertion_code`: string or null (REQUIRED)
   - :property:`sites`: list of integers (REQUIRED)
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **name**: The residue name
   - **number/residue_number**: The residue number according to source notation.
   - **insertion_code**: The residue insertion code. It MUST NOT be longer than 1 character. It MAY be null.
   - **sites**: A list of integers referring to the index of :field:`cartesian_site_positions`, that belong to this residue.
     The list SHOULD NOT be empty. The index of the first site is 0.
   - There MUST NOT be two or more residues with the same integer in :property:`sites`.
   - All :property:`name` and :property:`insertion_code` values SHOULD be in capital letters.

- **Examples**:

.. code:: jsonc
  {
    "biomol_residues":[
      {
        "name": "PHE",
    "number": 17,
    "insertion_code": null,
        "sites":[0,1,2,3, ...]
      },
      {
        "name": "ASP",
    "number": 18,
    "insertion_code": null,
        "sites":[17,18,19,20, ...]
      },
      {
        "name": "LEU",
    "number": 18,
        "insertion_code": "A",
        "sites":[29,30,31, ...]
      },
    ]
  }

Now, I would like to open one last discussion. Adam asked me if we were planning to add topology data. I do not know how topologies work in the materials field or even if you have things called topologies so excuse me if I explain something obvious. In our field topologies are files which include metadata (e.g. force field) and structural data (e.g. dihedrals) required to run MD simulations. Is Optimade meant to support topologies? Or just structure data?

JPBergsma commented 2 years ago

In my opinion the residue prefix would be redundant here since we are inside a field called '_residues' already. If you prefer adding the prefix it is okey for me anyway, but maybe then we should rename all properties to be coherent: 'residue_name', 'residue_insertion_code', etc.

Ok, fair enough.

I guess the most interesting fields are database-specific fields (e.g. trajectory name/id, description, authors, etc.). Structural interesting fields, in addition to sequences, would be atom count and not much more. Trajectory interesting fields would be length (both in time and snapshots count) or the timestep. All this is already considered in your current implementation if I am not wrong.

There is already an nsites field. In contrast to atom_count, this, however, does include the atoms of water molecules. In some cases the sites of the hydrogen atoms are not specified, so it could also be less than the number of atoms.

We could add an atom_count field, in that case it would however require a clear definition of which atoms should be counted and which not. E.g., would atoms of another solvent than water count, or the phospholipids in case of membranes or dissolved salt ions ?

I do not think I explicitly mentioned the duration of the trajectory(in units of time), but it should be easy to add. Over time, we will probably add a lot more properties anyway.

About your proposed biomol_chain:

I am not sure, I understand what the sequence_types field is. Could you elaborate the explanation a bit more?

On second thought (I know I am counter arguing myself.) perhaps it is not so important that the chain name is just a single character. It should be easy to convert the names to letters anyway.

I noticed you have multiple values in the types field(['protein', 'ions']). How are these related to the chain? Are they simply a label or do you also want to have a standard meaning for each value. Does the ions field mean that the chain is charged or are there also atoms that are not part of the polypeptide marked as being part of this chain?

Now, I would like to open one last discussion. Adam asked me if we were planning to add topology data. I do not know how topologies work in the materials field or even if you have things called topologies so excuse me if I explain something obvious. In our field topologies are files which include metadata (e.g. force field) and structural data (e.g. dihedrals) required to run MD simulations. Is Optimade meant to support topologies? Or just structure data?

Unfortunately, at the moment Optimade has only standardized the sharing of data about the structures. The part of your topology() data that applies to the structure (regions that are helixes/ beta sheets etc.) should also be shared at the structures endpoint.

Some of the other databases share data that was calculated based on the structure but are not structural properties themselves at the calculations endpoint(for example, the spectra of GFP). They also sometimes share information about how the simulation/measurement was done(force fields, packages etc.) at this calculations endpoint. So for now you could do the same. At the moment the Optimade standard only mentions that there is a calculations endpoint but does not give any more in formation about it, which is highly unsatisfying. I hope that this will be changed in the future.

If you feel happy about the proposed bio-fields you can make a draft PR. If you prefer I can also do it for you. On the 25th of February at 5pm CET there is an Optimade meeting. If you like we could discuss your proposal there with some more people.

d-beltran commented 2 years ago

We could add an atom_count field, in that case it would however require a clear definition of which atoms should be counted and which not. E.g., would atoms of another solvent than water count, or the phospholipids in case of membranes or dissolved salt ions ?

Atom count for me is useful to know how big is the structure so you can predict if you can handle it. For this reason I think atom count should refer to all explicit atoms in the system. If hydrogens are missing then they should not count. There could be several more specific atom counts (protein atom count, nucleic acid atom count, etc.) but I would let these fields to be implemented as database-specific fields since there are many interesting possibilities. Now I was wondering how to handle trajectories with non-constant atom count. Maybe atom count should refer to the maximum atom count along frames?

I am not sure, I understand what the sequence_types field is. Could you elaborate the explanation a bit more?

The main idea of the sequence types is to specify the kind of every sequence in the sequences field. This is useful to avoid ambiguous situations where it is not clear what a sequence is. For instance, if we have a short sequence with only 'A', 'G', 'C' and 'T' letters then it could be DNA or a small peptide since these letters may refer to both nucleotides or aminoacids.

- **sequence_types**: A list of tags specifying the type of each sequence in the :property:`sequences` field. The type of a sequence is defined by its components (e.g. 'aminoacids').

On second thought (I know I am counter arguing myself.) perhaps it is not so important that the chain name is just a single character. It should be easy to convert the names to letters anyway.

It is not so important, but it could be a SHOULD.

The way I would handle this is: 1) If we have chain names with unique initial letters then we use these letters (e.g. a structure with 2 chains called 'potato' and 'tomato' would be exported to PDB with chain names 'T' and 'P') 2) If we have chain names with repeated initial letters and the number of chains is lower or equal to the number of letters in the alphabet then we tag chains alphabetically (e.g. a structure with 3 chains called 'potato', tomato' and 'pickle' would be exported to PDB with chain names 'A', 'B' and 'C') 3) If the number of chains is greater than the number of letters in the alphabet then we tag everything as 'X'.

This way it makes sense to recommend the single letter implementation since single letters would be respected.

I noticed you have multiple values in the types field(['protein', 'ions']). How are these related to the chain? Are they simply a label or do you also want to have a standard meaning for each value. Does the ions field mean that the chain is charged or are there also atoms that are not part of the polypeptide marked as being part of this chain?

Some proteins have ions integrated in their structure. They are not covalently bonded but they are stable and usually critical for the protein structure/function. Here you have an example of a protein (ACE2) which has 2 ions (chlorine and zinc). When this happens, ions are usually tagged with the same chain that the protein.

I would let types as just optional labels. It is true making standard labels would be even more beneficial but I think this means more work since we would have to rewrite the standard every time someone needs a new label we did not think about. Residue names and atom elements should be enough for any application to understand what every chain contains exactly.

They also sometimes share information about how the simulation/measurement was done(force fields, packages etc.) at this calculations endpoint. So for now you could do the same.

Alright

If you feel happy about the proposed bio-fields you can make a draft PR. If you prefer I can also do it for you.

Great, I'll do it soon.

On the 25th of February at 5pm CET there is an Optimade meeting. If you like we could discuss your proposal there with some more people.

Sure, I'll join the meeting.

By the way, I have been reading the PDB remarks documentation. There are many things and, in our case, we are not going to use them. We could leave most of this metadata as optional database-specific fields in case someone needs this. About the molecule-specific metadata it is clear that we should add some sort of molecule grouping and Adam agrees with this. However, in our case, I do not have any use for this so the only thing I could suggest, if we follow the top-down structure, is a list of objects with a name and a list of chain indices. Even this idea does not suit me since, as I stated in previous comments, a chain may be anything, even several "molecules" according to the classic meaning of this word.

Residues and chains are widely used so it makes all the sense having these fields. Molecules, in the other hand, are not the most standard classifiers. Many tools have something called "molecules" but they may be different things in different tools. I am not sure if it makes sense having another biomol-specific grouping field. Maybe we could use your 'groups' and 'group_types' implementation to handle these cases when someone needs an extra grouping layer. Or we could even skip this part now and wait for someone who needs this to implement it.

JPBergsma commented 2 years ago

Atom count for me is useful to know how big is the structure so you can predict if you can handle it. For this reason I think atom count should refer to all explicit atoms in the system. If hydrogens are missing then they should not count. There could be several more specific atom counts (protein atom count, nucleic acid atom count, etc.) but I would let these fields to be implemented as database-specific fields since there are many interesting possibilities. Now I was wondering how to handle trajectories with non-constant atom count. Maybe atom count should refer to the maximum atom count along frames?

Would the nsites field be sufficient for this? It is the number of sites in the cartesian_site_positions. It is possible that a site has only a fractional occupation, (e.g., an atom could be at two different positions within the crystal structure, so there would, for example, be 0.5 atom in each site), but in general it should be the same as the number of atoms as you described it.

There are two ways in which you could handle a changing number of atoms.
The first would be to redefine nsites, cartesian_site_positions and species_at_sites for each frame where the number of atoms changes. This could however make it more difficult to follow the path of an atom through the simulation as the index of a particular atom changes from frame to frame.

The second method would be to set nsites to the total number of atoms that occur in the trajectory. This may be higher than the maximum number of atoms in a single frame, as it is the sum of the maximum number of atoms/sites of each species within any one of the frames. In case an atom is not in the simulation volume, you could set its cartesian_site_position to None/null. I think this may be the preferred method as it does allow tracking the particles. You would also not have to reset species_at_sites every time the number of atoms changes. So it probably would not increase the amount of sent data.

The main idea of the sequence types is to specify the kind of every sequence in the sequences field. This is useful to avoid ambiguous situations where it is not clear what a sequence is. For instance, if we have a short sequence with only 'A', 'G', 'C' and 'T' letters then it could be DNA or a small peptide since these letters may refer to both nucleotides or aminoacids. - **sequence_types**: A list of tags specifying the type of each sequence in the :property:sequencesfield. The type of a sequence is defined by its components (e.g. 'aminoacids').

Thanks,that makes it much clearer for me.

The way I would handle this is:

If we have chain names with unique initial letters then we use these letters (e.g. a structure with 2 chains called 'potato' and 'tomato' would be exported to PDB with chain names 'T' and 'P')

If we have chain names with repeated initial letters and the number of chains is lower or equal to the number of letters in the alphabet then we tag chains alphabetically (e.g. a structure with 3 chains called 'potato', tomato' and 'pickle' would be exported to PDB with chain names 'A', 'B' and 'C')

If the number of chains is greater than the number of letters in the alphabet then we tag everything as 'X'. This way it makes sense to recommend the single letter implementation since single letters would be respected.

Using the first letter of the chain names would indeed be nice. I assume that if there are more chains than letters in the alphabet chains with single letters would keep the letter?

Some proteins have ions integrated in their structure. They are not covalently bonded but they are stable and usually critical for the protein structure/function. Here you have an example of a protein (ACE2) which has 2 ions (chlorine and zinc). When this happens, ions are usually tagged with the same chain that the protein. I would let types as just optional labels. It is true making standard labels would be even more beneficial but I think this means more work since we would have to rewrite the standard every time someone needs a new label we did not think about. Residue names and atom elements should be enough for any application to understand what every chain contains exactly.

Thanks for explaining this. I think you could also explain the meaning of the labels at the info endpoint.

Great, I'll do it soon.

I see you just posted something. I will have a look at it. :thumbsup:

Sure, I'll join the meeting.

:+1: up: You can find the details about when and where the meeting takes place here: https://www.optimade.org/ We use JITSI https://meet.jit.si/ and the room code is “OPTIMADE”.

By the way, I have been reading the PDB remarks documentation. There are many things and, in our case, we are not going to use them. We could leave most of this metadata as optional database-specific fields in case someone needs this.

Yes, that seems a good enough solution for now.

About the molecule-specific metadata, it is clear that we should add some sort of molecule grouping, and Adam agrees with this. However, in our case, I do not have any use for this so the only thing I could suggest, if we follow the top-down structure, is a list of objects with a name and a list of chain indices. Even this idea does not suit me since, as I stated in previous comments, a chain may be anything, even several "molecules" according to the classic meaning of this word. Residues and chains are widely used so it makes all the sense having these fields. Molecules, in the other hand, are not the most standard classifiers. Many tools have something called "molecules" but they may be different things in different tools. I am not sure if it makes sense having another biomol-specific grouping field. Maybe we could use your 'groups' and 'group_types' implementation to handle these cases when someone needs an extra grouping layer. Or we could even skip this part now and wait for someone who needs this to implement it.

Yes, perhaps it is better to wait with defining a molecules field until we have a concrete use case.

d-beltran commented 2 years ago

Would the nsites field be sufficient for this?

Totally

There are two ways in which you could handle a changing number of atoms.

Perfect

Using the first letter of the chain names would indeed be nice. I assume that if there are more chains than letters in the alphabet chains with single letters would keep the letter?

Sure, it could be done like this

I think you could also explain the meaning of the labels at the info endpoint

Allright, I will update the PR

d-beltran commented 2 years ago

Last Friday meeting @sauliusg pointed that we were not handling atom names with the new biomol fields. Then @JPBergsma suggested that we could handle them with the species 'original_name' property.

I have been thinking about this and now I would like to suggest one last biomol field: 'biomol_atoms'. The idea of this field is handling any biomolecules atom-specific data not covered by species. This name is also more friendly than species for people from the biomolecular field.

In order to conserve the format in other biomol fields I would make a list of dictionaries.


_biomol_atoms
~~~~~~~~~~~~~

- **Description**: For each atom in the system there is a dictionary that describes this atom.
  This field is meant to store biomolecular atom-specific data which is not explicitly supported in the :field:`species`.
  Databases are allowed to add more properties as long as the properties are prefixed with the database specific prefix.
- **Type**: list of dictionaries with the properties:
   - :property:`name`: string (REQUIRED)
- **Requirements/Conventions**:
   - **Query**:  Support for queries on this property is OPTIONAL.
     If supported, only a subset of the filter features MAY be supported.
   - **name**: The atom name
   - The length of this list MUST match the length of the :field:`species` list.

- **Examples**:

.. code:: jsonc
  {
    "_biomol_atoms":[
      {
        "name": "N"
      },
      {
        "name": "CA"
      }
    ]
  }

What do you think?

rartino commented 2 years ago

@d-beltran Do you see a benefit with _biomol_atoms over adding _biomol_ fields inside species?

d-beltran commented 2 years ago

Hi Rickard,

I didn't know we had this possibility. It totally works for me if you prefer this way.

JPBergsma commented 2 years ago

I also was not sure whether this was allowed. Currently, there is nothing in the Optimade standard about custom nested fields. Perhaps it would be good to make a small change about this in the Optimade standard. Do you think that all the implementations of the database providers are already compatible with this ? I do agree that it would better to place the _biomol_atoms field within the species property. Perhaps using _biomol_atom_name would be a better name for this property, as _biomol_atoms sounds more generic, and I would therefore expect more properties to be stored under this field.

rartino commented 2 years ago

As I recall, being allowed to add provider-specific fields inside species was part of the original design idea. But, after looking through the text I agree that it isn't directly specified as allowed or forbidden. But, if you just put the standardization of _biomol_atom_name inside species in this PR, I suppose that clarifies our stance. (But maybe another PR should also put in specifially that provider-specific names are allowed here.)