We need to add various processed protein fragments to the protein tree as children of their respective proteins. These are visible on UniProt pages under "PTM/Processing" and "Features", e.g. https://www.uniprot.org/uniprotkb/P01308/entry#ptm_processing. They are available in machine-readable form as RDF, XML, or JSON. The previous Arborist code used RDF, but XML and JSON are easier to fetch from UniProt.
The hardest part of this task is efficiently fetching the RDF/XML/JSON with this information. I'm not sure of the best way to do this -- @danielmarrama knows more about the UniProt API than I do.
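One possible starting point for the fetch, as a rough sketch rather than a recommendation: ask the UniProt REST search endpoint for JSON and request only the feature fields we need. The helper names (`build_search_url`, `fetch_first_page`), the field list, and the `organism_id` query are assumptions on my part and should be checked against the UniProt API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"

# Assumed return-field names for the four feature types we care about,
# plus the sequence length needed to detect full-length features.
FEATURE_FIELDS = ["accession", "length", "ft_chain", "ft_peptide", "ft_signal", "ft_init_met"]


def build_search_url(taxon_id, size=500):
    """Build a search URL for all UniProtKB entries of one organism (hypothetical helper)."""
    params = {
        "query": f"organism_id:{taxon_id}",
        "format": "json",
        "fields": ",".join(FEATURE_FIELDS),
        "size": size,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_first_page(taxon_id):
    """Fetch only the first page of results; pagination is not handled here."""
    with urlopen(build_search_url(taxon_id)) as response:
        return json.load(response)["results"]
```

This only retrieves one page. As far as I know, further pages are advertised in the `Link` response header, and there is also a streaming endpoint, so the real implementation will need to pick one of those.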
We do not care about all the features that UniProt describes. We only care about these types (old code): "Chain", "Peptide", "Signal_Peptide", "Initiator_Methionine".
If a feature spans the full length of the protein, then we do not care about it.
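The two filtering rules above (wanted types only, no full-length features) could be sketched as follows. This assumes the JSON layout where each feature has a `type` and a `location` with `start.value` and `end.value`. It also assumes the JSON uses "Signal" and "Initiator methionine" where the old code used "Signal_Peptide" and "Initiator_Methionine", so both spellings are accepted. The function names are hypothetical.

```python
# Type names from the old code, plus the spellings I assume the JSON uses.
WANTED_TYPES = {
    "Chain",
    "Peptide",
    "Signal_Peptide",
    "Initiator_Methionine",
    "Signal",
    "Initiator methionine",
}


def feature_span(feature):
    """Return (start, end) for a feature; either may be None if unknown."""
    location = feature["location"]
    return location["start"].get("value"), location["end"].get("value")


def select_fragments(features, sequence_length):
    """Keep wanted feature types, dropping any that cover the whole protein."""
    selected = []
    for feature in features:
        if feature.get("type") not in WANTED_TYPES:
            continue
        start, end = feature_span(feature)
        if start is None or end is None:
            continue  # position unknown, so we cannot build a fragment
        if start == 1 and end == sequence_length:
            continue  # full-length feature: not a fragment
        selected.append(feature)
    return selected
```

Skipping features with unknown positions is my own choice here; we may prefer to log them instead.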
In order to insert these into the protein tree, they need subject CURIEs. Some fragments have PRO identifiers (UniProt feature identifiers, not Protein Ontology terms) but others do not. The RDF version of the data always had an identifier of some sort, but it's not worth using RDF just for that. When available, we should use the PRO IDs. Otherwise we should generate a distinct subject CURIE from the feature type, start, and end locations.
Fragments need labels. Some fragments have a "description", which should be used as the label. Otherwise a label should be generated from the fragment type, start, and end locations.
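A minimal sketch of the CURIE and label rules, assuming the JSON feature layout with optional `featureId` and `description` keys. The `EXAMPLE` prefix, the function names, and the exact generated formats are placeholders, not decisions. I have also included the parent accession in generated CURIEs, since type, start, and end alone would collide across proteins.

```python
FRAGMENT_PREFIX = "EXAMPLE"  # placeholder prefix; the real one is still to be decided


def _span(feature):
    """Return (start, end) from the assumed JSON location layout."""
    location = feature["location"]
    return location["start"]["value"], location["end"]["value"]


def fragment_curie(accession, feature):
    """Use the PRO ID when present, otherwise build an ID from type and span."""
    feature_id = feature.get("featureId")
    if feature_id:
        return f"{FRAGMENT_PREFIX}:{feature_id}"
    start, end = _span(feature)
    slug = feature["type"].lower().replace(" ", "_")
    return f"{FRAGMENT_PREFIX}:{accession}_{slug}_{start}_{end}"


def fragment_label(feature):
    """Use the description when present, otherwise build a label from type and span."""
    description = (feature.get("description") or "").strip()
    if description:
        return description
    start, end = _span(feature)
    return f"{feature['type']} {start}-{end}"
```

With these placeholder formats, a signal peptide without an ID or description on a made-up accession `X00001` spanning 1 to 24 would get the CURIE `EXAMPLE:X00001_signal_1_24` and the label `Signal 1-24`.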
In the current protein tree, Cavia porcellus (guinea pig) protein looks like a nice, small example where the fragments are working properly.
Synonyms
[I'm less sure about this part. Let's check in about it once the first part is complete.]
The RDF/XML/JSON for a protein may also contain synonyms that we should add to the protein tree.
To match the previous protein tree code, the names of the fragments should be copied to their parent as synonyms using "ONTIE:0003622" as the predicate.
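Copying fragment names up to the parent could look roughly like this. The only detail taken from the old code is the "ONTIE:0003622" predicate. The function name, the `(subject, predicate, object)` tuple shape, and the parent CURIE format in the usage note are assumptions.

```python
SYNONYM_PREDICATE = "ONTIE:0003622"


def parent_synonym_rows(parent_curie, fragment_labels):
    """Return (subject, predicate, object) rows adding each fragment label
    to the parent protein as a synonym, skipping blanks and duplicates."""
    seen = set()
    rows = []
    for label in fragment_labels:
        if not label or label in seen:
            continue
        seen.add(label)
        rows.append((parent_curie, SYNONYM_PREDICATE, label))
    return rows
```

For example, `parent_synonym_rows("UniProt:X00001", ["Example A chain", "Example B chain"])` yields two rows, both with `UniProt:X00001` as the subject.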