PathwayCommons / factoid-converters

Web services for Factoid project to convert between JSON, BioPAX, SBGN data formats
http://biopax.baderlab.org/factoid-converters/
MIT License

Biofactoid data in PC 'beta' instance not mapped to non-BioPAX formats #27

Open jvwong opened 2 years ago

jvwong commented 2 years ago

Background

Currently, there is a 'beta' testing instance of cPath2 accessible at https://beta.pathwaycommons.org/ which is loaded with Pathway Commons v12 data in addition to data exported from Biofactoid.

Issue

In using the web service to retrieve Biofactoid-sourced pathway data in various formats (BioPAX, SIF, TXT, SBGN), I have noticed that in some cases, the non-BioPAX formats return no data.


Notes and clues

a. It seems like this issue is exclusively a problem with Biofactoid pathways involving non-human gene/gene products

I looked through a few of the pathways that didn't involve human genes, and it seems like these universally show the same bug.

b. For non-human pathways, the participants (i.e. proteins) all seem to possess ProteinReferences that reference a RelationshipXref, but never a UnificationXref.
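One rough way to check observation (b) on an exported BioPAX file is to list every ProteinReference that has no UnificationXref among its xrefs. The sketch below uses only the Python standard library. The `SAMPLE` document and the function names are hypothetical and are not taken from the beta instance. It assumes xrefs are linked by `rdf:resource` rather than nested inline, and that no `xml:base` resolution is needed.

```python
import xml.etree.ElementTree as ET

BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical two-protein sample for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:ProteinReference rdf:ID="pr_human">
    <bp:xref rdf:resource="#ux1"/>
    <bp:xref rdf:resource="#rx1"/>
  </bp:ProteinReference>
  <bp:ProteinReference rdf:ID="pr_mouse">
    <bp:xref rdf:resource="#rx2"/>
  </bp:ProteinReference>
  <bp:UnificationXref rdf:ID="ux1"/>
  <bp:RelationshipXref rdf:ID="rx1"/>
  <bp:RelationshipXref rdf:ID="rx2"/>
</rdf:RDF>"""


def local_id(element):
    """Return the element's rdf:ID or rdf:about value without a leading '#'."""
    value = element.get(f"{{{RDF}}}ID") or element.get(f"{{{RDF}}}about") or ""
    return value.lstrip("#")


def references_without_unification_xref(biopax_xml):
    """List ids of ProteinReferences that point to no UnificationXref."""
    root = ET.fromstring(biopax_xml)

    # Index every xref element by id so that rdf:resource links can be resolved.
    xref_class = {}
    for cls in ("UnificationXref", "RelationshipXref"):
        for element in root.iter(f"{{{BP}}}{cls}"):
            xref_class[local_id(element)] = cls

    missing = []
    for reference in root.iter(f"{{{BP}}}ProteinReference"):
        classes = set()
        for xref in reference.findall(f"{{{BP}}}xref"):
            target = xref.get(f"{{{RDF}}}resource", "").lstrip("#")
            classes.add(xref_class.get(target))
        if "UnificationXref" not in classes:
            missing.append(local_id(reference))
    return missing


print(references_without_unification_xref(SAMPLE))
```

Running the same function over a human and a non-human pathway export would show whether the two really differ in this respect.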

metincansiper commented 2 years ago

> b. For non-human pathways, the participants (i.e. proteins) all seem to possess ProteinReferences that reference a RelationshipXref, but never a UnificationXref.

@jvwong it looks like I made some updates in the unstable branch to use UnificationXref for ggp entities and RelationshipXref for the other ones. These updates have not been merged into master yet, so in the instance that we used for the build all of them were RelationshipXrefs.

Is it enough to use UnificationXref only for ggp entities, as is done in the unstable branch now? If not, when should we use RelationshipXref and when UnificationXref?

jvwong commented 2 years ago

> @jvwong it looks like I made some updates in the unstable branch to use UnificationXref for ggp entities and RelationshipXref for the other ones. These updates have not been merged into master yet, so in the instance that we used for the build all of them were RelationshipXrefs.

Sounds like it's worth a try. Is there any reason why the human pathways don't seem to have this problem (in the current instance/master), that is, why they are mapping to UnificationXref correctly?

metincansiper commented 2 years ago

> Sounds like it's worth a try. Is there any reason why the human pathways don't seem to have this problem (in the current instance/master), that is, why they are mapping to UnificationXref correctly?

@jvwong I think what is happening in the master branch is that:

I wonder if the UnificationXrefs that you mention are the ones assigned to the organisms?

jvwong commented 2 years ago

OK let me know if you have a chance to rebuild the beta. I think this is the only issue required for a v13 release.

gbader commented 2 years ago

Did we fix the biofactoid metadata file (e.g. add logo and pubmed ID)?

jvwong commented 2 years ago

> Did we fix the biofactoid metadata file (e.g. add logo and pubmed ID)?

I posted this one: https://github.com/PathwayCommons/cpath2/issues/313

jvwong commented 2 years ago

I updated the "Factoid binary interactions" Google Doc with some items on how to assign Xrefs (and subclasses), some of which are reproduced below:

On Xrefs for participants

Biofactoid helps assign external public database identifiers to molecular interaction participants (except for Complex) from ChEBI or NCBI Gene. This is via our grounding-search application.

For small molecules, it is reasonable to assign a UnificationXref to an entity reference (ChEBI).

For genes and their products, it is more appropriate to assign a RelationshipXref, for the simple reason that physical entity types (RNA, PROTEIN) merely reference an underlying gene (locus) but are not identified by it per se. An exception could be made for 'DNA', since in these cases the thing being referred to can be either a pseudogene locus or a transposon locus. When it is possible to map an NCBI Gene record to UniProt, it can be deemed appropriate to assign a UnificationXref, for two reasons: 1) UniProt folds (similar) alternative protein sequences from the same locus under the canonical sequence record; 2) we are effectively assigning NCBI Gene records for the user through the grounding-search top hit. These statements are summarized below:

Table: BioPAX Xref subtypes for Biofactoid interaction participants

| ENTITY_TYPE | DATABASE ID | NCBI_GENE_TYPE | Xref subclass |
| --- | --- | --- | --- |
| Chemical | ChEBI | n/a | UnificationXref |
| GGP | NCBI Gene | 'unknown', 'biological region', 'other' | RelationshipXref |
| DNA | NCBI Gene | 'pseudo', 'transposon' | UnificationXref |
| RNA | NCBI Gene | 'tRNA', 'rRNA', 'snRNA', 'scRNA', 'snoRNA', 'miscRNA', 'ncRNA' | RelationshipXref |
| PROTEIN | NCBI Gene | 'protein-coding' | RelationshipXref |
| PROTEIN | UniProt/SwissProt | n/a | UnificationXref |
| COMPLEX | - | n/a | n/a |
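
To make the table above easier to test, its rows could be encoded as a small lookup. This is only a sketch in Python, not the converter's actual code. The names `RULES` and `xref_subclass` are hypothetical. Returning `None` for COMPLEX, and for any combination the table does not list, is an assumption.

```python
# Each rule mirrors one table row:
# (entity type, database, allowed NCBI gene types or None, Xref subclass)
RULES = [
    ("CHEMICAL", "ChEBI", None, "UnificationXref"),
    ("GGP", "NCBI Gene", {"unknown", "biological region", "other"},
     "RelationshipXref"),
    ("DNA", "NCBI Gene", {"pseudo", "transposon"}, "UnificationXref"),
    ("RNA", "NCBI Gene",
     {"tRNA", "rRNA", "snRNA", "scRNA", "snoRNA", "miscRNA", "ncRNA"},
     "RelationshipXref"),
    ("PROTEIN", "NCBI Gene", {"protein-coding"}, "RelationshipXref"),
    ("PROTEIN", "UniProt/SwissProt", None, "UnificationXref"),
]


def xref_subclass(entity_type, database, ncbi_gene_type=None):
    """Return the Xref subclass name for a participant, or None.

    None covers COMPLEX (no xref is assigned) and, by assumption, any
    combination that the table does not list.
    """
    for rule_entity, rule_database, gene_types, subclass in RULES:
        if entity_type.upper() != rule_entity or database != rule_database:
            continue
        # A rule with no gene-type restriction matches any gene type.
        if gene_types is None or ncbi_gene_type in gene_types:
            return subclass
    return None


print(xref_subclass("PROTEIN", "NCBI Gene", "protein-coding"))
print(xref_subclass("PROTEIN", "UniProt/SwissProt"))
```

Keeping the rules as data rather than nested conditionals means a change to the table touches only one row of `RULES`.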