geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Define rules for imported model IDs and filenames #189

Closed dustine32 closed 11 months ago

dustine32 commented 2 years ago

We should discuss rules for model ID minting and file naming in the BioPAX pathway import process. Currently our two scenarios:

Should we add a prefix to these model IDs to help prevent ID collisions for data coming from multiple sources? (From @kltm: "so that any group could contribute without having to cross-check their IDs across all of our current IDs") For example, another MOD could at some point import their own pathway model for assimilatory sulfate reduction I using the same ID SO4ASSIM-PWY as YeastCyc. It would also aid in model file management: All YeastCyc models are models/YeastCyc_*.

A quick suggestion would be to use the full Pathway Xref ID from the BioPAX to supply the prefix:

  <bp:UnificationXref rdf:ID="UnificationXref126413">
    <bp:db rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Reactome</bp:db>
    <bp:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">R-HSA-350562</bp:id>
  </bp:UnificationXref>

Would result in model ID = Reactome_R-HSA-350562, filename = Reactome_R-HSA-350562.ttl.

  <bp:UnificationXref rdf:ID="UnificationXref65101">
    <bp:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SO4ASSIM-PWY</bp:id>
    <bp:db rdf:datatype="http://www.w3.org/2001/XMLSchema#string">YeastCyc</bp:db>
  </bp:UnificationXref>

Would result in model ID = YeastCyc_SO4ASSIM-PWY, filename = YeastCyc_SO4ASSIM-PWY.ttl.
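
As a rough illustration of the suggestion above, here is a minimal Python sketch that derives the prefixed model ID and filename from a `bp:UnificationXref`. It is not the pathways2GO implementation; the function name `model_id_from_xref`, the `sep` parameter, and the wrapping `rdf:RDF` element are assumptions made for the example. A real import would also need to pick the xref attached to the Pathway element, which is skipped here.

```python
import xml.etree.ElementTree as ET

# Standard BioPAX Level 3 and RDF namespaces.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The YeastCyc xref from the comment above, wrapped in a root element
# (the wrapper is an assumption so the fragment parses on its own).
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:UnificationXref rdf:ID="UnificationXref65101">
    <bp:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SO4ASSIM-PWY</bp:id>
    <bp:db rdf:datatype="http://www.w3.org/2001/XMLSchema#string">YeastCyc</bp:db>
  </bp:UnificationXref>
</rdf:RDF>"""


def model_id_from_xref(xref, sep="_"):
    """Hypothetical helper: join the xref's db and id into '<db><sep><id>'."""
    db = xref.findtext(f"{{{BP}}}db")
    local_id = xref.findtext(f"{{{BP}}}id")
    if not db or not local_id:
        raise ValueError("UnificationXref is missing bp:db or bp:id")
    return f"{db.strip()}{sep}{local_id.strip()}"


root = ET.fromstring(SNIPPET)
xref = root.find(f"{{{BP}}}UnificationXref")
model_id = model_id_from_xref(xref)
filename = f"{model_id}.ttl"
print(model_id, filename)  # YeastCyc_SO4ASSIM-PWY YeastCyc_SO4ASSIM-PWY.ttl
```

Because the lookup is by element name, the order of `bp:db` and `bp:id` inside the xref does not matter, which covers both snippets above.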

Tagging @deustp01 @ukemi @vanaukenk @cmungall @kltm

kltm commented 2 years ago

I think there is something odd about letting IDs sit uncontrolled in a global ID space while using unstructured pseudo-English text. For internally generated IDs, we have an algorithm that guarantees non-colliding names across multiple minervas, but imports are something we have not really talked about recently. I believe that either a non-colliding algorithm or a UUID should be used, or some other ruleset should be applied so that any group could contribute without having to cross-check their IDs against all of our current IDs. There should also be a rule for compactness, as "GLUCOSE-MANNOSYL-CHITO-DOLICHOL-GLUCOSE-MANNOSYL-CHITO-DOLICHOL.ttl" isn't great. Functionally, not using the algorithm or a UUID would mean some kind of light namespacing, like Reactome (not necessarily a resource name), for example:

  echo "FOO-PWY" | md5sum | cut -f 1 -d ' ' | awk '{print "YP-" $0}'
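
For anyone who would rather not shell out, the one-liner above can be approximated in Python. This is only a sketch of the hashing idea, not an agreed convention; the function name `hashed_model_id` is made up for the example, and the `YP-` prefix is taken from the command itself.

```python
import hashlib


def hashed_model_id(source_id: str, prefix: str = "YP-") -> str:
    """Hypothetical helper: prefix + MD5 hex digest of the source ID."""
    # `echo` appends a trailing newline, so the same bytes are hashed here
    # to stay consistent with the shell pipeline.
    digest = hashlib.md5((source_id + "\n").encode("utf-8")).hexdigest()
    return prefix + digest


example = hashed_model_id("FOO-PWY")
long_example = hashed_model_id(
    "GLUCOSE-MANNOSYL-CHITO-DOLICHOL-GLUCOSE-MANNOSYL-CHITO-DOLICHOL"
)
print(example)
print(len(example), len(long_example))  # both are len("YP-") + 32 hex chars
```

The result is always the same length regardless of how long the source name is, which addresses the compactness concern, at the cost of IDs that are no longer human-readable.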

dustine32 commented 1 year ago

On the 2022-08-16 Alliance pathways call, we decided to prepend the YeastCyc prefix to the model ID and filename for YeastPathways. We will leave the Reactome ID/filename code alone (ex: model ID=R-HSA-350562, filename=R-HSA-350562.ttl).

@suzialeksander @vanaukenk For the YeastCyc prefix, just confirming with you: @kltm and I would like to use the YeastCyc- prefix (containing a hyphen) rather than YeastCyc_ (with an underscore) in model IDs/filenames (ex: model ID=YeastCyc-SO4ASSIM-PWY, filename=YeastCyc-SO4ASSIM-PWY.ttl). Is this OK with you?

dustine32 commented 1 year ago

Fixed by #251.

suzialeksander commented 1 year ago

OK, for the YeastPathways import, our first choice would simply be SGD, so gomodel:SGD-SERSYN-PWY . Rather close second choice is YeastPathways, making it gomodel:YeastPathways-SERSYN-PWY .

The current name, 'YeastCyc', is not correct.

Thanks @dustine32

dustine32 commented 1 year ago

Thanks @suzialeksander! Anticipating the import of SGD standard annotations into Noctua, the existing gene-centric model ID convention is to use the MOD gene product ID, e.g., WB_WBGene00077700, MGI_MGI_99187, ZFIN_ZDB-GENE-020424-3. So, as long as there are no SGD gene product IDs that would conflict with the YeastPathways IDs, I think changing the YeastCyc- part of the new YeastPathways model IDs (and filenames) to SGD- is fine. I believe the YeastPathways IDs (220 of them) are already static, so it should be possible to be confident about this.

@suzialeksander Can you confirm that this conflict will not (or is unlikely to) occur, and that I can go ahead and change to the SGD- prefix?

Oh crud. I just realized someone may not like the casual alternating of hyphen - and underscore _. @kltm

kltm commented 1 year ago

@dustine32 Yeah, I'm not wild about this, but I think at this point we have so much "variety" that it may not be worth trying to get the horse back in the stable. It would be good to work out a universal ruleset for different kinds of imports moving forward.

suzialeksander commented 1 year ago

Correct, none of these 220 should cause issues with SGD: gomodel:SGD-SERSYN-PWY or gomodel:SGD_SERSYN-PWY, whichever keeps the horse happy. SGDIDs should all be something like SGD:S000001855

suzialeksander commented 1 year ago

Decision: gomodel:YeastPathways_SERSYN-PWY @dustine32

suzialeksander commented 11 months ago

@kltm, this ticket was originally about defining rules. I don't think we've defined any rules, simply come up with a solution for this particular import. Do you want this ticket to be moved somewhere/kept open for discussion?

else OK to close.

kltm commented 11 months ago

I promise, along with @dustine32, that we won't forget. That said, https://github.com/geneontology/pathways2GO/issues/189#issuecomment-1546238034 is the template: "pseudo-namespace_not-too-long-and-ideally-unique-id". The details can be worked out as we go as part of the SOP.
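
To make the template a little more concrete, here is a hedged Python sketch of what a check for "pseudo-namespace_not-too-long-and-ideally-unique-id" might look like. None of this is an agreed rule: the function name `build_model_id`, the 40-character limit, and the allowed character sets are all assumptions chosen for illustration, since the thread sets no numeric limits.

```python
import re

# Assumption: the thread does not define "not too long"; 40 is arbitrary.
MAX_LOCAL_ID_LENGTH = 40

# Assumption: the namespace is alphanumeric and the local ID avoids
# underscores, so the first "_" unambiguously separates the two parts.
NAMESPACE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")
LOCAL_ID_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.-]*$")


def build_model_id(namespace, local_id, existing_ids=frozenset()):
    """Hypothetical helper: return '<namespace>_<local_id>' or raise ValueError."""
    if not NAMESPACE_RE.match(namespace):
        raise ValueError(f"bad pseudo-namespace: {namespace!r}")
    if not LOCAL_ID_RE.match(local_id):
        raise ValueError(f"bad local ID: {local_id!r}")
    if len(local_id) > MAX_LOCAL_ID_LENGTH:
        raise ValueError(f"local ID longer than {MAX_LOCAL_ID_LENGTH} characters")
    model_id = f"{namespace}_{local_id}"
    # "Ideally unique": only checked against the IDs the caller already knows.
    if model_id in existing_ids:
        raise ValueError(f"model ID already in use: {model_id}")
    return model_id


print(build_model_id("YeastPathways", "SERSYN-PWY"))  # YeastPathways_SERSYN-PWY
```

With these assumed limits, the ID decided on for this import passes, while a 63-character name like GLUCOSE-MANNOSYL-CHITO-DOLICHOL-GLUCOSE-MANNOSYL-CHITO-DOLICHOL is rejected as too long.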