PROconsortium / PRoteinOntology


Accessing gene name consistently via PIR ID #126

Open nataled opened 7 years ago

nataled commented 7 years ago

Hello,

We would like to be able to access the pombe gene name consistently given the PR:ID.

We have noticed that the name is usually included, but not in a specific field. Is there any way to do this?

Thanks

Val

Reported by: ValWood

nataled commented 7 years ago

Hi Val,

Can you provide a few examples of PRO terms that have the gene name somewhere? I can easily find them myself because I know which terms should have them, but it is possible they are also found in terms that won't necessarily have them. Well, put more pragmatically, do you see them in an inconsistent way in terms NOT marked with Category=gene or Category=organism-gene? If not then I have an idea of what we can do to mark them. I'm guessing you see the genes indicated in the synonym field, is that correct?

Best regards, Darren

Original comment by: nataled

nataled commented 7 years ago

Hi Darren,

Apologies for the very delayed response. This fell off my radar, but Midori has looked out some examples to illustrate our problem.

Best,

Val


What I noticed is that terms marked Category=gene or Category=organism-gene have the gene name somewhere, but not always in exactly the same tag.

For Category=gene, sometimes the pombe gene name is present as the EXACT PRO-short-label synonym, but for others a gene name from a different species (often S. cerevisiae) is the EXACT synonym and the pombe name is a RELATED synonym. Maybe that's not such a big deal if we should be concentrating on Category=organism-gene terms, but then we may have to review to see whether we have more to request.

examples:

[Term]
id: PR:000027499
name: mitogen-activated protein kinase HOG1
def: "A p38-like stress-activated mitogen-activated protein kinase that is a translation product of the yeast HOG1 gene or a 1:1 ortholog thereof." [PMID:10207620, PRO:CNA]
comment: Category=gene. The gene HOG1 in S. pombe is named sty1.
synonym: "HOG1" EXACT PRO-short-label [PRO:DNx]
synonym: "MAP kinase spc1" EXACT []
synonym: "STY1" RELATED []
is_a: PR:000000001 ! protein

[Term]
id: PR:000027605
name: mediator of replication checkpoint protein 1
def: "A protein that is a translation product of the Schizosaccharomyces pombe 972h- mrc1 gene or a 1:1 ortholog thereof." [PRO:CNA]
comment: Category=gene. Requested by=PomBase.
synonym: "DNA replication checkpoint mediator mrc1" EXACT []
synonym: "mrc1" EXACT PRO-short-label [PRO:DNx]
is_a: PR:000000001 ! protein

For Category=organism-gene, most of the pombe terms have an EXACT PRO-short-label synonym consisting of the pombe gene name with the prefix "Spom-". I haven't yet spotted any pombe Category=organism-gene terms that don't have a PRO-short-label synonym, but I have found a few that don't use the standard pombe gene name. That would throw us off.

example - OK:

[Term]
id: PR:000027596
name: histone H3.3 (Schizosaccharomyces pombe)
def: "A fungal histone H3.3 that is encoded in the genome of Schizosaccharomyces pombe." [PMID:11242054, PMID:20929775, PomBase:MAH]
comment: Category=organism-gene. Requested by=PomBase.
synonym: "Spom-hht3" EXACT PRO-short-label [PRO:DNx]
is_a: PR:000027595 ! fungal histone H3.3
is_a: PR:000041293 ! core histone (Schizosaccharomyces pombe)
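A minimal sketch of how the PRO-short-label synonym could be pulled out of a stanza like the one above, assuming the stanza is read from the OBO file with one tag per line. The function name `short_label` is made up for illustration and is not part of any PRO or PomBase tool.

```python
import re

# Matches an OBO synonym line that carries the PRO-short-label type, e.g.
#   synonym: "Spom-hht3" EXACT PRO-short-label [PRO:DNx]
SHORT_LABEL_RE = re.compile(r'^synonym:\s+"([^"]+)"\s+EXACT\s+PRO-short-label\b')


def short_label(stanza):
    """Return the EXACT PRO-short-label synonym of a stanza, or None if absent."""
    for line in stanza.splitlines():
        match = SHORT_LABEL_RE.match(line.strip())
        if match:
            return match.group(1)
    return None


# Abridged copy of the PR:000027596 stanza quoted in this ticket.
stanza = """[Term]
id: PR:000027596
name: histone H3.3 (Schizosaccharomyces pombe)
comment: Category=organism-gene. Requested by=PomBase.
synonym: "Spom-hht3" EXACT PRO-short-label [PRO:DNx]
is_a: PR:000027595 ! fungal histone H3.3
"""

print(short_label(stanza))  # prints Spom-hht3
```

This only locates the label. Whether the label actually contains the standard pombe gene name is a separate question, as the next examples show.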

2 examples - standard gene name in PRO term name but not the short-label synonym:

[Term]
id: PR:000029999
name: DNA repair protein Crb2 (Schizosaccharomyces pombe)
def: "A tumor suppressor p53-binding protein 1 that is encoded in the genome of Schizosaccharomyces pombe." [PRO:DAN]
comment: Category=organism-gene.
synonym: "DNA repair protein rhp9 (Schizosaccharomyces pombe)" EXACT [PRO:DNx]
synonym: "Spom-TP53BP1" EXACT PRO-short-label [PRO:DNx]
is_a: PR:000000001 ! protein

[Term]
id: PR:000030002
name: serine/threonine-protein kinase cds1 (Schizosaccharomyces pombe)
def: "A serine/threonine-protein kinase Chk2 that is encoded in the genome of Schizosaccharomyces pombe." [PRO:DAN]
comment: Category=organism-gene.
synonym: "Spom-CHEK2" EXACT PRO-short-label [PRO:DAN]
is_a: PR:000000001 ! protein

example - the term name, definition, and PRO-short-label use the S.c. name, and the correct pombe name is only in another synonym:

[Term]
id: PR:O14216
name: DNA replication regulator sld2 (Schizosaccharomyces pombe 972h-)
alt_id: PR:000027524
def: "A DNA replication regulator sld2 that is encoded in the genome of Schizosaccharomyces pombe 972h-." [PMID:11937031, PomBase:MAH]
comment: Category=organism-gene. Requested by=PomBase.
synonym: "DNA replication regulator drc1" EXACT []
synonym: "SPAC6B12.11" RELATED []
synonym: "Spom972h-SLD2" EXACT PRO-short-label [PRO:DNx]
xref: UniProtKB:O14216
is_a: PR:000027523 ! DNA replication regulator sld2
is_a: PR:000029043 ! Schizosaccharomyces pombe 972h- protein

We also use quite a few PRO terms that are marked Category=organism-modification, and it would be really nice to be able to retrieve gene names for those as well. I think we could parse away the "Spom-" prefix and the "/[modification]" suffixes easily enough if that Spom-[gene-name]/[modification] syntax is consistent, and the PRO-short-label synonyms use standard gene names. But at the moment not all do.
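To make the parsing idea concrete, here is a rough sketch of stripping the prefix and the modification suffix from a short label. It assumes the Spom-[gene-name]/[modification] syntax holds, and it also accepts the strain-qualified "Spom972h-" prefix seen in one example above. The pattern and the function name are illustrative guesses, not an agreed PRO convention.

```python
import re

# Assumed label shape: "Spom" plus an optional strain tag (e.g. "972h"),
# a hyphen, the gene name, then an optional "/modification" suffix.
LABEL_RE = re.compile(r'^Spom[^-]*-(?P<gene>[^/]+)(?:/(?P<mod>.+))?$')


def gene_from_short_label(label):
    """Return the gene-name part of a pombe short label, or None if it does not fit."""
    match = LABEL_RE.match(label)
    return match.group("gene") if match else None


for label in ("Spom-prz1/UnMod", "Spom-hht3", "Spom972h-SLD2", "HOG1"):
    print(label, "->", gene_from_short_label(label))
# Spom-prz1/UnMod -> prz1
# Spom-hht3 -> hht3
# Spom972h-SLD2 -> SLD2
# HOG1 -> None
```

The third case shows the limitation: the parse succeeds, but the result is the S. cerevisiae name SLD2 rather than the pombe name drc1, so string parsing alone only works once the labels use standard pombe gene names.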

OK:

[Term]
id: PR:000027516
name: transcriptional regulator prz1 unmodified form (Schizosaccharomyces pombe)
def: "A transcriptional regulator prz1 unmodified form in Schizosaccharomyces pombe." [PMID:12637524, PomBase:MAH]
comment: Category=organism-modification. Requested by=PomBase.
synonym: "Spom-prz1/UnMod" EXACT PRO-short-label [PRO:DAN]
is_a: PR:000000001 ! protein

correct name only in a synonym other than the PRO-short-label (related to example above):

[Term]
id: PR:000027526
name: DNA replication regulator sld2 unmodified form (Schizosaccharomyces pombe)
def: "A DNA replication regulator sld2 unmodified form in Schizosaccharomyces pombe." [PMID:11937031, PomBase:MAH]
comment: Category=organism-modification. Requested by=PomBase.
synonym: "DNA replication regulator drc1 unmodified form (Schizosaccharomyces pombe)" EXACT []
synonym: "Spom-SLD2/UnMod" EXACT PRO-short-label [PRO:DAN]
is_a: PR:000000001 ! protein


Original comment by: ValWood

nataled commented 7 years ago

Hi Val, Midori,

The PRO-short-label is designed to give some indication of orthology, so whenever a pombe term is orthologous to a previously-existing term in PRO, the label reflects that (with, as you noted, Spom- prepended to it). The exception is when a term is defined based only on the encoding gene, in which case the actual gene name in that organism is used. Thus, it is not useful for your purpose.

As things stand right now, you should be able to reliably grab organism-gene terms (the ones roughly equivalent to UniProtKB entries but made specifically to the species level). All of these have a line in the stanza with has_gene_template and the official PomBase identifier and gene name. I'm not sure why, but the stanza examples you provided above seem to lack this line. I verified that our downloads do have it. Can you tell me where you get your downloads from so I can track down the issue?
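As a rough illustration of reading the has_gene_template line, the sketch below assumes it is written as an OBO relationship tag of the form `relationship: has_gene_template <DB>:<ID> ! <gene name>`. That layout is an assumption, since no such line is quoted in this ticket. The sample line is made up from the systematic ID and gene name that appear in the sld2 example above and is not copied from a PRO release.

```python
import re

# ASSUMED layout (not confirmed by this ticket):
#   relationship: has_gene_template <DB>:<ID> ! <gene name>
GENE_TEMPLATE_RE = re.compile(
    r'^relationship:\s+has_gene_template\s+(?P<gene_id>\S+)\s*(?:!\s*(?P<name>.+))?$'
)


def gene_template(stanza):
    """Return (gene_id, gene_name) from the first has_gene_template line, or None."""
    for line in stanza.splitlines():
        match = GENE_TEMPLATE_RE.match(line.strip())
        if match:
            name = match.group("name")
            return match.group("gene_id"), (name.strip() if name else None)
    return None


# Hypothetical stanza: the relationship line is illustrative only.
stanza = """[Term]
id: PR:O14216
name: DNA replication regulator sld2 (Schizosaccharomyces pombe 972h-)
comment: Category=organism-gene. Requested by=PomBase.
relationship: has_gene_template PomBase:SPAC6B12.11 ! drc1
"""

print(gene_template(stanza))  # prints ('PomBase:SPAC6B12.11', 'drc1')
```

If the real download formats the line differently, only the regular expression should need to change.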

The only way to get the desired gene name from an organism-modification term at the moment would be to use our SPARQL interface. I think it should be possible to ask the question "what is the name of the gene given in the organism-gene ancestor of term X?" Are you familiar with the interface?

I've actually been considering making the labels based on the name of the gene in the given organism. I will raise the discussion with the consortium members.

Original comment by: nataled

nataled commented 7 years ago

Hi Val, Midori,

I taught myself how to do SPARQL queries, at least enough to create the ones you need. If you go to the page http://pir.georgetown.edu/pro/pro_sparql.shtml there are a number of example queries. Each is designed to do some common database retrieval. The two at the bottom (#12 and #13) will be of most interest to you. #12 will return all PRO terms that represent proteins encoded by a particular gene of interest (which must be entered as a model organism database identifier). #13 is the one you most directly requested above: given a PRO identifier, what gene encoded that protein? Directions for use are provided within the query (click "show query").

I hope this will be useful to you. It was fun making them! I know...I'm weird.

Original comment by: nataled

nataled commented 7 years ago

That's OK, we are all weird here too ;)

I'm going to link to this ticket on the relevant tickets on our tracker... our developer (Kim Rutherford) will then be able to assess whether this will do the trick...

Basically we are looking for a way to automate sensible display of PRO names on our Gene pages.

More later when we get to these tickets.

Thanks for looking into this for us.

Val

Original comment by: ValWood

nataled commented 7 years ago

FYI https://github.com/pombase/website/issues/67

Original comment by: ValWood

nataled commented 7 years ago

And, just to attack on two fronts, the change-over from the current orthology-based PRO-short-labels to the organism-specific gene-based labels has been approved. These should go live in our next release (which, just so you don't get too excited, has not yet been scheduled--figure about a month).

Original comment by: nataled