PROconsortium / PRoteinOntology


List of PRO terms representing all taxon-specific proteins #192

Closed nataled closed 4 years ago

nataled commented 4 years ago

We need to have a list of PRO terms that represent "all proteins in this organism". Such a list actually already exists and is provided with each release, but at the moment is incomplete. I will work on creating the list at release time (will be easy to do), but I wanted to find out what format to use. The existing file (taxbased_protein.dat) is just a single-column list of PRO terms, like:

PR:000000001
PR:000018263
PR:000029053
PR:000029031
PR:000029043
PR:000029065
PR:000029060
PR:000036194

A few questions:

1) @jz26 I recall this file was needed by you, or maybe it was @hongzhanhuang, for web site purposes. I'm not sure if that's still the case. Do you recall? Well, more to the point, is it still needed?

2) @chumingc I didn't quite catch what you said regarding what information you needed. Was it a mapping between the taxon ID and the PRO ID for such terms? And I think the name of...? Would this work:

NCBITaxon:9606    PR:000029067    WHICHEVER NAME YOU NEEDED

3) On the chance that taxbased_protein.dat is still needed, would it be best to modify that file according to the needs above, or to create a new file?

chumingc commented 4 years ago

I think this will work. We show people "Homo sapiens [9606]" and do the mapping behind the scenes.

Homo sapiens [9606] PR:000029067


jz26 commented 4 years ago

On Fri, Jul 17, 2020 at 3:29 PM Darren A. Natale notifications@github.com wrote:


> @jz26 I recall this file was needed by you, or maybe it was @hongzhanhuang, for web site purposes. I'm not sure if that's still the case. Do you recall? Well, more to the point, is it still needed?

I don't recall. Do you have more clues?


nataled commented 4 years ago

@jz26 I believe it might have been used to suppress showing a giant list of children for the listed PRO terms.

nataled commented 4 years ago

Based on offline conversations, it has been confirmed that the existing file listing the terms of interest can be adapted to another use. I will let you know when it is ready.

nataled commented 4 years ago

@chumingc the format you suggested will not be easy to parse in all cases, because some taxon names themselves contain square brackets. Instead, I'll produce the following:

taxon_name    taxon_number    PRO_identifier

You can find an example file at /home/dnatale/data/ontologies/for_release/taxbased_protein.dat on Hershey. Let me know if that's suitable. I can easily revise.

In the future, the file will be found at /data/pir/projects/pro/releaseNUM where NUM is the release number (next one being 61).
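
For anyone consuming the three-column file, here is a minimal sketch of how it could be read and turned into the "Homo sapiens [9606]" style labels mentioned above. It assumes the columns are tab-separated and that there is no header row; neither is confirmed in this thread. The function name `load_taxon_terms` and the second sample row (a made-up bracketed taxon name, number, and PRO ID) are hypothetical and exist only to illustrate the point.

```python
import csv
import io


def load_taxon_terms(handle):
    """Map display labels such as 'Homo sapiens [9606]' to PRO identifiers.

    Assumes rows of taxon_name<TAB>taxon_number<TAB>PRO_identifier
    (the tab delimiter is an assumption, not confirmed by the thread).
    """
    label_to_pro = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        taxon_name, taxon_number, pro_id = (field.strip() for field in row)
        # The label is built from separate columns, so square brackets
        # inside a taxon name never need to be parsed back out.
        label_to_pro[f"{taxon_name} [{taxon_number}]"] = pro_id
    return label_to_pro


# Sample data: the first row uses the pairing given in this thread;
# the second row is entirely hypothetical (bracketed name, fake IDs).
sample = io.StringIO(
    "Homo sapiens\t9606\tPR:000029067\n"
    "[Example] species\t12345\tPR:999999999\n"
)
mapping = load_taxon_terms(sample)
print(mapping["Homo sapiens [9606]"])  # prints PR:000029067
```

Keeping the taxon name and number in separate columns means the display label is only ever assembled, never split, which sidesteps the bracket ambiguity noted above.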