PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

BIND: elements <organism> have human, homo sapiens names but taxonomy ID is not 9606 #193

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Data source: BIND (PSI-MI, converted to BioPAX L3 with Paxtools 4.3.0)
Date downloaded: 15-Dec-2010
Version, if available: 1_0
File location: 
http://download.baderlab.org/BINDTranslation/release1_0/PSIMI25_XML/taxid9606_PSIMI25.xml

Data source issue:

There are interactors (in PSI-MI; these are converted to BioPAX entity reference 
objects) whose 'organism' has the names "Human" or "Homo sapiens", but whose 
taxonomy ID is not 9606.

As a result, the following BioSource objects were created in the PC2 v5 db:
http://www.pathwaycommons.org/pc2/search?q=homo%20NOT%209606&type=BioSource

Another query/example (there are 1297 physical entities that refer to the above 
BioSources):
http://www.pathwaycommons.org/pc2/search?q=sapiens%20NOT%209606&type=physicalentity&datasource=bind

They are all from BIND, because the next query, which does not filter by data 
source, returns the same 1297 hits:
http://www.pathwaycommons.org/pc2/search?q=sapiens%20NOT%209606&type=physicalentity
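
For anyone who wants to repeat these searches with other terms, here is a minimal Python sketch that composes the same kind of URL. It only builds the string and makes no request. The function name `build_search_url` is hypothetical, and it assumes the search endpoint takes just the `q`, `type` and `datasource` parameters seen in the links above.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the links of this report.
PC2_SEARCH = "http://www.pathwaycommons.org/pc2/search"


def build_search_url(query, biopax_type, datasource=None):
    """Compose a PC2 search URL (hypothetical helper; handles q, type, datasource only)."""
    params = {"q": query, "type": biopax_type}
    if datasource is not None:
        params["datasource"] = datasource
    # quote_via=quote encodes spaces as %20 (not '+'), matching the links above.
    return PC2_SEARCH + "?" + urlencode(params, quote_via=quote)
```

Calling `build_search_url("sapiens NOT 9606", "physicalentity", "bind")` reproduces the data-source-filtered link, and leaving out the last argument gives the unfiltered one.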

In the original data file, they have:

<organism ncbiTaxId="9770">
  <names>
    <shortLabel>Homo sapiens</shortLabel>
    <fullName>Homo sapiens</fullName>
  </names>
</organism>

The six "wrong" taxonomy IDs found there are: 9770, 32644, 62928, 10095, 32630, 13555.
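
A rough way to list such organisms straight from the PSI-MI file is sketched below in Python. This is only an illustration, not the check that was actually run. The helper names are hypothetical, the name matching is a simple case-insensitive comparison against "homo sapiens" and "human", and the whole document is parsed in memory, so a very large file would need a streaming parser instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

HUMAN_TAXID = "9606"
HUMAN_NAMES = {"homo sapiens", "human"}  # assumed spellings, compared lower-cased


def local_name(tag):
    """Strip an XML namespace: '{net:sf:psidev:mi}organism' -> 'organism'."""
    return tag.rsplit("}", 1)[-1]


def organism_labels(organism):
    """Lower-cased shortLabel/fullName values from the organism's own <names> child."""
    labels = set()
    for child in organism:
        if local_name(child.tag) != "names":
            continue  # skip nested elements that may carry their own names
        for name in child:
            if local_name(name.tag) in ("shortLabel", "fullName") and name.text:
                labels.add(name.text.strip().lower())
    return labels


def mislabeled_taxids(xml_text):
    """Count <organism> elements that have a human name but a non-9606 ncbiTaxId."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "organism":
            continue
        taxid = elem.get("ncbiTaxId")
        if taxid != HUMAN_TAXID and organism_labels(elem) & HUMAN_NAMES:
            counts[taxid] += 1
    return counts
```

The function returns a `Counter` keyed by the offending taxonomy ID, so the distinct IDs and how often each one occurs can be read off together.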

Is the data source aware of the issue?
No, not reported yet.

We could deal with this issue in the next PC2 releases by implementing a special 
cPath2 Cleaner (BindCleanerImpl), which would either remove the "human" names or 
replace the taxonomy IDs with 9606, but I am not sure which is the correct fix. A 
better option would be to contact the BIND db maintainers (if it's still maintained).
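
To make the two candidate fixes concrete, here is a small Python sketch of what such a cleaning step could do to the XML. It is not the cPath2 Cleaner API (that would be Java). The function `clean_organisms` and its `strategy` values are hypothetical, and neither strategy is claimed to be the correct one.

```python
import xml.etree.ElementTree as ET

HUMAN_TAXID = "9606"
HUMAN_NAMES = {"homo sapiens", "human"}  # assumed spellings, compared lower-cased


def local_name(tag):
    """Strip an XML namespace: '{net:sf:psidev:mi}organism' -> 'organism'."""
    return tag.rsplit("}", 1)[-1]


def clean_organisms(xml_text, strategy="set_taxid"):
    """Apply one of the two proposed fixes to mislabeled <organism> elements.

    strategy="set_taxid":  overwrite ncbiTaxId with 9606
    strategy="drop_names": remove the <names> child, keep the taxonomy ID
    """
    if strategy not in ("set_taxid", "drop_names"):
        raise ValueError("unknown strategy: " + strategy)
    root = ET.fromstring(xml_text)
    # Collect first, so that removing children never disturbs the tree walk.
    organisms = [e for e in root.iter() if local_name(e.tag) == "organism"]
    for organism in organisms:
        if organism.get("ncbiTaxId") == HUMAN_TAXID:
            continue
        names = [c for c in organism if local_name(c.tag) == "names"]
        labels = {n.text.strip().lower() for c in names for n in c if n.text}
        if not labels & HUMAN_NAMES:
            continue  # not a "human" name, leave untouched
        if strategy == "set_taxid":
            organism.set("ncbiTaxId", HUMAN_TAXID)
        else:
            for c in names:
                organism.remove(c)
    return ET.tostring(root, encoding="unicode")
```

Either strategy touches only organisms whose name says human while the taxonomy ID says otherwise. Which of the two (if any) is right depends on whether the name or the ID is the erroneous part, which is exactly the open question here.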

Original issue reported on code.google.com by rod...@gmail.com on 17 Nov 2014 at 9:18

GoogleCodeExporter commented 9 years ago
Could you please ask Ruth to look into this?  Looks like many/most/all? are 
from converting data from the PDB to BIND.  Some of these are legitimate e.g. 
32630 is synthetic construct, so it could be a human protein synthetically 
created.  However, the biosource name should match the taxonomy ID.  Thanks, Gary

Original comment by gary.bad...@gmail.com on 18 Nov 2014 at 5:31

GoogleCodeExporter commented 9 years ago
Are all BIND files translated or only a subset or just the human file 
(taxid9606_PSIMI25.xml)?

I found additional wrong organisms: 0, 1260, 1280, 4896, 6431, 8355, 9598, 
9615, 9913, 9986, 10090, 10116, 10407, 11676, 12475

From the Human BIND file, there are 653 records where the biosource is Homo 
sapiens and the taxid is not 9606, or the biosource is not Homo sapiens and the 
taxid is 9606.  Of those, 551 originate from PDB.  In the BIND record, the 
biosource is specified as Homo sapiens, taxid 9606.  When we updated the 
identifiers, the taxids got updated to reflect the protein/gene identifier 
associated with the interactor instead of using what was specified in the BIND 
record.  Unfortunately, the taxon name did not get updated to reflect this 
change. 

Original comment by rr.weinb...@gmail.com on 19 Nov 2014 at 4:32

GoogleCodeExporter commented 9 years ago
Here, as the first message says, we're talking only about this file:  
http://download.baderlab.org/BINDTranslation/release1_0/PSIMI25_XML/taxid9606_PSIMI25.xml
(aye, good to know that the rest of BIND data have the same issue).

Ruth, would you please generate a new fixed file soon, if possible? ;)

Original comment by rod...@gmail.com on 19 Nov 2014 at 5:14

GoogleCodeExporter commented 9 years ago
Ok, as a quick "fix", I updated the psimi-converter to use the human organism 
(BioSource object, taxonomy 9606, "Homo sapiens") for all those entities where 
the organism name was "Homo sapiens" (or "human") but the taxonomy ID wasn't 9606. 

I think it might work, because: a) the BIND data was claimed to be human data; 
b) if an experimental form of a protein wasn't human, the experiment was still 
aimed at inferring/proving a human PPI (also, the converter currently ignores the 
<experimentalInteractorList> element anyway); c) the protein/gene identifiers 
in some cases actually belong either to human or to multiple organisms.

But, e.g., POLR2A, with "genbank identifier" (NCBI GI) 12781 (which should really 
be "gi:12781") and "CAA43449" (a GenPept ID, though it's called "ensembl" there 
for some reason), had an organism with taxID:9770 and name "Homo sapiens" in the 
PSI-MI, and is in fact not human...

Original comment by rod...@gmail.com on 12 Feb 2015 at 8:58

GoogleCodeExporter commented 9 years ago

Original comment by rod...@gmail.com on 13 Feb 2015 at 3:33

IgorRodchenkov commented 9 years ago

Well, won't fix (actually, I reverted the previous fix attempt, where taxonomy IDs were replaced with 9606 if the name was "Homo sapiens", because the participants' protein/gene identifiers were in fact not human...). This must be investigated and fixed in the original BIND data.