biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

load pathway.reactome data from Reactome directly #27

Closed by newgene 6 years ago

newgene commented 6 years ago

Currently pathway.reactome data is loaded from ConsensusPathDB. A user reported that the Reactome data might not be up-to-date:

https://mygene.info/v3/query?q=pathway.reactome.id:R-HSA-983712

returns 209 gene hits (206 proteins when counted by the "uniprot.Swiss-Prot" field),

while the Reactome website reports 184 proteins in this pathway:

https://reactome.org/PathwayBrowser/#/R-HSA-983712&DTAB=MT

We need to investigate the cause of the difference and decide whether to load Reactome data directly from Reactome.
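For reference, the "counting by uniprot.Swiss-Prot" figure can be reproduced with something like the sketch below. It assumes the hits come back as nested dicts, that the field holds either a single accession or a list of accessions, and that genes without the field are skipped. The function name and the sample hits are made up for illustration and are not real query results.

```python
def count_swissprot(hits):
    """Count distinct Swiss-Prot accessions across a list of gene hits."""
    accessions = set()
    for hit in hits:
        value = hit.get("uniprot", {}).get("Swiss-Prot")
        if value is None:
            continue  # gene hit without a Swiss-Prot accession
        if isinstance(value, str):
            value = [value]  # normalize a single accession to a list
        accessions.update(value)
    return len(accessions)


# Placeholder hits (hypothetical IDs, not actual API output).
sample_hits = [
    {"_id": "1", "uniprot": {"Swiss-Prot": "P00001"}},
    {"_id": "2", "uniprot": {"Swiss-Prot": ["P00002", "P00003"]}},
    {"_id": "3", "uniprot": {"Swiss-Prot": "P00001"}},  # same protein as hit 1
    {"_id": "4"},                                       # no uniprot field
]

print(count_swissprot(sample_hits))
```

Counting this way gives one number per distinct accession, so the gene-hit count and the protein count can legitimately differ.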

sirloon commented 6 years ago

In order to keep the current data structure (the "reactome" key under "pathway"), we need to implement a merger that merges data at different levels within the document. Currently, mergers only merge data at the root level. Here, part of the "pathway" structure would come from ConsensusPathDB (e.g. pathway.kegg), but pathway.reactome would come from a dedicated data source:

d1 = {"pathway": {"kegg": 1}}      # from ConsensusPathDB
d2 = {"pathway": {"reactome": 1}}  # from Reactome
# => merge at the "pathway" level
dfinal = {"pathway": {"kegg": 1, "reactome": 1}}
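A minimal sketch of what such a nested merge could look like, assuming plain Python dicts. The `merge_nested` helper is hypothetical and is not the BioThings merger; it only illustrates merging below the root level, with the second source winning on conflicting non-dict values.

```python
def merge_nested(base, extra):
    """Return a new dict with `extra` merged into `base`, recursing into sub-dicts."""
    merged = dict(base)  # shallow copy so `base` itself is not modified
    for key, value in extra.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            # both sides hold a dict under this key: merge one level deeper
            merged[key] = merge_nested(merged[key], value)
        else:
            # new key, or a non-dict value: take the value from `extra`
            merged[key] = value
    return merged


d1 = {"pathway": {"kegg": 1}}      # from ConsensusPathDB
d2 = {"pathway": {"reactome": 1}}  # from Reactome
dfinal = merge_nested(d1, d2)
print(dfinal)  # {'pathway': {'kegg': 1, 'reactome': 1}}
```

A root-level merge (e.g. `dict.update`) would instead replace the whole "pathway" value, dropping "kegg".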

sirloon commented 6 years ago

@newgene commits a9a1cbb099e5144f7724f4d354e76a6788aa3393 and 42f0faf8418cee06583b64464c8bab451a9c920c implement a dumper and uploader for Reactome data. I took the NCBI2Reactome_All_Levels.txt file from https://reactome.org/download-data (link "NCBI to All pathways"), as it was the one closest to the current data (e.g. gene 10 has 4 records = 4 lines in this file). Can you confirm?
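To illustrate the "one line per record" point, here is a rough sketch of grouping such a file by gene. It is not the code from the commits above. It assumes the file is tab-separated with six columns (gene ID, pathway ID, URL, pathway name, evidence code, species); the function name, output shape, and sample rows are placeholders for illustration only.

```python
import csv
import io
from collections import defaultdict


def parse_ncbi2reactome(handle, species="Homo sapiens"):
    """Group mapping rows by gene ID into {"pathway": {"reactome": [...]}} documents."""
    records = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or malformed lines
        gene_id, pathway_id, _url, name, _evidence, row_species = row[:6]
        if species is not None and row_species != species:
            continue
        entry = {"id": pathway_id, "name": name}
        if entry not in records[gene_id]:  # drop exact duplicate lines
            records[gene_id].append(entry)
    return {
        gene_id: {"_id": gene_id, "pathway": {"reactome": entries}}
        for gene_id, entries in records.items()
    }


# Placeholder rows (hypothetical IDs and names, not real file content).
SAMPLE = (
    "10\tR-HSA-0000001\thttps://example.org/1\tPlaceholder pathway A\tTAS\tHomo sapiens\n"
    "10\tR-HSA-0000002\thttps://example.org/2\tPlaceholder pathway B\tTAS\tHomo sapiens\n"
    "99\tR-MMU-0000003\thttps://example.org/3\tPlaceholder pathway C\tIEA\tMus musculus\n"
)

docs = parse_ncbi2reactome(io.StringIO(SAMPLE))
print(len(docs["10"]["pathway"]["reactome"]))  # 2 records for gene 10
```

With this shape, the number of lines for a gene in the file maps directly to the number of entries under pathway.reactome.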

newgene commented 6 years ago

@sirloon yes, that's right.

sirloon commented 6 years ago

Fixed and pushed to prod as of May 16th, with commits a9a1cbb099e5144f7724f4d354e76a6788aa3393, 42f0faf8418cee06583b64464c8bab451a9c920c and 5e6ade930f971ffc0f8c79fedd2359a1c6082fe2.