cannin / enhance_nlp_interaction_network_gsoc2020


Extract Data for Reactome Articles (MeSH Terms, Journal) #6

Open cannin opened 4 years ago

cannin commented 4 years ago

@PritiShaw can you grab the MeSH terms for the PMIDs in this file?

https://reactome.org/download/current/ReactionPMIDS.txt

There are many duplicates, so make a unique list. Do the same exercise as before with MTI, and also pull the terms from the PubMed API.
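A minimal sketch of the deduplication step, assuming each line of `ReactionPMIDS.txt` ends with a numeric PMID as its last whitespace-separated field (the file layout is not shown in this thread, so that is an assumption). The function name `unique_pmids` and the sample reaction IDs are hypothetical.

```python
def unique_pmids(lines):
    """Return sorted unique PMIDs, reading the last field of each line."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Skip blank lines and any line whose last field is not numeric.
        if fields and fields[-1].isdigit():
            seen.add(fields[-1])
    return sorted(seen, key=int)


# Hypothetical sample rows; the real input would be the lines of ReactionPMIDS.txt.
sample = [
    "R-HSA-0000001\t10021361",
    "R-HSA-0000002\t10022829",
    "R-HSA-0000003\t10021361",
    "",
]
print(unique_pmids(sample))
```

The resulting list can then be fed to both MTI and the PubMed lookup so that each article is requested only once.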

cannin commented 4 years ago

@PritiShaw can you grab: MeSH, publication year (PubDate), journal (ISOAbbreviation), PMC ID (ArticleId IdType="pmc", if exists)
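One way to pull those four fields out of a PubMed XML record is sketched below with the standard-library `xml.etree.ElementTree`. The XML string is a hand-trimmed illustration of the PubMed record layout, not an actual API response, and `parse_articles` is a hypothetical helper name; fetching the XML is left out.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed record for illustration only.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>10022829</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>1999</Year></PubDate></JournalIssue>
          <ISOAbbreviation>EMBO J.</ISOAbbreviation>
        </Journal>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Animals</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Basement Membrane</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">10022829</ArticleId>
        <ArticleId IdType="pmc">PMC1171179</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Extract PMID, journal, year, PMC ID and MeSH descriptors per article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pub_date = article.find(".//JournalIssue/PubDate")
        year = ""
        if pub_date is not None:
            # Some records carry a free-text MedlineDate instead of Year.
            year = pub_date.findtext("Year") or (pub_date.findtext("MedlineDate") or "")[:4]
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID", default=""),
            "journal": article.findtext(".//Journal/ISOAbbreviation", default=""),
            "year": year,
            # Empty string when the article has no PMC ID.
            "pmcid": article.findtext(
                "PubmedData/ArticleIdList/ArticleId[@IdType='pmc']", default=""
            ),
            "mesh_terms": [d.text for d in article.iterfind(".//MeshHeading/DescriptorName")],
        })
    return records


print(parse_articles(SAMPLE_XML))
```

The PMC ID lookup is anchored at `PubmedData` on purpose, so that identifiers listed for cited references elsewhere in a record are not picked up by mistake.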

PritiShaw commented 4 years ago

Hi Mentor, I have implemented the suggestions you gave regarding file format and headers. Please find the complete result at All PMID output; it has around 21,000 PMIDs. I have also made a Truncated output (~3000 PMIDs) so that GitHub can present the data.

Adding 5 PMIDs for your reference:

| PMID | JOURNAL_TITLE | YEAR | PMCID | MESH_TERMS |
| --- | --- | --- | --- | --- |
| 10021361 | Curr. Biol. | 1999 | | Humans,SLP-76 signal Transducing adaptor proteins,Phosphoproteins,Signal Transduction,GRB2 protein, human,GRB2 Adaptor Protein,SH3 Domains,Receptors, Antigen, T-Cell,Carrier Proteins,Phosphorylation,Nuclear Proteins,DNA-Binding Proteins,Membrane Proteins,Jurkat Cells,NFATC Transcription Factors,Binding Sites,*Hematopoietic System,Amino Acid Sequence,Tyrosine |
| 10022829 | EMBO J. | 1999 | PMC1171179 | Mice,Animals,laminin A,Laminin,perlecan,Dystroglycans,Heparin,Sulfoglycosphingolipids,Heparan Sulfate Proteoglycans,fibulin 2,Extracellular Matrix Proteins,nidogen,Membrane Glycoproteins,gephyrin,Calcium-Binding Proteins,Heparitin Sulfate,Protein Binding,Binding Sites,Recombinant Proteins,Basement Membrane |
| 10022833 | EMBO J. | 1999 | PMC1171183 | Stem Cell Factor,GRB2 protein, human,GRB2 Adaptor Protein,SH3 Domains,Signal Transduction,Suppressor of Cytokine Signaling Proteins,Phosphorylation,Receptor Protein-Tyrosine Kinases,Proto-Oncogene Proteins c-kit,Proto-Oncogene Proteins c-vav,Protein Binding,Tyrosine,Cell Proliferation |
| 10022860 | Mol. Cell. Biol. | 1999 | PMC83966 | Rats,Mice,Animals,ral Guanine Nucleotide Exchange Factor,PC12 Cells,ras Proteins,Signal Transduction,PI3-Kinase,Guanine Nucleotide Exchange Factors,ral GTP-Binding Proteins,RTN4 protein, human,*Nogo Proteins,NGF protein, human,Nerve Growth Factor,rho GTPases,ras Guanine Nucleotide Exchange Factors,Proto-Oncogene Proteins c-raf,Cell Differentiation,Neurite Outgrowth |
| 10022869 | Mol. Cell. Biol. | 1999 | PMC83975 | Transforming Growth Factor beta,Transcriptional Activation,Transcription Factor AP-1,SMAD3 protein, human,Smad3 Protein,SMAD4 protein, human,Smad4 Protein,Promoter Regions, Genetic,Trans-Activators,Proto-Oncogene Proteins c-jun,Gene Expression Regulation,Transcription, Genetic,Binding Site,*Cell Nucleus,Genes, Reporter,Protein Binding,Transfection,Luciferases |

Thanks

cannin commented 4 years ago

@PritiShaw I can make use of them as is to write some code, but can you re-run them and put a "|" between MeSH terms? For example, this one has a comma that will confuse a split:

IGFBP3 protein, human https://meshb.nlm.nih.gov/record/ui?ui=C515497

this would be safer: Transforming Growth Factor beta|Transcriptional Activation|Humans
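The ambiguity can be seen in a short sketch. The three terms are taken from the sample rows earlier in this thread; the variable names are only illustrative.

```python
# "SMAD3 protein, human" is a single MeSH term that contains a comma.
terms = ["Transforming Growth Factor beta", "SMAD3 protein, human", "Smad3 Protein"]

comma_joined = ",".join(terms)
pipe_joined = "|".join(terms)

# Splitting on "," breaks the middle term in two, giving 4 pieces instead of 3.
print(len(comma_joined.split(",")))

# Splitting on "|" recovers the original list exactly.
print(pipe_joined.split("|") == terms)
```

This relies on "|" not appearing inside any term, which holds for the samples shown here but is otherwise an assumption.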

cannin commented 4 years ago

I only get 2821 rows for the file: https://gist.github.com/PritiShaw/9ad43241c99f727afd04efbe0bdb77e8. Is it truncated?

wc -l all.tsv
    2821 all.tsv

PritiShaw commented 4 years ago

> I only get 2821 rows for the file: https://gist.github.com/PritiShaw/9ad43241c99f727afd04efbe0bdb77e8. Is it truncated?

I checked the revision history. I think it was truncated because I used the GitHub UI.

I have pushed the complete version with | as the separator for MeSH terms. You can find the complete result here: https://gist.github.com/PritiShaw/9ad43241c99f727afd04efbe0bdb77e8. There are 20,667 PMIDs in total.
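A quick sanity check on a file like this could look like the sketch below. It assumes a tab-separated file with the header row `PMID`, `JOURNAL_TITLE`, `YEAR`, `PMCID`, `MESH_TERMS` and "|" between MeSH terms; `summarize_tsv` is a hypothetical helper, and the two sample rows are abbreviated from the rows posted earlier.

```python
import csv
import io


def summarize_tsv(text):
    """Count data rows, unique PMIDs, and rows lacking a PMC ID."""
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    return {
        "rows": len(rows),
        "unique_pmids": len({row["PMID"] for row in rows}),
        "missing_pmcid": sum(1 for row in rows if not row["PMCID"]),
    }


# Abbreviated sample; the real input would be the contents of all.tsv.
sample_tsv = (
    "PMID\tJOURNAL_TITLE\tYEAR\tPMCID\tMESH_TERMS\n"
    "10021361\tCurr. Biol.\t1999\t\tHumans|GRB2 protein, human|Tyrosine\n"
    "10022829\tEMBO J.\t1999\tPMC1171179\tMice|Animals|Laminin\n"
)
print(summarize_tsv(sample_tsv))
```

Unlike `wc -l`, this count excludes the header line, so the two numbers would differ by one on a file with a header.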

Thanks