cannin / ihop-reach

A web application to access biological data extracted from biomedical literature.
https://reach.nrnb-docker.ucsd.edu
GNU Lesser General Public License v3.0

Setup Indra Locally for Data Generation #41

Open cannin opened 5 years ago

cannin commented 5 years ago

The Indra library will be used to process the full content of articles; set it up locally.

https://github.com/sorgerlab/indra

RohitChattopadhyay commented 5 years ago

I have installed it on my personal device. I tried methods such as reach.api.process_pubmed_abstract(), which returned a JSON file in a FRIES-like format. However, the function reach.api.process_pubmed_abstract() is not completing; I have also tried it in Google Colab. Thanks.

cannin commented 5 years ago

Download the PubMed Central dataset:

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

This archive contains a CSV with the following columns:

Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date

We need to create this for PubMed with these columns:

Journal Title,Year,DOI,PMCID,PMID

PubMed is a separate but related dataset. It is broken into several files; here are samples:

ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2018-sample/

We will use Indra for this extraction; the software can extract everything needed. Use get_metadata() from pubmed_client.py in Indra to extract the necessary information. Modify it as necessary to extract from the file. This will become a pull request.

Write your own separate code that uses Indra to process the files and create the necessary output, also as a CSV.
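To make the target output concrete, here is a minimal sketch of the XML-to-CSV step. It deliberately uses only the Python standard library rather than Indra, so it is a stand-in for illustration and not the Indra-based script requested above. The function names (`article_row`, `pubmed_xml_to_csv`) are hypothetical, and the element paths assume the usual PubMed baseline layout (`PubmedArticle` > `MedlineCitation` / `PubmedData`); check them against the sample files before relying on them.

```python
import csv
import xml.etree.ElementTree as ET

# Target columns requested in this issue.
COLUMNS = ["Journal Title", "Year", "DOI", "PMCID", "PMID"]


def _text(element, path):
    """Return the stripped text at `path` under `element`, or '' if absent."""
    found = element.find(path) if element is not None else None
    return (found.text or "").strip() if found is not None else ""


def article_row(pubmed_article):
    """Build one CSV row (a dict) from a <PubmedArticle> element."""
    citation = pubmed_article.find("MedlineCitation")
    journal = citation.find("Article/Journal") if citation is not None else None

    year = _text(journal, "JournalIssue/PubDate/Year")
    if not year:
        # Assumption: some records only have a free-text MedlineDate,
        # e.g. "1998 Dec-1999 Jan"; take the leading four digits if present.
        medline_date = _text(journal, "JournalIssue/PubDate/MedlineDate")
        year = medline_date[:4] if medline_date[:4].isdigit() else ""

    # Keep the first DOI and PMC identifier listed for the article itself.
    ids = {"doi": "", "pmc": ""}
    for article_id in pubmed_article.findall("PubmedData/ArticleIdList/ArticleId"):
        id_type = article_id.get("IdType")
        if id_type in ids and not ids[id_type]:
            ids[id_type] = (article_id.text or "").strip()

    return {
        "Journal Title": _text(journal, "Title"),
        "Year": year,
        "DOI": ids["doi"],
        "PMCID": ids["pmc"],
        "PMID": _text(citation, "PMID"),
    }


def pubmed_xml_to_csv(xml_source, csv_target):
    """Stream articles from `xml_source` (path or binary file object) and
    write one row per article to the text file object `csv_target`.
    Returns the number of articles written."""
    writer = csv.DictWriter(csv_target, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for _, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == "PubmedArticle":
            writer.writerow(article_row(element))
            count += 1
            element.clear()  # drop the processed article's children to limit memory use
    return count
```

A call might look like `pubmed_xml_to_csv("pubmed19n0654.xml", open("out.csv", "w", newline=""))`, where both file names are placeholders. Streaming with `iterparse` is a design choice for the large baseline files; an Indra-based version would replace `article_row` with whatever the library returns.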

RohitChattopadhyay commented 5 years ago

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

This will have a CSV with the following columns:

Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date

We need to create this for PubMed with these columns: Journal Title,Year,DOI,PMCID,PMID

Would simply removing the other columns from the originally downloaded CSV with a script be enough?
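For reference, the column-removal approach being asked about could look roughly like the sketch below. It only shows the mechanics of dropping columns from PMC-ids.csv.gz; whether that output meets the PubMed requirement is the open question here. The function names and file paths are hypothetical.

```python
import csv
import gzip

# Columns to keep, in output order.
KEEP = ["Journal Title", "Year", "DOI", "PMCID", "PMID"]


def keep_columns(source_lines, target, columns=KEEP):
    """Copy CSV rows from `source_lines` to `target`, keeping only `columns`."""
    reader = csv.DictReader(source_lines)
    # extrasaction="ignore" silently drops every field not listed in `columns`.
    writer = csv.DictWriter(target, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


def trim_pmc_ids(src_path, dst_path):
    """Read the gzipped PMC-ids CSV at `src_path` and write the trimmed CSV."""
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        keep_columns(src, dst)
```

For example, `trim_pmc_ids("PMC-ids.csv.gz", "pmc-trimmed.csv")` would write a five-column file next to the download.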

RohitChattopadhyay commented 5 years ago

Made a sample output using ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2018-sample/pubmed19n0654.xml

Requesting feedback. Thanks.

RohitChattopadhyay commented 5 years ago

Made a basic script to extract the details and save them in a CSV file: link. The output CSV file is available here.

RohitChattopadhyay commented 5 years ago

Link to source of the files: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline