Open cannin opened 5 years ago
I have installed it on my personal device.
I have tried methods like reach.api.process_pubmed_abstract(),
which returned a JSON file in a FRIES-like format.
The function reach.api.process_pubmed_abstract()
is not completing; I have also tried it in Google Colab.
Thanks
Download the PubMed Central dataset:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
The archive contains a CSV with the following columns:
Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
We need to create an equivalent file for PubMed with these columns:
Journal Title,Year,DOI,PMCID,PMID
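To make the target layout concrete, here is a minimal sketch that reduces a CSV with the PMC header to the five columns listed above. It uses only the Python standard library; the function name project_columns and the sample row are hypothetical placeholders, not real records or part of any existing tool.

```python
import csv
import io

# Target columns named in the task description above.
KEEP = ["Journal Title", "Year", "DOI", "PMCID", "PMID"]


def project_columns(src, dst, keep=KEEP):
    """Copy only the `keep` columns from the CSV text stream `src` to `dst`."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" drops every column that is not in `keep`.
    writer = csv.DictWriter(
        dst, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Made-up sample row that follows the PMC-ids.csv header (placeholder values).
sample = (
    "Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,"
    "Manuscript Id,Release Date\n"
    "Example Journal,0000-0000,,2001,1,1,1,10.0000/example,PMC0000000,"
    "00000000,,live\n"
)

out = io.StringIO()
project_columns(io.StringIO(sample), out)
print(out.getvalue())
```

For the downloaded archive, the same function should work on a stream opened with gzip.open("PMC-ids.csv.gz", "rt", newline=""), assuming the file was saved under that name.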
A separate but related dataset, the PubMed baseline, is split into several files. Samples are here:
ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2018-sample/
We will use Indra for this extraction, but the software may not extract everything needed as it stands. Use get_metadata() from pubmed_client.py in Indra to extract the necessary information, and modify it as necessary to extract any missing fields from the files. The modification will become a pull request.
Write your own separate code that uses Indra to process the files and writes the necessary output, also as a CSV.
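As a rough illustration of the output step, the sketch below pulls the five fields out of PubMed XML and writes them as CSV. It is a standard-library stand-in, not Indra's get_metadata(): the element paths reflect my reading of the PubMed XML layout, the names article_row and xml_to_csv are hypothetical, and the sample record contains placeholder values only.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["Journal Title", "Year", "DOI", "PMCID", "PMID"]


def article_row(article):
    """Build one output row from a <PubmedArticle> element."""
    def article_id(id_type):
        # Article-level IDs only; IDs inside reference lists are not matched.
        path = f"PubmedData/ArticleIdList/ArticleId[@IdType='{id_type}']"
        return article.findtext(path, default="")

    cit = "MedlineCitation/"
    return {
        "Journal Title": article.findtext(cit + "Article/Journal/Title", default=""),
        # Assumption: records that use <MedlineDate> instead of <Year> yield "".
        "Year": article.findtext(
            cit + "Article/Journal/JournalIssue/PubDate/Year", default=""
        ),
        "DOI": article_id("doi"),
        "PMCID": article_id("pmc"),
        "PMID": article.findtext(cit + "PMID", default=""),
    }


def xml_to_csv(xml_source, dst):
    """Stream PubMed XML from a path or binary file object and write CSV rows."""
    writer = csv.DictWriter(dst, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "PubmedArticle":
            writer.writerow(article_row(elem))
            elem.clear()  # free memory; the real files hold many records


# Invented record with placeholder values, shaped like a PubMed entry.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000000</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2001</Year></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">00000000</ArticleId>
        <ArticleId IdType="doi">10.0000/example</ArticleId>
        <ArticleId IdType="pmc">PMC0000000</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

out = io.StringIO()
xml_to_csv(io.BytesIO(SAMPLE), out)
print(out.getvalue())
```

Records without a DOI or PMCID simply get empty cells here. If the files are gzip-compressed, passing gzip.open(path, "rb") as xml_source should work the same way.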
Would simply removing the other columns from the original downloaded CSV with a script be enough?
I made a sample output using ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2018-sample/pubmed19n0654.xml
Requesting feedback. Thanks
Link to source of the files: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
The Indra library will be used to process the full content of articles; set it up locally.
https://github.com/sorgerlab/indra