Cannot find article data in papers-PubMed.jsonl

JunHanStudy commented 6 months ago

It is great to see that you have released an open-source medical LLM with SOTA performance. When I ran "python load.py --dataset papers --key_path keys.json", it output papers-PubMed.jsonl, but I cannot find any paper text in this dataset, only some basic info about each article. Does anyone know what's wrong? Thank you!

JunHanStudy commented 6 months ago

Are the articles for papers-PubMed.jsonl in the s2orc_*.jsonl files? Thank you!

AGBonnet commented 5 months ago

Hi, thanks for your interest!

Running the first step of the PubMed pipeline should download three separate datasets from the Semantic Scholar API and then merge them together.

The following three datasets should be downloaded:

pubmed contains article metadata
s2orc contains full-text articles
abstracts contains abstracts for these articles

They are then merged into two separate files:

Abstracts with metadata are stored in /data/abstracts-PubMed_metadata.jsonl
Full-text articles with metadata are stored in /data/s2orc-PubMed_metadata.jsonl
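To illustrate the merge step described above, here is a minimal sketch of joining metadata records with full-text records on a shared ID. This is not the repository's actual code: the helper names, the toy records, and the join key "corpusid" are assumptions made for illustration, so check the field names in your own files before relying on it.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def merge_on_key(
    metadata: Iterable[dict], texts: Iterable[dict], key: str = "corpusid"
) -> List[dict]:
    """Attach metadata to each text record that shares the same key value.

    The default key "corpusid" is an assumption, not confirmed by the thread.
    """
    meta_by_id: Dict[object, dict] = {m[key]: m for m in metadata if key in m}
    merged = []
    for record in texts:
        meta = meta_by_id.get(record.get(key))
        if meta is not None:
            # On a field-name clash, the text record's value wins.
            merged.append({**meta, **record})
    return merged


# Toy records standing in for the real files (hypothetical fields).
metadata_lines = [
    '{"corpusid": 1, "title": "Paper A"}',
    '{"corpusid": 2, "title": "Paper B"}',
]
fulltext_lines = ['{"corpusid": 2, "text": "Full text of paper B"}']

merged = merge_on_key(parse_jsonl(metadata_lines), parse_jsonl(fulltext_lines))
print(merged)
```

In this sketch a metadata record has no article text until it is joined with a matching full-text record, which would explain seeing only basic info in a metadata-only file.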

Did you run the pipeline with download.sh, and can you locate these files?
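If it helps, a quick way to check is a small script like the one below. It is only a sketch: the function name inspect_outputs is hypothetical, and the demo runs against a temporary directory with a made-up record, so point it at your own data directory instead. It reports whether each merged file exists and which fields its first record has.

```python
import json
import tempfile
from pathlib import Path

# File names taken from the list of merged outputs above.
EXPECTED = ["abstracts-PubMed_metadata.jsonl", "s2orc-PubMed_metadata.jsonl"]


def inspect_outputs(data_dir):
    """Map each expected file name to the sorted field names of its first
    record, to [] if the file is empty, or to None if the file is missing."""
    report = {}
    for name in EXPECTED:
        path = Path(data_dir) / name
        if not path.is_file():
            report[name] = None
            continue
        with path.open(encoding="utf-8") as fh:
            first = fh.readline().strip()
        report[name] = sorted(json.loads(first)) if first else []
    return report


# Demo on a temporary directory holding one made-up record (hypothetical fields).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "s2orc-PubMed_metadata.jsonl"
    sample.write_text('{"title": "Paper B", "text": "Full text"}\n', encoding="utf-8")
    demo_report = inspect_outputs(tmp)

print(demo_report)
```

A value of None means the file was not found in that directory, which would suggest the merge step did not run or wrote its output somewhere else.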

JunHanStudy commented 5 months ago

Thanks! I ran download.sh and found these files. It is a great repo, especially since you also provide a way to obtain the source training data.

epfLLM / meditron

Cannot find article data in papers-PubMed.jsonl #25