epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0
1.77k stars, 159 forks

Cannot find article data in papers-PubMed.jsonl #25

Closed. JunHanStudy closed this issue 5 months ago.

JunHanStudy commented 6 months ago

It is great to see that you have built an open-source medical LLM with SOTA performance. When I ran "python load.py --dataset papers --key_path keys.json", it output papers-PubMed.jsonl, but I cannot find any paper text in this dataset, only some basic info for each article. Does anyone know what's wrong? Thank you!

JunHanStudy commented 6 months ago

Are the articles for papers-PubMed.jsonl in the s2orc_*.jsonl files? Thank you!

AGBonnet commented 5 months ago

Hi, thanks for your interest!

Running the first step of the PubMed pipeline should download three separate datasets from the Semantic Scholar API and then merge them together.

The three datasets should be downloaded:

They are then merged into two separate files:

Did you run the pipeline with download.sh, and can you locate these files?
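
One quick way to check which downloaded files hold article text and which hold only metadata is to list the field names in each JSONL file. The following is a minimal sketch, not part of the Meditron repository: the helper name summarize_jsonl and its max_lines parameter are hypothetical, and it assumes only that each file stores one JSON object per line.

```python
import json
from pathlib import Path


def summarize_jsonl(directory, max_lines=100):
    """Map each .jsonl file name in `directory` to the set of keys
    found in the JSON objects on its first `max_lines` lines.

    Hypothetical helper for illustration; not part of the Meditron code.
    """
    summary = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        keys = set()
        with path.open(encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                if index >= max_lines:
                    break  # only sample the start of large files
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                if isinstance(record, dict):
                    keys.update(record.keys())
        summary[path.name] = keys
    return summary
```

Calling summarize_jsonl on the directory that download.sh wrote to and printing the result shows, per file, which fields are present. A file whose records carry no full-text field would explain seeing only basic info for each article.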

JunHanStudy commented 5 months ago

Thanks! I ran download.sh and found these. It is a great repo as you also provided the way of getting source training data.