Closed JunHanStudy closed 5 months ago
Are the articles from papers-PubMed.jsonl contained in the s2orc_*.jsonl files? Thank you!
Hi, thanks for your interest!
Running the first step of the PubMed pipeline should download three separate datasets from the Semantic Scholar API and then merge them together.
The three datasets should be downloaded:
- pubmed: contains article metadata
- s2orc: contains full-text articles
- abstracts: contains abstracts for these articles
They are then merged into two separate files:
Did you run the pipeline with download.sh, and can you locate these files?
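If it helps to check what a given file actually holds, here is a minimal sketch (not part of the repo) that tallies the top-level keys in the first lines of a JSONL file. The function name summarize_jsonl_keys and the example path are assumptions for illustration; the real field names depend on what the pipeline writes.

```python
import json
from collections import Counter
from itertools import islice


def summarize_jsonl_keys(path, limit=100):
    """Count how often each top-level key appears in the first `limit` lines."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in islice(handle, limit):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record, dict):
                counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Hypothetical path: point this at whichever downloaded or merged file you want to inspect.
    for key, count in summarize_jsonl_keys("papers-PubMed.jsonl").most_common():
        print(f"{key}: {count}")
```

A file that only lists metadata-style keys and no body-text field is probably the metadata dataset rather than the full-text one.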
Thanks! I ran download.sh and found these. It is a great repo, as you also provide a way to obtain the source training data.
It is great to see that you have built an open-source medical LLM with SOTA performance. When I ran "python load.py --dataset papers --key_path keys.json", it output papers-PubMed.jsonl, but I cannot find any paper text in this dataset, only some basic information about each article. Does anyone know what's wrong? Thank you!