EleutherAI / the-pile

MIT License

PUBMED (biomedical abstracts) #29

Closed thoppe closed 3 years ago

thoppe commented 3 years ago

PUBMED comprises "more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books". The data are stored on a publicly accessible FTP server and are free for public use. The compressed XML is about 28GB, although much of that is boilerplate. Though the metadata is useful for other purposes, it looks like The-Pile™ would benefit most from the titles and abstracts only.

Example

<ArticleTitle>Side chain packing below the fusion peptide strongly modulates triggering of the Hendra virus F protein.</ArticleTitle>

<AbstractText> Triggering of the Hendra virus fusion (F) protein is required to initiate the conformational changes which drive membrane fusion, but the factors which control triggering remain poorly understood. Mutation of a histidine predicted to lie near the fusion peptide to alanine greatly reduced fusion despite wild-type cell surface expression levels, while asparagine substitution resulted in a moderate restoration in fusion levels. Slowed kinetics of six-helix bundle formation, as judged by sensitivity to heptad repeat B-derived peptides, was observed for all H372 mutants. These data suggest that side chain packing beneath the fusion peptide is an important regulator of Hendra virus F triggering. </AbstractText>
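To make the extraction concrete, here is a minimal sketch of pulling title/abstract pairs out of a PubMed baseline XML document using only the `ArticleTitle` and `AbstractText` tags shown above. The tag paths are illustrative, not the full PubMed DTD, and the sample document is a made-up stand-in.

```python
# Sketch: extract (title, abstract) pairs from PubMed baseline XML.
# Assumes the <ArticleTitle>/<AbstractText> structure shown above;
# real baseline files have many more tags per <PubmedArticle>.
import xml.etree.ElementTree as ET

def extract_records(xml_text):
    """Yield (title, abstract) pairs, skipping entries missing either."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        title = article.findtext(".//ArticleTitle")
        abstract = article.findtext(".//AbstractText")
        if title and abstract:
            yield title.strip(), abstract.strip()

sample = """<PubmedArticleSet><PubmedArticle>
<ArticleTitle>Example title.</ArticleTitle>
<Abstract><AbstractText> Example abstract text. </AbstractText></Abstract>
</PubmedArticle></PubmedArticleSet>"""

records = list(extract_records(sample))
```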

Note that this is different from PMC (PubMed Central). That database contains full text of recent articles. While PubMed abstracts are comprehensive, they do not contain the full text. I'll submit a separate request for that dataset.

I downloaded the entire set last year, so I already have a crawler ready to go.

thoppe commented 3 years ago

Looking at the source code so far, this is a somewhat different download from what is already in place. For the 2019 PubMed baseline there are about 1000 gzip-compressed files. Each file has a separate MD5 checksum, and the XML itself needs to be parsed down to the minimal text required.
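The per-file checksum step can be sketched like this: compute the MD5 of each downloaded `.gz` file in chunks and compare it against the published digest. The function names here are my own, not from the repo.

```python
# Sketch: verify one downloaded baseline file against its published
# MD5 checksum (each of the ~1000 files ships with one).
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_md5):
    """Compare a computed digest against the published checksum."""
    return md5_of_file(path) == expected_md5.lower()
```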

thoppe commented 3 years ago

Would it be useful if I did all the parsing and simply provided a zipped CSV or XML file?

StellaAthena commented 3 years ago

I agree that we would only/primarily be interested in title + abstract.

Do you know if there are any titles that lack an abstract in the dataset? If so, I think those should be avoided. It would probably also be nice to skip the ones that are in PMC.

> Would it be useful if I did all the parsing and simply provided a zipped CSV or XML file?

If you look at datasets.py you'll see what we are currently doing is pointing the code at an online archive of the data and then parsing it. If there is not a stable URL for the data currently, we can arrange to have it posted online. @leogao2 has done most of the processing work, but I believe it would be most convenient to have a tarball of the .txt files.

thoppe commented 3 years ago

Yes, there are absolutely abstracts without titles, and titles without abstracts. There is also a (very small) subset of non-English text in there too; fortunately, there is a language tag within the XML.
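The filtering this implies can be sketched as a single predicate: keep a record only if it has both a title and an abstract and its `<Language>` tag is English (PubMed uses the code `eng`). This is an illustrative sketch, not the actual processing code.

```python
# Sketch: keep only English records that have both title and abstract,
# using the <Language> tag mentioned above.
import xml.etree.ElementTree as ET

def keep_article(article):
    """True if the article has a title, an abstract, and is English."""
    title = article.findtext(".//ArticleTitle")
    abstract = article.findtext(".//AbstractText")
    language = article.findtext(".//Language")
    return bool(title) and bool(abstract) and language == "eng"

english = ET.fromstring(
    "<PubmedArticle><Language>eng</Language>"
    "<ArticleTitle>T</ArticleTitle>"
    "<Abstract><AbstractText>A</AbstractText></Abstract></PubmedArticle>"
)
french = ET.fromstring(
    "<PubmedArticle><Language>fre</Language>"
    "<ArticleTitle>T</ArticleTitle>"
    "<Abstract><AbstractText>A</AbstractText></Abstract></PubmedArticle>"
)
```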

As for a stable URL: the location changes every year, and there are about a thousand files to parse (and MD5-verify). I'm happy to do so, though, behind a stable URL like the ones I see within datasets.py. I see a line like:

git clone https://github.com/EleutherAI/pile_enron_emails .

which I could replicate, including all my parsing code. How does that sound? That way I can do the filtering (drop entries without titles, keep English only, remove metadata) and leave you with something nice.

StellaAthena commented 3 years ago

That sounds perfect! Welcome to EleutherAI 😄

thoppe commented 3 years ago

Thanks! I'm working on that now. How much metadata should be included, or should it be as close to "real text" as possible? For example, everything in PUBMED is indexed by a PMID number. Should this be removed? I'm guessing the entry should look like:

Side chain packing below the fusion peptide strongly modulates triggering of the Hendra virus F protein. Triggering of the Hendra virus fusion (F) protein is required to initiate the conformational changes which drive membrane fusion, but the factors which control triggering remain poorly understood. Mutation of a histidine predicted to lie near the fusion peptide to alanine greatly reduced fusion despite wild-type cell surface expression levels, while asparagine substitution resulted in a moderate restoration in fusion levels. Slowed kinetics of six-helix bundle formation, as judged by sensitivity to heptad repeat B-derived peptides, was observed for all H372 mutants. These data suggest that side chain packing beneath the fusion peptide is an important regulator of Hendra virus F triggering.
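The "real text" format above is just the title and abstract joined into one plain-text document with all metadata dropped. A trivial sketch (the function name is mine, not from the repo):

```python
def format_entry(title, abstract):
    """Join title and abstract into a single plain-text document,
    dropping all metadata (PMID, journal, dates, etc.)."""
    return f"{title.strip()} {abstract.strip()}"
```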

StellaAthena commented 3 years ago

It should all be real text. That sample looks good to me.

@leogao2 is the other major contributor, tagging him in case he has any guidance to provide.

thoppe commented 3 years ago

Completed and uploaded (link in the Discord). Reproduction code can be found here https://github.com/thoppe/The-Pile-PubMed

StellaAthena commented 3 years ago

Reopening as a reminder that the processing code needs to actually be put on the GitHub :)

thoppe commented 3 years ago

How do we want to manage this? The processing code is at https://github.com/thoppe/The-Pile-PubMed, and now includes a direct link to the data.

StellaAthena commented 3 years ago

We can either clone the repo into EleutherAI, or you can transfer ownership to the Org depending on if you want to keep a copy on your GitHub account.

To actually add the data to the Pile you need to add a PubMedDataset class to datasets.py and add the line (PubMedDataset(), 1.), to the list at line 10 of pile.py.
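A hypothetical sketch of the class shape this describes; the actual interface in datasets.py may differ, and the method names here (`name`, `documents`, `size`) are assumptions based on the description, not the repo's real API.

```python
# Hypothetical sketch of a Dataset class for datasets.py.
# Method names and return types are assumptions, not the real interface.
class PubMedDataset:
    def name(self):
        """Human-readable name used in the generated README table."""
        return "PubMed Abstracts"

    def documents(self):
        """Would stream the processed tarball of .txt files and yield
        one title-plus-abstract document at a time (stub here)."""
        yield from ()

    def size(self):
        """Total size in bytes of the processed text (placeholder)."""
        return 0
```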

thoppe commented 3 years ago

I'm happy to transfer ownership to the Org. I'll work on a PR to add it in. Should the README be adjusted as well?

StellaAthena commented 3 years ago

Yup! pile.py generates the table and automatically adjusts the percentages for you.

leogao2 commented 3 years ago

I'll make sure all the right stuff is updated from here on out