bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset pmc_article #121

Closed albertvillanova closed 2 years ago

albertvillanova commented 2 years ago

The PMC Article datasets are:

albertvillanova commented 2 years ago

Maybe we should create a Hub organization for all these datasets: pmc or pubmed_central.

yjernite commented 2 years ago

What's the difference between this and https://github.com/bigscience-workshop/data_tooling/issues/74?

albertvillanova commented 2 years ago

@yjernite:

lvwerra commented 2 years ago

This is done; the result is posted in #74.

Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_pmc

Sample:

{
    'text': "==== Front\nPLoS BiolPLoS BiolpbioplosbiolPLoS Biology1544-91731545-7885Public Library of Science San Francisco, USA 10.1371/journal.pbio.0000005Research ArticleGenetics/Genomics/Gene TherapyInfectious DiseasesMicrobiologyPlasmodiumThe Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum\n P. falciparum IDC TranscriptomeBozdech Zbynek ...",
    'meta': "{'pmid': 12929205}"
}
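
Note that `meta` in the sample is a string holding a dict literal, not a dict, so it has to be parsed before the `pmid` can be used. A minimal sketch of that, assuming every record looks like the sample above: the helper names `parse_meta` and `split_sections` are hypothetical, and the idea that the text is divided by marker lines starting with `==== ` is an assumption based only on the single `==== Front` marker visible in the sample.

```python
import ast

# Record shaped like the sample above; the text is shortened for illustration.
record = {
    "text": "==== Front\nPLoS Biol ...",
    "meta": "{'pmid': 12929205}",
}


def parse_meta(record):
    """Turn the stringified 'meta' field into a real dict (hypothetical helper)."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    meta = ast.literal_eval(record["meta"])
    if not isinstance(meta, dict):
        raise ValueError("expected 'meta' to encode a dict")
    return meta


def split_sections(text):
    """Group lines under '==== <Name>' marker lines (assumed convention)."""
    sections = {}
    current = None
    for line in text.split("\n"):
        if line.startswith("==== "):
            # Start a new section named after the marker, e.g. 'Front'.
            current = line[5:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


meta = parse_meta(record)
sections = split_sections(record["text"])
print(meta["pmid"])
print(list(sections))
```

Text that appears before the first marker line is dropped by this sketch; keep it under a separate key if that matters for your use.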
albertvillanova commented 2 years ago

Thanks @lvwerra.