🤗 Hugging Face x 🌸 BigScience: an initiative to create an open-source, community-built resource of LAM datasets.
BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions. Its members collaborate on projects across the natural language processing (NLP) space to broaden the accessibility of language datasets while working on challenging scientific questions around training language models.
We are running a datasets hackathon focused on making data from Libraries, Archives, and Museums (LAM) with potential machine learning applications accessible via the Hugging Face Hub. You might also know this field as 'GLAM': galleries, libraries, archives, and museums.
We are doing this to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine learning datasets more closely reflect the richness of human culture.
We aim to enable easy discovery and programmatic access to these datasets using Hugging Face's 🤗 Datasets Hub. As part of this, we want to:
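The programmatic access mentioned above can be sketched as follows. This is a minimal illustration, not an official snippet: the dataset id `biglam/example-dataset` is a placeholder (browse the BigLAM organization page on the Hub for real datasets), and the commented-out `load_dataset` call assumes the 🤗 `datasets` library is installed.

```python
# Sketch of programmatic access to BigLAM datasets on the Hugging Face Hub.
# NOTE: "biglam/example-dataset" is a hypothetical repository id used only
# for illustration.

def hub_dataset_url(repo_id: str) -> str:
    """Build the Hub page URL for a dataset repository id like 'biglam/<name>'."""
    return f"https://huggingface.co/datasets/{repo_id}"

# With the `datasets` library installed, a dataset published under the BigLAM
# organization could then be loaded in one line, e.g.:
#
#   from datasets import load_dataset
#   ds = load_dataset("biglam/<dataset-name>", split="train")

print(hub_dataset_url("biglam/example-dataset"))
```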
Some of the reasons we think that this effort is important:
There is growing interest in using language models with historical texts.[^histlms] Although collecting datasets for this purpose is not our only focus, we hope that some of the materials gathered during this sprint will be helpful in efforts to train language models on historical text data.
There are a few ways to contribute to the hackathon:
To join the hackathon, start by introducing yourself on our GitHub discussion board: https://github.com/bigscience-workshop/lam/discussions/19.
Once you have said hi on the discussion board, you should request to join the BigLAM Hugging Face organization.
For guidance, please check out the Wiki.
If you have questions:
We initially planned to run the hackathon until August 19th 2022; it has since been extended until the end of October 2022.
[^ai4lam]: See, for example, https://sites.google.com/view/ai4lam
[^cordell]: R. Cordell, 'Machine Learning + Libraries', LC Labs. Accessed: Mar. 28, 2021. [Online]. Available: https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf, p. 34.
[^histlms]: S. Schweter, L. März, K. Schmid, and E. Çano, 'hmBERT: Historical Multilingual Language Models for Named Entity Recognition', arXiv, abs/2205.15575, 2022; E. Manjavacas and L. Fonteyn, 'Adapting vs. Pre-training Language Models for Historical Languages', Journal of Data Mining & Digital Humanities, 2022.