Open clancyoftheoverflow opened 2 years ago
This looks amazing!
@clancyoftheoverflow @davanstrien This dataset is live at https://huggingface.co/datasets/shamikbose89/clmet_3_1
@shamikbose thanks, I'll aim to review this today or tomorrow. @clancyoftheoverflow you probably know this dataset better than me, so feel free to also review it.
A URL for this dataset
http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html
Dataset description
The Corpus of Late Modern English Texts, version 3.1 (CLMET3.1) is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET, CLMETEV, and CLMET3.0, and has been compiled following roughly the same principles, that is:
The corpus covers the period 1710–1920, divided into three 70-year sub-periods.
The texts making up the corpus have all been written by British and Irish authors who are native speakers of English.
The corpus never contains more than three texts by the same author.
The texts within each sub-period have been written by authors born within a correspondingly restricted sub-period.
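For anyone bucketing texts by date, the three 70-year sub-periods above can be sketched roughly as follows. This is only an illustration: the function name `clmet_sub_period` is hypothetical, and assigning the boundary years 1780 and 1850 to the later sub-period is an assumption, not something the corpus description states.

```python
def clmet_sub_period(year: int) -> tuple[int, int]:
    """Return the (start, end) years of the 70-year sub-period containing `year`.

    Assumption: a boundary year (1780, 1850) is placed in the later sub-period;
    the final year, 1920, stays in the last sub-period.
    """
    start, end, span = 1710, 1920, 70
    if not start <= year <= end:
        raise ValueError(f"{year} is outside the corpus range {start}-{end}")
    # Integer division picks the bucket; clamp so 1920 falls in the last one.
    index = min((year - start) // span, 2)
    lower = start + index * span
    return (lower, lower + span)


if __name__ == "__main__":
    for y in (1710, 1779, 1780, 1850, 1920):
        print(y, clmet_sub_period(y))
```

The description also restricts authors' birth dates per sub-period, which this sketch does not model.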
Size: 34 million words
Annotation: PoS-tagged; genre.
Dataset modality
Text
Dataset licence
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Other licence
No response
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
Contact details for data custodian
No response