Open ymaurer opened 2 years ago
I lightly transformed the original dataset into JSONL and zipped the images:
https://huggingface.co/ymaurer/bnl_ground_truth_newspapers_before_1878
Moved the dataset to the biglam organisation biglam/bnl_ground_truth_newspapers_before_1878
I think this was created as a model repository, so I've moved it to a dataset repository. It could also be worth writing a loading script so the data is easier to load with the datasets library. I'll hopefully have some time to help with that later this week.
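Until a loading script exists, something like the following could pair each transcription with its image. This is only a rough sketch using the Python standard library. It assumes every JSONL row names an image file inside the zip archive and carries its transcription. The function name `iter_text_lines`, the key names `image` and `text`, and the file names in the usage note are all hypothetical, so check the actual files before relying on them.

```python
import json
import zipfile


def iter_text_lines(jsonl_path, zip_path, image_key="image", text_key="text"):
    """Yield (image_bytes, transcription) pairs.

    Assumption: each JSONL row stores the image's path inside the zip
    under `image_key` and the transcribed line under `text_key`.
    Both key names are placeholders, not confirmed by the dataset.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with open(jsonl_path, encoding="utf-8") as rows:
            for row in rows:
                row = row.strip()
                if not row:
                    continue  # tolerate blank lines in the JSONL file
                record = json.loads(row)
                # Read the raw image bytes straight from the archive.
                yield archive.read(record[image_key]), record[text_key]
```

Usage would look like `for image_bytes, text in iter_text_lines("lines.jsonl", "images.zip"): ...`, with both file names replaced by the real ones.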
A URL for this dataset
https://data.bnl.lu/data/historical-newspapers/
Dataset description
33,000 transcribed text lines from historical newspapers (before 1878), along with the cropped images of the original scans.
Text line based OCR
19,000 text lines in Antiqua
14,000 text lines in Fraktur
Transcribed using double-keying (99.95% accuracy)
Public Domain, CC0 (See copyright notice)
Best for training an OCR engine
The newspapers used are:
Dataset modality
Mixed
Dataset licence
Creative Commons Public Domain Dedication and Certification
Other licence
No response
How can you access this data
As a download from a repository/website
Size of dataset
500MB-2GB
Confirm the dataset has an open licence
Contact details for data custodian
opendata@bnl.etat.lu