bigscience-workshop / lam

Libraries, Archives and Museums (LAM)
Apache License 2.0

bnl_ground_truth_newspapers_before_1878 #79

Open ymaurer opened 2 years ago

ymaurer commented 2 years ago

A URL for this dataset

https://data.bnl.lu/data/historical-newspapers/

Dataset description

33,000 transcribed text lines from historical newspapers (before 1878), along with the cropped images of the original scans.

- Text line based OCR
- 19,000 text lines in Antiqua
- 14,000 text lines in Fraktur
- Transcribed using double-keying (99.95% accuracy)
- Public Domain, CC0 (See copyright notice)
- Best for training an OCR engine

The newspapers used are:

Dataset modality

Mixed

Dataset licence

Creative Commons Public Domain Dedication and Certification

Other licence

No response

How can you access this data

As a download from a repository/website

Size of dataset

500MB-2GB

Confirm the dataset has an open licence

Contact details for data custodian

opendata@bnl.etat.lu

ymaurer commented 2 years ago

I converted the original dataset to JSONL, with minor changes, and zipped the images:

https://huggingface.co/ymaurer/bnl_ground_truth_newspapers_before_1878

ymaurer commented 2 years ago

Moved the dataset to the biglam organisation biglam/bnl_ground_truth_newspapers_before_1878

davanstrien commented 2 years ago

> Moved the dataset to the biglam organisation biglam/bnl_ground_truth_newspapers_before_1878

I think this got created as a model repository, so I've just moved it to a dataset repository. It could also be good to write a loading script for this, to make the data easier to load with the `datasets` library. I'll hopefully have some time to help with that later this week.
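For anyone who wants to experiment before a loading script exists, here is a rough sketch of the core of one: a generator that pairs each JSONL record with its image in the zip archive. The thread does not describe the file layout, so the field names `filename` and `text`, and the assumption that each record names a member of the zip, are hypothetical placeholders rather than the dataset's actual schema.

```python
import json
import zipfile
from typing import Dict, Iterator, Tuple


def generate_examples(
    jsonl_path: str,
    images_zip_path: str,
    image_key: str = "filename",  # hypothetical field name, adjust to the real schema
    text_key: str = "text",       # hypothetical field name, adjust to the real schema
) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (line_index, example) pairs, one per transcribed text line."""
    with zipfile.ZipFile(images_zip_path) as archive, open(
        jsonl_path, encoding="utf-8"
    ) as handle:
        members = set(archive.namelist())
        for index, line in enumerate(handle):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            name = record[image_key]
            if name not in members:
                # Fail loudly rather than silently dropping a transcription.
                raise KeyError(f"{name!r} is in the JSONL but missing from the zip")
            yield index, {
                "image_bytes": archive.read(name),  # raw bytes of the cropped line image
                "text": record[text_key],
            }
```

In a `datasets` loading script, logic along these lines would typically live in the `_generate_examples` method of a `datasets.GeneratorBasedBuilder` subclass, with the two paths supplied by the download manager. The actual script may well be structured differently.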