bigscience-workshop / lam

Libraries, Archives and Museums (LAM)

Add dataset: early_printed_books_font_detection #45

Open davanstrien opened 2 years ago

davanstrien commented 2 years ago

A URL for this dataset

https://zenodo.org/record/3366686

Dataset description

This dataset consists of photos, at various resolutions, of 35,623 pages of printed books dating from the 15th to the 18th century. Experts have assigned each page one to five labels corresponding to the font groups used in the text, with two extra classes for non-textual content and for fonts not present in the following list: Antiqua, Bastarda, Fraktur, Gotico Antiqua, Greek, Hebrew, Italic, Rotunda, Schwabacher, and Textura.

This is an image classification dataset that could also benefit downstream tasks such as OCR.
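Since each page can carry one to five labels, this is a multi-label rather than single-label classification problem. A minimal sketch of how one page's labels might be turned into a multi-hot vector is below. The names `not_a_font` and `other_font` for the two extra classes, and the helper `encode_labels`, are assumptions made for illustration and are not taken from the dataset files.

```python
# Font groups as listed in the dataset description above.
FONT_GROUPS = [
    "Antiqua", "Bastarda", "Fraktur", "Gotico Antiqua", "Greek",
    "Hebrew", "Italic", "Rotunda", "Schwabacher", "Textura",
]
# Hypothetical names for the two extra classes (non-textual content,
# fonts not in the list); the real label strings may differ.
EXTRA_CLASSES = ["not_a_font", "other_font"]

CLASSES = FONT_GROUPS + EXTRA_CLASSES
CLASS_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}


def encode_labels(labels):
    """Return a multi-hot list of 0/1 values for one page's labels."""
    if not 1 <= len(labels) <= 5:
        raise ValueError("each page carries between one and five labels")
    vector = [0] * len(CLASSES)
    for label in labels:
        # An unknown label raises KeyError rather than being dropped.
        vector[CLASS_TO_INDEX[label]] = 1
    return vector


# A page set in both Fraktur and Schwabacher.
page_vector = encode_labels(["Fraktur", "Schwabacher"])
```

A vector of this shape is what a sigmoid-per-class model would typically train against, as opposed to a single softmax over classes.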

A related paper: "Dataset of Pages from Early Printed Books with Multiple Font Groups"

Dataset modality

Image

Dataset licence

Creative Commons Attribution Non Commercial Share Alike 4.0 International

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

Contact details for data custodian

No response

davanstrien commented 2 years ago

This dataset should be fairly easy to add to the datasets hub, but be aware that it is quite large.

davanstrien commented 2 years ago

self-assign