bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset british_library_heritage_made_digital_newspapers #232

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
cakiki commented 2 years ago

self-assign

davanstrien commented 2 years ago

@cakiki give me a shout if you want any help with this? I am quite familiar with this dataset :)

cakiki commented 2 years ago

@davanstrien You've already helped a lot with your script, which I used to download all the data. I'm currently uploading all the .zip files to the Hub, which will probably take a while.

(For the record, the download script is: https://github.com/Living-with-machines/hmd_newspaper_dl)
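
For context, one possible way to push each archive to the Hub with huggingface_hub (the local file name below is a placeholder, not the actual upload code used here):

from huggingface_hub import HfApi

api = HfApi()
# Upload a single ZIP to the dataset repo; repeat (or loop) per archive.
api.upload_file(
    path_or_fileobj="hmd_newspapers/archive_0001.zip",  # placeholder local path
    path_in_repo="archive_0001.zip",
    repo_id="bigscience-catalogue-data/british_library_heritage_made_digital_newspapers",
    repo_type="dataset",
)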

albertvillanova commented 2 years ago

https://huggingface.co/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers

cakiki commented 2 years ago

Done

albertvillanova commented 2 years ago

Thanks a lot @cakiki!!!

I just left a comment to address this issue later:

This dataset takes too long to load because of the data format inference. This is due to the ZIP compression and could be fixed if the files were compressed with gzip instead.

from datasets import load_dataset

ds_name = "bigscience-catalogue-data/british_library_heritage_made_digital_newspapers"
ds = load_dataset(ds_name, split="train", streaming=True, use_auth_token=True)

@lhoestq, maybe we should warn about this in the docs?

cakiki commented 2 years ago

Dataset came zipped. Should I convert everything to gzip?

Side question: what compression level would you recommend?
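
If recompression were needed, a minimal sketch of turning each ZIP member into an individual .gz file (paths are placeholders; gzip's default level is 9, while 6 is a common size/speed trade-off):

import gzip
import zipfile
from pathlib import Path

def zip_to_gzip(zip_path: str, out_dir: str, compresslevel: int = 6) -> None:
    # Re-compress every member of the archive as a standalone .gz file.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            target = out / (Path(name).name + ".gz")
            with zf.open(name) as src, gzip.open(target, "wb", compresslevel=compresslevel) as dst:
                dst.write(src.read())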

lhoestq commented 2 years ago

The dataset looks fine as ZIP; maybe we could optimize the data format inference so that it doesn't have to iterate over every single ZIP file. We could decide on a maximum number of files (possibly inside archives) to check, for example? WDYT @albertvillanova?
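
Purely as an illustration of that idea (not the actual datasets internals; the names below are hypothetical):

from itertools import islice
from pathlib import Path

MAX_FILES_TO_CHECK = 100  # hypothetical cap on files inspected during inference

def infer_data_format(data_files, max_files=MAX_FILES_TO_CHECK):
    # Guess the data format from at most `max_files` file names instead of iterating over all of them.
    for path in islice(data_files, max_files):
        suffix = Path(path).suffix.lstrip(".").lower()
        if suffix:
            return suffix
    return None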

albertvillanova commented 2 years ago

PR to fix the issue of taking too long to iterate over all data files:

albertvillanova commented 2 years ago

Need support for ZIP:

from datasets import load_dataset

ds = load_dataset("bigscience-catalogue-data/british_library_heritage_made_digital_newspapers", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
albertvillanova commented 2 years ago

ERROR:


FileNotFoundError: Couldn't find a dataset script at huggingface/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers/british_library_heritage_made_digital_newspapers.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/british_library_heritage_made_digital_newspapers' on the Hugging Face Hub either: FileNotFoundError: No data files or dataset script found in bigscience-catalogue-data/british_library_heritage_made_digital_newspapers
albertvillanova commented 2 years ago

I think the loading script should parse the XML files.

CC: @davanstrien
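
As a rough sketch of such parsing (assuming the HMD files are ALTO-style XML with word tokens stored as <String CONTENT="..."> elements; this is not the WIP script mentioned below):

import xml.etree.ElementTree as ET

def alto_to_text(xml_path: str) -> str:
    # Join the CONTENT attribute of every String element, regardless of the ALTO namespace version.
    root = ET.parse(xml_path).getroot()
    words = [el.get("CONTENT", "") for el in root.findall(".//{*}String")]
    return " ".join(w for w in words if w)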

davanstrien commented 2 years ago

I think the loading script should parse the XML files.

CC: @davanstrien

I have a WIP script for this. If it's helpful, I can share it. I am also working with some colleagues to get a plain-text version of this dataset onto the BL repository, but that will take a bit longer to be ready.

albertvillanova commented 2 years ago

Great @davanstrien !

You can do as you prefer... Maybe the fastest option would be to get the script (so the data is available internally for the BigScience project). Eventually you could make the script publicly available, either as a community dataset (in your org) or as a canonical dataset (by opening a Pull Request in the library)...

davanstrien commented 2 years ago

Great - I will try to get the script finished today for use in BigScience. I might then hold off on a public script until we have the plain-text version of the data available, since that will be quicker to parse.

davanstrien commented 2 years ago

@albertvillanova, sorry this took a bit longer. I did write a loading script, but because the XML processing is relatively slow for this data, the loading script was very slow and I think it would cause issues. I therefore pre-processed the data to extract the plain text and some minimal metadata. This is currently pushed to my HF Hub (https://huggingface.co/datasets/davanstrien/hmd_newspapers).

Currently, each row represents an article in the newspaper. Since articles are detected by an imperfect OCR segmentation tool run on the digitised image, they are not always semantically meaningful; in particular, this can lead to very short or very long articles. That could be dealt with quite easily later on, but I could also push a page-level version of the data if that would be more efficient for training (each example's text would be much longer).
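
For illustration, filtering out extreme article lengths later on could look like this (the thresholds and the "text" column name are assumptions):

from datasets import load_dataset

ds = load_dataset("davanstrien/hmd_newspapers", split="train")
# Keep articles between 100 and 50,000 characters; both bounds are arbitrary placeholders.
ds = ds.filter(lambda example: 100 <= len(example["text"]) <= 50_000)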

Either way, if you are happy with either of these approaches, I can transfer the dataset from my hub to the BigScience space.