bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset british_library_heritage_made_digital_newspapers #232

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
cakiki commented 2 years ago

self-assign

davanstrien commented 2 years ago

@cakiki give me a shout if you want any help with this? I am quite familiar with this dataset :)

cakiki commented 2 years ago

@davanstrien You've already helped a lot with your script, which I used to download all the data. I'm currently uploading all the .zip files to the Hub, which will probably take a while.

(For the record, the download script is: https://github.com/Living-with-machines/hmd_newspaper_dl)
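
For context, one possible way to push each archive to the Hub with huggingface_hub (the local file name below is a placeholder, not the actual upload code used here):

from huggingface_hub import HfApi

api = HfApi()
# Upload a single ZIP to the dataset repo; repeat (or loop) per archive.
api.upload_file(
    path_or_fileobj="hmd_newspapers/archive_0001.zip",  # placeholder local path
    path_in_repo="archive_0001.zip",
    repo_id="bigscience-catalogue-data/british_library_heritage_made_digital_newspapers",
    repo_type="dataset",
)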

albertvillanova commented 2 years ago

https://huggingface.co/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers

cakiki commented 2 years ago

Done

albertvillanova commented 2 years ago

Thanks a lot @cakiki!!!

I just left a comment to address this issue later:

This dataset takes too long to load because of the data format inference. This is due to the ZIP compression and could be fixed if the files were compressed with gzip instead.

from datasets import load_dataset

ds_name = "bigscience-catalogue-data/british_library_heritage_made_digital_newspapers"
ds = load_dataset(ds_name, split="train", streaming=True, use_auth_token=True)

@lhoestq, maybe we should warn about this in the docs?

cakiki commented 2 years ago

Dataset came zipped. Should I convert everything to gzip?

Side question: what compression level would you recommend?
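
If recompression were needed, a minimal sketch of turning each ZIP member into an individual .gz file (paths are placeholders; gzip's default level is 9, while 6 is a common size/speed trade-off):

import gzip
import zipfile
from pathlib import Path

def zip_to_gzip(zip_path: str, out_dir: str, compresslevel: int = 6) -> None:
    # Re-compress every member of the archive as a standalone .gz file.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            target = out / (Path(name).name + ".gz")
            with zf.open(name) as src, gzip.open(target, "wb", compresslevel=compresslevel) as dst:
                dst.write(src.read())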

lhoestq commented 2 years ago

The dataset looks fine as ZIP; maybe we could optimize the data format inference so that it doesn't have to iterate over every single ZIP file. We could decide on a maximum number of files (possibly inside archives) to check, for example? WDYT @albertvillanova?
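
Purely as an illustration of that idea (not the actual datasets internals; the names below are hypothetical):

from itertools import islice
from pathlib import Path

MAX_FILES_TO_CHECK = 100  # hypothetical cap on files inspected during inference

def infer_data_format(data_files, max_files=MAX_FILES_TO_CHECK):
    # Guess the data format from at most `max_files` file names instead of iterating over all of them.
    for path in islice(data_files, max_files):
        suffix = Path(path).suffix.lstrip(".").lower()
        if suffix:
            return suffix
    return None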

albertvillanova commented 2 years ago

PR to fix the issue of taking too long to iterate over all data files:

albertvillanova commented 2 years ago

Need support for ZIP:

from datasets import load_dataset

ds = load_dataset("bigscience-catalogue-data/british_library_heritage_made_digital_newspapers", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
albertvillanova commented 2 years ago

ERROR:


FileNotFoundError: Couldn't find a dataset script at huggingface/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers/british_library_heritage_made_digital_newspapers.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/british_library_heritage_made_digital_newspapers' on the Hugging Face Hub either: FileNotFoundError: No data files or dataset script found in bigscience-catalogue-data/british_library_heritage_made_digital_newspapers
albertvillanova commented 2 years ago

I think the loading script should parse the XML files.

CC: @davanstrien
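
As a rough sketch of such parsing (assuming the HMD files are ALTO-style XML with word tokens stored as <String CONTENT="..."> elements; this is not the WIP script mentioned below):

import xml.etree.ElementTree as ET

def alto_to_text(xml_path: str) -> str:
    # Join the CONTENT attribute of every String element, regardless of the ALTO namespace version.
    root = ET.parse(xml_path).getroot()
    words = [el.get("CONTENT", "") for el in root.findall(".//{*}String")]
    return " ".join(w for w in words if w)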

davanstrien commented 2 years ago

I think the loading script should parse the XML files.

CC: @davanstrien

I have a WIP script for this. If it's helpful, I can share it. I am also working with some colleagues to get a plain-text version of this dataset onto the BL repository, but that will take a bit longer to be ready.

albertvillanova commented 2 years ago

Great @davanstrien !

You can do as you prefer... Maybe the fastest option would be to get the script (so the data is available internally for the BigScience project). Eventually you could make the script publicly available, either as a community dataset (in your org) or as a canonical dataset (by opening a Pull Request in the library)...

davanstrien commented 2 years ago

Great - I will try to get the script finished today for use in BigScience. I might then hold off on a public script until we have the plain-text version of the data available, since that will be quicker to parse.

davanstrien commented 2 years ago

@albertvillanova, sorry this took a bit longer. I did write a loading script, but because the XML processing is relatively slow for this data, the loading script was very slow and I think it would cause issues. I therefore pre-processed the data to extract the plain text and some minimal metadata. This is currently pushed to my HF Hub (https://huggingface.co/datasets/davanstrien/hmd_newspapers).

Currently, each row represents an article in the newspaper. Since articles are detected by an imperfect OCR segmentation tool run on the digitised image, they are not always semantically meaningful; in particular, this can lead to very short or very long articles. That could be dealt with quite easily later on, but I could also push a page-level version of the data if that would be more efficient for training (each example's text would be much longer).
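
For illustration, filtering out extreme article lengths later on could look like this (the thresholds and the "text" column name are assumptions):

from datasets import load_dataset

ds = load_dataset("davanstrien/hmd_newspapers", split="train")
# Keep articles between 100 and 50,000 characters; both bounds are arbitrary placeholders.
ds = ds.filter(lambda example: 100 <= len(example["text"]) <= 50_000)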

Either way, if you are happy with either of these approaches, I can transfer the dataset from my hub to the BigScience space.