Open albertvillanova opened 2 years ago
@cakiki give me a shout if you want any help with this? I am quite familiar with this dataset :)
@davanstrien You've already helped a lot with your script which I used to download all the data. I'm currently uploading all the .zip files to the hub which will probably take a while.
(For the record the download script is the following: https://github.com/Living-with-machines/hmd_newspaper_dl)
Done
Thanks a lot @cakiki!!!
I just left a comment to address this issue later:
This dataset takes too long to load because of the data format inference. This is due to the zip compression and should be fixed if the data is compressed with gzip instead.
from datasets import load_dataset

ds_name = "bigscience-catalogue-data/british_library_heritage_made_digital_newspapers"
ds = load_dataset(ds_name, split="train", streaming=True, use_auth_token=True)
@lhoestq, maybe we should warn about this in the docs?
Dataset came zipped. Should I convert everything to gzip?
Side question: what compression level would you recommend?
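For reference, if a conversion were wanted, it could look roughly like the sketch below. It uses only the Python standard library; the function name `zip_to_gzip`, the flattened output naming, and the default level of 6 are assumptions made for illustration, not a recommendation from this thread. Lower levels compress faster and higher levels produce smaller files (Python's `gzip` module itself defaults to 9).

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_to_gzip(zip_path, out_dir, compresslevel=6):
    """Write each file inside `zip_path` as a separate .gz file in `out_dir`.

    Hypothetical helper: the naming scheme and default level are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Flatten nested member paths so every output lands directly in out_dir.
            target = out_dir / (info.filename.replace("/", "__") + ".gz")
            # Stream the member into the gzip file without loading it fully in memory.
            with archive.open(info) as src, gzip.open(
                target, "wb", compresslevel=compresslevel
            ) as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```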
The dataset looks fine as ZIP, maybe we could optimize the data format inference so that it doesn't have to iterate over each single zip file. We can decide on a maximum number of files (possibly inside archives) to check for example ? WDYT @albertvillanova ?
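A minimal sketch of the "maximum number of files" idea, to make the proposal concrete. This is not how the library implements format inference; `infer_format`, `EXTENSION_TO_FORMAT`, and the default `max_files=10` are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from itertools import islice
from pathlib import PurePosixPath

# Hypothetical extension-to-format mapping, for illustration only.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".xml": "xml",
}


def infer_format(file_names, max_files=10):
    """Guess the data format by looking at no more than `max_files` names.

    `file_names` can be a lazy iterator (for example over archive members),
    so files beyond the cap are never listed or opened.
    """
    counts = Counter()
    for name in islice(file_names, max_files):
        suffix = PurePosixPath(name).suffix.lower()
        fmt = EXTENSION_TO_FORMAT.get(suffix)
        if fmt is not None:
            counts[fmt] += 1
    if not counts:
        return None
    # Return the most frequent format among the inspected files.
    return counts.most_common(1)[0][0]
```

The point of taking an iterator is that the cost is bounded by `max_files` rather than by the total number of files inside all the archives.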
PR to fix the issue of taking too long to iterate over all data files:
Need support for ZIP:
ds = load_dataset("bigscience-catalogue-data/british_library_heritage_made_digital_newspapers", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
ERROR:
FileNotFoundError: Couldn't find a dataset script at huggingface/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers/british_library_heritage_made_digital_newspapers.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/british_library_heritage_made_digital_newspapers' on the Hugging Face Hub either: FileNotFoundError: No data files or dataset script found in bigscience-catalogue-data/british_library_heritage_made_digital_newspapers
I think the loading script should parse the XML files.
CC: @davanstrien
I have a WIP script I have been working on for this. If it's helpful, I can share that? I am also working with some colleagues to get a plain text version of this dataset on the BL repository, but that will still take a bit longer to get ready.
Great @davanstrien !
You can do as you prefer... Maybe the fastest would be to get the script (to have the data available internally for the BigScience project). Eventually you could make the script publicly available either in a community dataset (in your org) or as a canonical dataset (opening a Pull Request in the library)...
Great - I will try and get the script finished today for use in BigScience. I might then hold off with a public script until we have the plain text version of the data available since that will be quicker to parse.
@albertvillanova, sorry this took a bit longer. I did write a loading script, but because the XML processing is relatively slow for this data, the loading script was very slow, and I think it would cause issues. I, therefore, pre-processed the data to extract the plain text and some minimal metadata. This is currently pushed to my HF hub (https://huggingface.co/datasets/davanstrien/hmd_newspapers)
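To illustrate the kind of pre-processing described above, here is a rough sketch of pulling plain text out of an XML page. This is not the actual script used. It assumes, purely for illustration, ALTO-style markup in which each `TextLine` element holds `String` children with a `CONTENT` attribute; the thread does not state the real schema, and `extract_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_string):
    """Return the page text, one output line per TextLine element.

    Assumes ALTO-style XML (TextLine > String[@CONTENT]); this schema is an
    assumption for the sketch, not something confirmed in the thread.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if _local(elem.tag) != "TextLine":
            continue
        words = [
            child.attrib["CONTENT"]
            for child in elem
            if _local(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)
```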
Currently, each row represents an article in the newspaper. Since this is detected by an imperfect OCR segmentation tool from the digitised image, these articles are not always semantically meaningful. In particular, it can lead to very short or long articles. This could be dealt with quite easily later on, but I could also push a version of the data at the page level if this will be more efficient to use for the training (the lengths of the texts will be much longer for each example).
Either way, if you are happy with either of these approaches, I can transfer the dataset from my hub to the BigScience space.
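As a sketch of how the very short or very long articles could be handled later, a simple length filter might be enough. The column name `text` and both thresholds below are placeholders chosen for illustration, not properties of the actual dataset.

```python
def keep_by_length(rows, min_chars=200, max_chars=20000, text_key="text"):
    """Keep rows whose text length falls within [min_chars, max_chars].

    `rows` is any iterable of dicts; the thresholds and the `text_key`
    column name are hypothetical and would need tuning on the real data.
    """
    return [row for row in rows if min_chars <= len(row[text_key]) <= max_chars]
```

The same predicate could be passed to a dataset's filter step instead of being applied to a list of dicts.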
availability:
license_text: No Copyright - Other Known Legal Restrictions
Use of this Item is not restricted by copyright and/or related rights. In one or more jurisdictions, laws other than copyright are known to impose restrictions on the use of this Item. Please refer to the organization that has made the Item available for more information.
Notices
Unless expressly stated otherwise, the organization that has made this Item available makes no warranties about the Item and cannot guarantee the accuracy of this Rights Statement. You are responsible for your own use. You may find additional information about the copyright status of the Item on the website of the organization that has made the Item available. You may need to obtain other permissions for your intended use. For example, other rights such as publicity, privacy or moral rights may limit how you may use the material.
DISCLAIMER The purpose of this statement is to help the public understand how this Item may be used. When there is a (non-standard) License or contract that governs re-use of the associated Item, this statement only summarizes the effects of some of its terms. It is not a License, and should not be used to license your Work. To license your own Work, use a License offered at https://creativecommons.org/