Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers
https://living-with-machines.github.io/T-Res/

Upload NER and ED formatted topres19th data #243

Closed mcollardanuy closed 1 year ago

kallewesterling commented 1 year ago

Were you thinking BL repo for this as well, @mcollardanuy ? I'm making sure that I'm keeping a list of datasets in progress for now!

mcollardanuy commented 1 year ago

Hi @kallewesterling, I was undecided between BL repo and Huggingface. The first one seems easier.

kallewesterling commented 1 year ago

I'd agree with that, especially if we start the process with the other datasets. We might as well keep 'em coming! :)

mcollardanuy commented 1 year ago

Also, given that the original dataset from which these are derived is on the BL repo already: https://bl.iro.bl.uk/concern/datasets/f3686eb9-4227-45cb-9acb-0453d35e6a03

mcollardanuy commented 1 year ago

Datasets for toponym recognition and disambiguation for nineteenth-century English newspapers

Description

We present datasets for the tasks of toponym recognition and toponym disambiguation, which are derived from "Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.23636/r7d4-kw08). The toponym recognition dataset consists of two JSON files (ner_fine_train.json and ner_fine_dev.json), whereas the toponym disambiguation dataset is provided as a TSV file (linking_df_split.tsv).

Toponym recognition dataset

The toponym recognition dataset can be used to train a named entity recognition model (focusing on toponyms). The data is provided as two JSON files, one for training (5216 examples) and one for development (1304 examples), both in the JSON Lines format, where each line corresponds to a sentence.

Each sentence is a dictionary with three key-value pairs: id (a sentence identifier, consisting of the article number followed by an underscore and the sentence number), tokens (the list of tokens into which the sentence has been split), and ner_tags (the list of annotations per token, in the BIO format). The tokens and ner_tags lists should therefore always have the same length. Below is an example of three lines from one of the JSON files, representing three annotated sentences:

  {"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]}
  {"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","'","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]}
  {"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",",","every","morning","at","a","quarter","before","6",",","Sundays","excepted","."]}
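As a minimal sketch of how such a file could be read, the snippet below parses JSON Lines input and checks that tokens and ner_tags have the same length. The function name read_jsonl is hypothetical and not part of T-Res; the sample line is the third example above.

```python
import json


def read_jsonl(lines):
    """Parse JSON Lines input: one JSON object (one sentence) per non-empty line.

    `lines` can be any iterable of strings, for example an open file handle.
    (Hypothetical helper for illustration, not part of T-Res.)
    """
    sentences = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        sentence = json.loads(line)
        # The README states that tokens and ner_tags are aligned one-to-one.
        if len(sentence["tokens"]) != len(sentence["ner_tags"]):
            raise ValueError(
                f"line {line_number}: tokens and ner_tags differ in length"
            )
        sentences.append(sentence)
    return sentences


sample = [
    '{"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O",'
    '"O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",'
    '",","every","morning","at","a","quarter","before","6",",","Sundays",'
    '"excepted","."]}',
]
sentences = read_jsonl(sample)

# Collect every token whose tag is not "O", i.e. tokens inside an entity span.
toponyms = [
    token
    for token, tag in zip(sentences[0]["tokens"], sentences[0]["ner_tags"])
    if tag != "O"
]
print(toponyms)
```

To read a real file, the same function could be called on an open handle, for example `with open("ner_fine_train.json", encoding="utf-8") as f: read_jsonl(f)`.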

The dataset is derived from the training set of an existing dataset ("Dataset for Toponym Resolution in Nineteenth-Century English Newspapers", DOI: https://doi.org/10.23636/r7d4-kw08), which is randomly split (at the sentence level and with a ratio of 0.8/0.2) into training and development. You can find more information about the original dataset in the paper "A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.5334/johd.56).
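A sentence-level random split with a 0.8/0.2 ratio could look roughly like the sketch below. The function name split_sentences, the seed value, and the shuffle-then-cut approach are assumptions for illustration; the README does not say how the split was actually produced. The placeholder list has 6520 items because that is the sum of the two reported file sizes (5216 + 1304).

```python
import random


def split_sentences(sentences, train_ratio=0.8, seed=42):
    """Shuffle sentences and cut them into (train, dev) at the given ratio.

    Hypothetical helper: the seed and the exact procedure are assumptions.
    """
    shuffled = list(sentences)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Placeholder sentences standing in for the real annotated data.
sentences = [{"id": f"0_{i}"} for i in range(6520)]
train, dev = split_sentences(sentences)
print(len(train), len(dev))  # 5216 1304
```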

For example, sentence 19 in file 1218_Poole.tsv is originally:

#Text=Cary, of Ramsey, Hants, sent in a tender offering £lO more than Mr.
19-1    708-712 Cary    _   _   
19-2    712-713 ,   _   _   
19-3    714-716 of  _   _   
19-4    717-723 Ramsey  https://en.wikipedia.org/wiki/Romsey    LOC 
19-5    723-724 ,   _   _   
19-6    725-730 Hants   https://en.wikipedia.org/wiki/Hampshire LOC 
19-7    730-731 ,   _   _   
19-8    732-736 sent    _   _   
19-9    737-739 in  _   _   
19-10   740-741 a   _   _   
19-11   742-748 tender  _   _   
19-12   749-757 offering    _   _   
19-13   758-759 £   _   _   
19-14   759-761 lO  _   _   
19-15   762-766 more    _   _   
19-16   767-771 than    _   _   
19-17   772-774 Mr  _   _   
19-18   774-775 .   _   _   

It is converted into the following format:

{"id":"1218_19","ner_tags":["O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["Cary",",","of","Ramsey",",","Hants",",","sent","in","a","tender","offering","\u00a3","lO","more","than","Mr","."]}
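The conversion can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the function name rows_to_bio is hypothetical, each row is assumed to hold five tab-separated fields (token id, character span, token, link, label), and consecutive tokens sharing the same link and label are assumed to belong to one entity. The original files may mark multi-token entities differently.

```python
import json


def rows_to_bio(article_id, sentence_number, rows):
    """Turn the token rows of one sentence into a BIO-tagged record.

    Hypothetical helper. Assumes five tab-separated fields per row and
    "_" as the marker for an unannotated token.
    """
    tokens, ner_tags = [], []
    previous = None  # (link, label) of the previous token, None if unannotated
    for row in rows:
        fields = row.rstrip("\t\n").split("\t")
        _token_id, _span, token, link, label = fields[:5]
        if label == "_":
            tag, current = "O", None
        else:
            current = (link, label)
            # Simplifying assumption: an entity continues while the link and
            # label stay the same on adjacent tokens.
            prefix = "I" if current == previous else "B"
            tag = f"{prefix}-{label}"
        tokens.append(token)
        ner_tags.append(tag)
        previous = current
    return {
        "id": f"{article_id}_{sentence_number}",
        "ner_tags": ner_tags,
        "tokens": tokens,
    }


# First six tokens of sentence 19 in 1218_Poole.tsv, as shown above.
rows = [
    "19-1\t708-712\tCary\t_\t_",
    "19-2\t712-713\t,\t_\t_",
    "19-3\t714-716\tof\t_\t_",
    "19-4\t717-723\tRamsey\thttps://en.wikipedia.org/wiki/Romsey\tLOC",
    "19-5\t723-724\t,\t_\t_",
    "19-6\t725-730\tHants\thttps://en.wikipedia.org/wiki/Hampshire\tLOC",
]
record = rows_to_bio("1218", 19, rows)
print(json.dumps(record))
```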

Toponym disambiguation dataset

We also provide a dataset for training an entity disambiguation model (focusing on toponyms). The dataset is a single TSV file (linking_df_split.tsv) with one document per row and the following columns:

Finally, the TSV contains a set of columns that indicate how to split the dataset into training (train), development (dev), and testing (test) documents, or documents to leave out (left_out). We provide the following columns:

License

The datasets are released under the open license CC-BY-NC-SA, available at https://creativecommons.org/licenses/by-nc-sa/4.0/.

Copyright notice

Newspaper data has been provided by Findmypast Limited from the British Newspaper Archive, a partnership between the British Library and Findmypast (https://www.britishnewspaperarchive.co.uk/).

Funding statement

This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1). This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and Cambridge, King's College London, East Anglia, Exeter, and Queen Mary University of London.

Dataset creators

Mariona Coll Ardanuy and Federico Nanni.

Cite

If you use these datasets, please cite the following paper:

Coll Ardanuy, Mariona, David Beavan, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Katherine McDonough, Federico Nanni, Daniel van Strien, and Daniel C. S. Wilson. 2022. “A Dataset for Toponym Resolution in Nineteenth-century English Newspapers”. Journal of Open Humanities Data 8 (0): 3. DOI: https://doi.org/10.5334/johd.56

mcollardanuy commented 1 year ago

Hi @claireaustin01, we have created two datasets that are derived from an already published dataset (by us). You can find a first version of the Readme in the previous comment. I have added the same license and copyright notice as in the existing dataset: does that look good to you? Thank you!

claireaustin01 commented 1 year ago

@mcollardanuy please can you send me a copy of the dataset? Thanks a lot!

mcollardanuy commented 1 year ago

Yes, done, thank you!

mcollardanuy commented 1 year ago

Last version of the readme: https://hackmd.io/@RuIR-C3PSQuIhOl25pL2KA/Skjyt2pI2

mcollardanuy commented 1 year ago

Done: https://bl.iro.bl.uk/concern/datasets/ef537c70-87cb-495a-86c8-edffefa6bdc6