Hi @kallewesterling, I was undecided between BL repo and Huggingface. The first one seems easier.
I'd agree with that, especially if we start the process with the other datasets. We might as well keep 'em coming! :)
Also, given that the original dataset from which these are derived is on the BL repo already: https://bl.iro.bl.uk/concern/datasets/f3686eb9-4227-45cb-9acb-0453d35e6a03
We present datasets for the tasks of toponym recognition and toponym disambiguation, which are derived from "Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.23636/r7d4-kw08). The toponym recognition dataset consists of two JSON files (`ner_fine_train.json` and `ner_fine_dev.json`), whereas the toponym disambiguation dataset is provided as a TSV file (`linking_df_split.tsv`).
The toponym recognition dataset can be used to train a named entity recognition model (focusing on toponyms). The data is provided as two JSON files---one for training (consisting of 5216 examples) and one for development (consisting of 1304 examples)---in the JSON Lines format, where each line corresponds to a sentence.
Each sentence is a dictionary with three key-value pairs: `id` (a sentence identifier, consisting of the article number followed by an underscore and the sentence number), `tokens` (the list of tokens into which the sentence has been split), and `ner_tags` (the list of annotations per token, in the BIO format). The length of `tokens` and `ner_tags` should therefore always be the same. See below an example of three lines from one of the JSON files, representing three annotated sentences:
{"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]}
{"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","'","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]}
{"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",",","every","morning","at","a","quarter","before","6",",","Sundays","excepted","."]}
The dataset is derived from the training set of an existing dataset ("Dataset for Toponym Resolution in Nineteenth-Century English Newspapers", DOI: https://doi.org/10.23636/r7d4-kw08), which is randomly split (at the sentence level and with a ratio of 0.8/0.2) into training and development. You can find more information about the original dataset in the paper "A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.5334/johd.56).
For example, sentence 19 in file `1218_Poole.tsv` is originally:
#Text=Cary, of Ramsey, Hants, sent in a tender offering £lO more than Mr.
19-1 708-712 Cary _ _
19-2 712-713 , _ _
19-3 714-716 of _ _
19-4 717-723 Ramsey https://en.wikipedia.org/wiki/Romsey LOC
19-5 723-724 , _ _
19-6 725-730 Hants https://en.wikipedia.org/wiki/Hampshire LOC
19-7 730-731 , _ _
19-8 732-736 sent _ _
19-9 737-739 in _ _
19-10 740-741 a _ _
19-11 742-748 tender _ _
19-12 749-757 offering _ _
19-13 758-759 £ _ _
19-14 759-761 lO _ _
19-15 762-766 more _ _
19-16 767-771 than _ _
19-17 772-774 Mr _ _
19-18 774-775 . _ _
And is converted into the following format:
{"id":"1218_19","ner_tags":["O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["Cary",",","of","Ramsey",",","Hants",",","sent","in","a","tender","offering","\u00a3","lO","more","than","Mr","."]}
We also provide a dataset for training an entity disambiguation model (focusing on toponyms). The dataset consists of a single TSV file (`linking_df_split.tsv`), with one document per row and the following columns:
- `article_id`: article identifier, which consists of the number in the file name of the document in the original dataset (for example, the `article_id` of `1218_Poole1860.tsv` is `1218`).
- `sentences`: list of dictionaries, each corresponding to a sentence in the article, with two fields: `sentence_pos` (the position of the sentence in the article) and `sentence_text` (the text of the sentence). For example:
[
{
'sentence_pos': 1,
'sentence_text': 'DUKINFIELD. '
},
{
'sentence_pos': 2,
'sentence_text': 'Knutsford Sessions.'
},
{
'sentence_pos': 3,
'sentence_text': '—The servant girl, Eliza Ann Byrom, who stole a quantity of clothes from the house where she lodged, in Dukiafield, was sentenced to two months’ imprisonment. '
}
]
- `annotations`: list of dictionaries containing the annotated data. Each dictionary corresponds to a named entity mentioned in the text, with the following fields: `mention_pos` (order of the mention in the article), `mention` (the actual mention), `entity_type` (the type of named entity), `wkpd_url` (the Wikipedia URL of the resolved entity), `wkdt_qid` (the Wikidata ID of the resolved entity), `mention_start` (the character start position of the mention in the sentence), `mention_end` (the character end position of the mention in the sentence), and `sent_pos` (the position of the sentence in which the mention is found). For example:
[
{
'mention_pos': 0,
'mention': 'DUKINFIELD',
'entity_type': 'LOC',
'wkpd_url': 'https://en.wikipedia.org/wiki/Dukinfield',
'wkdt_qid': 'Q1976179',
'mention_start': 0,
'mention_end': 10,
'sent_pos': 1
},
{
'mention_pos': 1,
'mention': 'Knutsford',
'entity_type': 'LOC',
'wkpd_url': 'https://en.wikipedia.org/wiki/Knutsford',
'wkdt_qid': 'Q1470791',
'mention_start': 0,
'mention_end': 9,
'sent_pos': 2
},
{
'mention_pos': 2,
'mention': 'Dukiafield',
'entity_type': 'LOC',
'wkpd_url': 'https://en.wikipedia.org/wiki/Dukinfield',
'wkdt_qid': 'Q1976179',
'mention_start': 104,
'mention_end': 114,
'sent_pos': 3
}
]
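As a reading aid, here is a minimal sketch of loading the TSV with pandas and parsing the nested `sentences` and `annotations` columns. It assumes the nested columns are serialised as Python-style literals (as the single-quoted examples above suggest); if they were JSON-encoded instead, `json.loads` would be used in place of `ast.literal_eval`.

```python
import ast
import pandas as pd

df = pd.read_csv("linking_df_split.tsv", sep="\t")

# Parse the nested list-of-dictionaries columns from their string form.
df["sentences"] = df["sentences"].apply(ast.literal_eval)
df["annotations"] = df["annotations"].apply(ast.literal_eval)

# Print the resolved Wikidata ID for each mention in the first article.
first = df.iloc[0]
for ann in first["annotations"]:
    print(ann["mention"], "->", ann["wkdt_qid"])
```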
- `place`: a string with the place of publication. For example, "London".
- `decade`: the decade in which the article was published.
- `year`: the year in which the article was published.
- `ocr_quality_mean`: mean OCR quality of the article.
- `ocr_quality_sd`: standard deviation of the OCR quality of the article.
- `publication_title`: title of the publication in which the article appeared.
- `publication_code`: internal code of the publication in which the article appeared.
- `place_wqid`: a string with the Wikidata ID of the place of publication. For example, if `place` is London (UK), then `place_wqid` should be `Q84`.

Finally, the TSV contains a set of columns which can be used to indicate how to split the dataset into training (`train`), development (`dev`), testing (`test`), or documents to leave out (`left_out`). We provide the following columns:
- `originalsplit`: the articles are divided into train and test according to the original dataset. Train is further split into train (0.66) and dev (0.33).
- `apply`: the articles are divided into `train` and `dev`, with no articles left for testing. This split can be used to train the final entity disambiguation model, after the experiments.
- `withouttest`: this split can be used for development. The articles in the test set of the original dataset are left out, and the training set is split into train, development, and test.
- `Ashton1860`: the articles published in Ashton in the 1860s are used as the test set; the rest of the articles are used for training or development.
- `Dorchester1820`: the articles published in Dorchester in the 1820s are used as the test set; the rest of the articles are used for training or development.
- `Dorchester1830`: the articles published in Dorchester in the 1830s are used as the test set; the rest of the articles are used for training or development.
- `Dorchester1860`: the articles published in Dorchester in the 1860s are used as the test set; the rest of the articles are used for training or development.
- `Manchester1780`: the articles published in Manchester in the 1780s are used as the test set; the rest of the articles are used for training or development.
- `Manchester1800`: the articles published in Manchester in the 1800s are used as the test set; the rest of the articles are used for training or development.
- `Manchester1820`: the articles published in Manchester in the 1820s are used as the test set; the rest of the articles are used for training or development.
- `Manchester1830`: the articles published in Manchester in the 1830s are used as the test set; the rest of the articles are used for training or development.
- `Manchester1860`: the articles published in Manchester in the 1860s are used as the test set; the rest of the articles are used for training or development.
- `Poole1860`: the articles published in Poole in the 1860s are used as the test set; the rest of the articles are used for training or development.

The datasets are released under the open license CC-BY-NC-SA 4.0, available at https://creativecommons.org/licenses/by-nc-sa/4.0/.
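For completeness, here is a sketch of how one of the split columns described above could be used to partition the data. It assumes each split column holds the values train, dev, test, or left_out for every article, which is suggested but not stated explicitly here.

```python
import pandas as pd

df = pd.read_csv("linking_df_split.tsv", sep="\t")

# Choose one of the split columns, e.g. the original train/test division.
split_col = "originalsplit"  # or "apply", "withouttest", "Poole1860", ...
train_df = df[df[split_col] == "train"]
dev_df = df[df[split_col] == "dev"]
test_df = df[df[split_col] == "test"]

print(len(train_df), "train /", len(dev_df), "dev /", len(test_df), "test articles")
```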
Newspaper data has been provided by Findmypast Limited from the British Newspaper Archive, a partnership between the British Library and Findmypast (https://www.britishnewspaperarchive.co.uk/).
This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1). This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library, and the universities of Cambridge, King's College London, East Anglia, Exeter, and Queen Mary University of London.
Mariona Coll Ardanuy and Federico Nanni.
If you use these datasets, please cite the following paper:
Coll Ardanuy, Mariona, David Beavan, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Katherine McDonough, Federico Nanni, Daniel van Strien, and Daniel C. S. Wilson. 2022. “A Dataset for Toponym Resolution in Nineteenth-century English Newspapers”. Journal of Open Humanities Data 8 (0): 3. DOI: https://doi.org/10.5334/johd.56
Hi @claireaustin01, we have created two datasets that are derived from an already published dataset (by us). You can find a first version of the Readme in the previous comment. I have added the same license and copyright notice as in the existing dataset: does that look good to you? Thank you!
@mcollardanuy please can you send me a copy of the dataset? Thanks a lot!
Yes, done, thank you!
Latest version of the readme: https://hackmd.io/@RuIR-C3PSQuIhOl25pL2KA/Skjyt2pI2
Were you thinking BL repo for this as well, @mcollardanuy ? I'm making sure that I'm keeping a list of datasets in progress for now!