bigscience-workshop / lam

Libraries, Archives and Museums (LAM)
Apache License 2.0

Add dataset: contentious_contexts_corpus #68

Closed davanstrien closed 2 years ago

davanstrien commented 2 years ago

A URL for this dataset

https://github.com/cultural-ai/ConConCor

Dataset description

The Contentious Contexts Corpus dataset. This project was carried out in the context of the EuropeanaTech Challenge for Europeana Artificial Intelligence and Machine Learning datasets.

This dataset contains extracts from historical newspapers that contain keywords judged potentially contentious by present-day sensibilities. Each instance has multiple annotations, which makes it possible to quantify annotator agreement.

This dataset is potentially helpful in exploring how well machine learning methods can identify contentious use of words in historical texts (or predict the degree to which human annotators will find these words contentious).

The dataset also contains a label for 'Unreadable OCR', giving it a secondary function: evaluating machine learning methods that classify whether a particular piece of OCR text will be deemed unreadable by a human annotator.

Dataset modality

Text

Dataset licence

No response

Other licence

CC-BY 2.0

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

Contact details for data custodian

No response

shamikbose commented 2 years ago

self-assign

shamikbose commented 2 years ago

For this dataset, I was thinking of using a structure as follows:

| extract_id | text | target | annotator_responses |
| --- | --- | --- | --- |
| H-99 | Hollandsche IJzeren Spoorweg-Maatschappij een vijftal rijtuigen, 𝙜𝙚𝙢𝙚𝙣𝙜𝙙 eerste en tweede klasse, die zóó schoon zijn afgewerkt, dat zij m | gemengd | [{annotator_id: [response, suggestion]}] |
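As a rough illustration of the structure proposed above, one record could be modelled in Python as follows. This is only a sketch: the `Extract` type name, the annotator ID, and the response value are hypothetical placeholders, not values taken from the dataset, and the highlighted target word is written in plain characters here.

```python
from typing import Dict, List, TypedDict


class Extract(TypedDict):
    """One row of the proposed structure (type name is hypothetical)."""

    extract_id: str
    text: str
    target: str
    # One dict per annotator: {annotator_id: [response, suggestion]}
    annotator_responses: List[Dict[str, List[str]]]


example: Extract = {
    "extract_id": "H-99",
    # Text as shown in the thread; it is cut off in the source after "dat zij m".
    "text": (
        "Hollandsche IJzeren Spoorweg-Maatschappij een vijftal rijtuigen, "
        "gemengd eerste en tweede klasse, die zóó schoon zijn afgewerkt, dat zij m"
    ),
    "target": "gemengd",
    # Hypothetical annotator ID and response, for illustration only.
    "annotator_responses": [{"annotator_1": ["Niet omstreden", ""]}],
}
```

Keeping the responses as a list of per-annotator dicts leaves every individual judgement available, so nothing is aggregated away before the end user sees it.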

Does this work, @davanstrien? P.S. As a person who does not speak Dutch, is there something I should know before I process this dataset?

davanstrien commented 2 years ago

For this dataset, I was thinking of using a structure as follows:

| extract_id | text | target | annotator_responses |
| --- | --- | --- | --- |
| H-99 | Hollandsche IJzeren Spoorweg-Maatschappij een vijftal rijtuigen, 𝙜𝙚𝙢𝙚𝙣𝙜𝙙 eerste en tweede klasse, die zóó schoon zijn afgewerkt, dat zij m | gemengd | [{annotator_id: [response, suggestion]}] |

Does this work, @davanstrien?

I think that makes sense. We could do some processing to generate an annotator agreement score, but I think this is up to the end user to decide how to do this.
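For anyone who does want a score, one simple option is the share of annotators who picked the most common response. The sketch below assumes the responses for an extract have already been pulled out into a flat list; the function name and the sample responses are hypothetical, and this is a plain majority share, not a chance-corrected measure such as Fleiss' kappa.

```python
from collections import Counter
from typing import List, Tuple


def majority_agreement(responses: List[str]) -> Tuple[str, float]:
    """Return the most common response and the share of annotators who chose it."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(responses)


# Hypothetical responses from four annotators for a single extract.
responses = [
    "Niet omstreden",
    "Niet omstreden",
    "Omstreden naar huidige maatstaven",
    "Weet ik niet",
]
label, share = majority_agreement(responses)
```

A user could then keep only extracts whose share exceeds a threshold of their choosing, which is exactly the kind of decision that is better left out of the loader.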

P.S. As a person who does not speak Dutch, is there something I should know before I process this dataset?

The only thing I was going to suggest was putting the column names in English, but I think you have already worked those out. I was just about to check the possible responses, but they seem to have been translated already:

response: the multiple-choice options for each extract: “Omstreden naar huidige maatstaven” (“Contentious according to current standards”), “Niet omstreden” (“Not contentious”), “Weet ik niet” (“I don’t know”), “Onleesbare OCR” (“Illegible OCR”)

I think we could keep these in Dutch or translate them into English. I would probably translate them into English, since non-Dutch speakers may also end up working with this dataset, but let me know if you think this doesn't make sense.

shamikbose commented 2 years ago

We could do some processing to generate an annotator agreement score, but I think this is up to the end user to decide how to do this.

Yeah, I want to leave it up to the end-user to generate a score instead of forcing one on them

I think we could keep these in Dutch or translate them into English. I would probably translate them into English, since non-Dutch speakers may also end up working with this dataset, but let me know if you think this doesn't make sense.

I could do both: make two columns, annotator_response_english and annotator_response_dutch. It's only a matter of creating a mapping. Depending on who's using the dataset, they can use the appropriate column.

EDIT: I could also create two different configs, one in Dutch and the other in English. The config can be specified when downloading the dataset.
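The mapping could look roughly like the sketch below. The Dutch and English strings are the four response options quoted earlier in this thread, and the two column names are the ones proposed above; the dictionary name, the helper function, and the sample row are hypothetical.

```python
from typing import Dict

# Response options as quoted in this thread (Dutch -> English).
RESPONSE_NL_TO_EN: Dict[str, str] = {
    "Omstreden naar huidige maatstaven": "Contentious according to current standards",
    "Niet omstreden": "Not contentious",
    "Weet ik niet": "I don't know",
    "Onleesbare OCR": "Illegible OCR",
}


def add_english_response(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with an English response column added.

    Unknown responses are passed through unchanged rather than dropped.
    """
    dutch = row["annotator_response_dutch"]
    return {**row, "annotator_response_english": RESPONSE_NL_TO_EN.get(dutch, dutch)}


# Hypothetical row, for illustration only.
row = add_english_response({"annotator_response_dutch": "Onleesbare OCR"})
```

Passing unknown values through unchanged means an unexpected response in the source data stays visible instead of silently becoming empty.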

davanstrien commented 2 years ago

I could do both: make two columns, annotator_response_english and annotator_response_dutch. It's only a matter of creating a mapping. Depending on who's using the dataset, they can use the appropriate column.

EDIT: I could also create two different configs, one in Dutch and the other in English. The config can be specified when downloading the dataset.

That sounds good :)

shamikbose commented 2 years ago

@davanstrien https://huggingface.co/datasets/shamikbose89/contentious_contexts

davanstrien commented 2 years ago

@davanstrien huggingface.co/datasets/shamikbose89/contentious_contexts

Is this ready for review? I had a quick look, and all looks good so far. Do you want to move across to BigLAM org? I'll maybe just add a section in the datacard on suggested ways of processing the annotations to give people some starting points for approaching that.

shamikbose commented 2 years ago

Yes, it's #ready-for-review. Sorry, I forgot to add the tag.

shamikbose commented 2 years ago

ready-for-review

davanstrien commented 2 years ago

No worries, I'll take a look at this tomorrow (and try and catch up with the others too!)

shamikbose commented 2 years ago

No problem! I am going to be away for two weeks starting next week, but I'll take a look at any issues as soon as I'm back. Hopefully, I can build a few more dataloaders before I leave

davanstrien commented 2 years ago

I made a few minor suggested changes in a PR here https://huggingface.co/datasets/biglam/contentious_contexts/discussions. I'll make a separate PR for some suggestions for the datacard.

davanstrien commented 2 years ago

Closing this one. @shamikbose, thanks so much for working on this :)