Closed davanstrien closed 2 years ago
For this dataset, I was thinking of using a structure as follows:
extract_id | text | target | annotator_responses
---|---|---|---
H-99 | Hollandsche IJzeren Spoorweg-Maatschappij een vijftal rijtuigen, 𝙜𝙚𝙢𝙚𝙣𝙜𝙙 eerste en tweede klasse, die zóó schoon zijn afgewerkt, dat zij m | gemengd | [{annotator_id: [response, suggestion]}]
Does this work, @davanstrien? P.S. As a person who does not speak Dutch, is there something I should know before I process this dataset?
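As a sketch of what one row under this proposed structure might look like (the annotator IDs and responses below are made up for illustration; the label strings come from the annotation scheme discussed later in this thread):

```python
# Hypothetical example of a single row under the proposed structure.
# "annotator_1"/"annotator_2" and their responses are invented for illustration.
record = {
    "extract_id": "H-99",
    "text": (
        "Hollandsche IJzeren Spoorweg-Maatschappij een vijftal rijtuigen, "
        "gemengd eerste en tweede klasse, ..."
    ),
    "target": "gemengd",
    "annotator_responses": [
        {"annotator_1": ["Niet omstreden", None]},
        {"annotator_2": ["Omstreden naar huidige maatstaven", None]},
    ],
}
```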
I think that makes sense. We could do some processing to generate an annotator agreement score, but I think this is up to the end user to decide how to do this.
P.S. As a person who does not speak Dutch, is there something I should know before I process this dataset?
The only thing I was going to suggest was putting the column names in English, but I think you have already worked those out. I was just about to check the possible responses, but they seem to have been translated already:
response: the multiple-choice options for each extract: "Omstreden naar huidige maatstaven" ("Contentious according to current standards"), "Niet omstreden" ("Not contentious"), "Weet ik niet" ("I don't know"), "Onleesbare OCR" ("Illegible OCR")
I think we could keep these in Dutch or translate them into English. I would probably translate them into English, since non-Dutch speakers may also end up working with this dataset, but let me know if you think this doesn't make sense.
We could do some processing to generate an annotator agreement score, but I think this is up to the end user to decide how to do this.
Yeah, I want to leave it up to the end-user to generate a score instead of forcing one on them
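If the dataset exposes the raw per-annotator labels, an end user could compute a simple agreement measure themselves. As a minimal sketch (percent agreement with the modal label; more principled measures such as Krippendorff's alpha would work the same way on these columns):

```python
from collections import Counter


def percent_agreement(responses):
    """Fraction of annotators who chose the most common label for an extract."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


# e.g. three annotators, two of whom agree -> 2/3
score = percent_agreement(
    ["Niet omstreden", "Niet omstreden", "Omstreden naar huidige maatstaven"]
)
```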
I think we could keep these in Dutch or translate them into English. I would probably translate them into English, since non-Dutch speakers may also end up working with this dataset, but let me know if you think this doesn't make sense.
I could do both: make two columns, annotator_response_english and annotator_response_dutch. It's only a matter of creating a mapping. Depending on who's using the dataset, they can use the appropriate column.
EDIT: I could also create two different configs, one in Dutch and the other in English. This can be specified when downloading the dataset.
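The mapping itself would be small; a sketch, using the label strings quoted earlier in this thread (the function name and row format are assumptions, not part of the actual loader):

```python
# Mapping from the Dutch response labels to their English translations,
# as listed earlier in this thread.
DUTCH_TO_ENGLISH = {
    "Omstreden naar huidige maatstaven": "Contentious according to current standards",
    "Niet omstreden": "Not contentious",
    "Weet ik niet": "I don't know",
    "Onleesbare OCR": "Illegible OCR",
}


def add_english_response(row):
    """Hypothetical helper: derive the English column from the Dutch one."""
    row["annotator_response_english"] = DUTCH_TO_ENGLISH[row["annotator_response_dutch"]]
    return row
```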
That sounds good :)
@davanstrien huggingface.co/datasets/shamikbose89/contentious_contexts
Is this ready for review? I had a quick look, and all looks good so far. Do you want to move across to BigLAM org? I'll maybe just add a section in the datacard on suggested ways of processing the annotations to give people some starting points for approaching that.
Yes, it's #ready-for-review. Sorry, I forgot to add the tag.
No worries, I'll take a look at this tomorrow (and try and catch up with the others too!)
No problem! I am going to be away for two weeks starting next week, but I'll take a look at any issues as soon as I'm back. Hopefully, I can build a few more dataloaders before I leave
I made a few minor suggested changes in a PR here https://huggingface.co/datasets/biglam/contentious_contexts/discussions. I'll make a separate PR for some suggestions for the datacard.
Closing this one, @shamikbose thanks so much for working on this :)
A URL for this dataset
https://github.com/cultural-ai/ConConCor
Dataset description
This dataset contains extracts from historical newspapers that contain potentially contentious keywords (according to present-day sensibilities). The dataset contains multiple annotations per instance, giving the option to quantify agreement scores for annotations.
This dataset is potentially helpful in exploring how well machine learning methods can identify contentious use of words in historical texts (or predict the degree to which human annotators will find these words contentious).
The dataset also contains a label for 'Unreadable OCR', giving this dataset a secondary function: evaluating machine learning methods for classifying whether a particular piece of OCR text will be deemed unreadable by a human annotator.
Dataset modality
Text
Dataset licence
No response
Other licence
CC-BY 2.0
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
Contact details for data custodian
No response