DFKI-NLP / sherlock

State-of-the-art Information Extraction

Prepare dataset for training Binary RC #54

Closed leonhardhennig closed 2 years ago

leonhardhennig commented 2 years ago

Prepare a large supervised dataset for training binary RC models. The dataset should be in the dfki-tacred-jsonl format, so that we can use the corresponding reader. Merge the following datasets, creating a mapping between relations as necessary (see https://www.overleaf.com/9545956456ysgmbxxrsxfv), by merging their respective train, dev and test splits:

Exclude the relation types that I listed as "not so useful" per dataset in https://www.overleaf.com/9545956456ysgmbxxrsxfv. If a dataset does not have a freely available test split (e.g. FewRel), just ignore it.

Store the original datasets on the GPU cluster in /ds/text (so that we have them available for future experiments as well). Store the "unionized" large dataset there as well, with a sensible name + readme describing/pointing to our work.

phucdev commented 2 years ago

I uploaded the original datasets to the GPU cluster. The test split in KnowledgeNet does not contain facts, i.e. only the train split is fully annotated.

phucdev commented 2 years ago

@leonhardhennig

  1. Are there any guidelines for the naming of the relations? Or is there a starter set for the PLASS relation type set? I'm trying to figure out how to construct the unified relation type set and the mapping from the different datasets.
  2. It probably makes sense to use the TACRED format for the "unionized" dataset, right?
phucdev commented 2 years ago

DocRED: filter to only include sentences? (I agree that mixing datasets with document-level vs sentence-level relations is probably not a good idea.)

Does this mean that we only include examples where the head and tail entity are in the same sentence? The evidence field should probably only contain one sentence id then. In the example in the paper we would only use the first relation example and discard the second example.

[figure: relation example from the DocRED paper]
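For illustration, a minimal sketch of such an intra-sentence filter, assuming the standard DocRED JSON fields (`sents`, `vertexSet`, `labels`); the actual preprocessor in this repo may be structured differently:

```python
def intra_sentence_facts(doc):
    """Yield (head_mention, tail_mention, relation, sent_id) for DocRED facts
    whose head and tail entities both have a mention in the same sentence."""
    for fact in doc.get("labels", []):
        head_mentions = doc["vertexSet"][fact["h"]]
        tail_mentions = doc["vertexSet"][fact["t"]]
        for head in head_mentions:
            for tail in tail_mentions:
                if head["sent_id"] == tail["sent_id"]:
                    # keep only this single sentence as evidence
                    yield head, tail, fact["r"], head["sent_id"]
```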

leonhardhennig commented 2 years ago

@leonhardhennig

1. Are there any guidelines for the naming of the relations? Or is there a starter set for the PLASS relation type set? I'm trying to figure out how to construct the unified relation type set and the mapping from the different datasets.

I'd start with TACRED and the KnowledgeNet mappings you already have. I checked all relations and their types here: https://www.overleaf.com/9545956456ysgmbxxrsxfv - if you can make sense of my notation, this might be a starting point for the mapping, e.g. for DocRED the section "covered in TACRED/KNET".

2. It probably makes sense to use the TACRED format for the "unionized" dataset, right?

Yes (either the original JSON or the "dfki" JSONL version, which is a bit more efficient to read).
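For reference, a single record in a TACRED-style JSONL file could look roughly like the sketch below. The field names follow the original TACRED JSON release (which additionally carries `docid` and Stanford POS/NER/dependency annotations); the exact schema of the dfki jsonl variant may differ, and the id and sentence here are purely illustrative:

```python
import json

# Hypothetical TACRED-style example: subject "Douglas Flint" (PERSON),
# object "chairman" (TITLE), relation per:title; span indices are inclusive.
example = {
    "id": "example-0001",  # made-up id
    "token": ["Douglas", "Flint", "will", "become", "chairman", "of", "HSBC", "."],
    "subj_start": 0, "subj_end": 1, "subj_type": "PERSON",
    "obj_start": 4, "obj_end": 4, "obj_type": "TITLE",
    "relation": "per:title",
}

# JSONL: one JSON object per line.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```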

leonhardhennig commented 2 years ago

Does this mean that we only include examples where the head and tail entity are in the same sentence?

Yes, that was the idea, if there are any examples of that type.

leonhardhennig commented 2 years ago

@phucdev @kikoucha one more request - if you haven't started a lot of training yet: The original TACRED dataset contains quite a few mislabeled examples, and we actually published a patch for the test set - https://github.com/DFKI-NLP/tacrev#-patch-the-original-tacred . Is there still time to integrate this? Sorry that I forgot to put this in the issue earlier. The patched version could maybe be stored on the GPU cluster as well, e.g. in koeln:/ds/text/tacred/patched_tacrev ?
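Not the official procedure (the tacrev README documents that), but as a rough sketch of what applying such a patch amounts to, assuming the corrections are available as a mapping from TACRED example id to corrected label (`tacrev_patch.json` is a hypothetical file name):

```python
import json

def apply_label_patch(examples, id_to_label):
    """Overwrite relation labels for the example ids listed in the patch."""
    patched = 0
    for ex in examples:
        new_label = id_to_label.get(ex["id"])
        if new_label is not None and new_label != ex["relation"]:
            ex["relation"] = new_label
            patched += 1
    return patched

with open("test.json", encoding="utf-8") as f:
    test_examples = json.load(f)
with open("tacrev_patch.json", encoding="utf-8") as f:  # hypothetical patch format
    id_to_label = json.load(f)

print(apply_label_patch(test_examples, id_to_label), "labels patched")
```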

phucdev commented 2 years ago

I'll look into it and upload the patched version to the GPU cluster

phucdev commented 2 years ago

I did not have permission to store it in koeln:/ds/text/tacred/patched_tacrev, so I stored it in koeln:/ds/text/patched_tacrev

phucdev commented 2 years ago

A set of questions:

1. Merging location relation types

Do we merge the location relation types (city|stateorprovince|country)_of_xxx into place_of_xxx, i.e. map the more specific location relation types in TACRED to more general ones? Otherwise we would have a mix of specific and general location relation types, which would most likely be confusing for a model trained on that data.

2. Merge child/parent relation types / add inverse relation types

In TACRED we have org:parents & org:subsidiaries and per:parents & per:children. Do we, for example, merge per:parents & per:children by mapping per:parents to per:children and flipping the arguments? I ask this because datasets such as KNET only contain a CHILD_OF relation type (and SUBSIDIARY_OF).

So do we:

a) Merge the parent/children relation types: map one to the other and flip the arguments
b) Keep both relation types
c) Keep both relation types and add the inverse relation type, i.e. infer per:parents from CHILD_OF in KNET

3. Generate negative examples for datasets with no negative examples

Some of the datasets, e.g. KNET, do not contain negative examples. Do we generate negative examples for those, or are the negative examples in TACRED, GIDS and PLASS enough?

4. Naming

I tried to follow the naming scheme in TACRED and came up with the following "unified" label set: https://github.com/DFKI-NLP/sherlock/blob/wip_add_dataset_preprocessing/sherlock/dataset_preprocessors/relation_types.py

It is not as obvious how to add a NER prefix to the following labels:

leonhardhennig commented 2 years ago

Do we merge the location relation types (city|stateorprovince|country)_of_xxx into place_of_xxx, i.e. map the more specific location relation types in TACRED to more general ones? Otherwise we would have a mix of specific and general location relation types, which would most likely be confusing for a model trained on that data.

Yes, makes sense.
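For illustration, the TACRED side of that merge could be expressed as a simple label mapping (the authoritative mapping lives in sherlock/dataset_preprocessors/relation_types.py; this sketch only covers the location relations):

```python
# Collapse TACRED's fine-grained location relations onto the coarse labels
# used in the unified label set; all other labels pass through unchanged.
TACRED_LOCATION_MAP = {
    "per:city_of_birth": "per:place_of_birth",
    "per:stateorprovince_of_birth": "per:place_of_birth",
    "per:country_of_birth": "per:place_of_birth",
    "per:city_of_death": "per:place_of_death",
    "per:stateorprovince_of_death": "per:place_of_death",
    "per:country_of_death": "per:place_of_death",
    "per:cities_of_residence": "per:places_of_residence",
    "per:stateorprovinces_of_residence": "per:places_of_residence",
    "per:countries_of_residence": "per:places_of_residence",
    "org:city_of_headquarters": "org:place_of_headquarters",
    "org:stateorprovince_of_headquarters": "org:place_of_headquarters",
    "org:country_of_headquarters": "org:place_of_headquarters",
}

def map_location_label(label: str) -> str:
    return TACRED_LOCATION_MAP.get(label, label)
```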

Merge child/parent relation types/ add inverse relation types

No. I know this is inconsistent, as it is in most of the datasets (some use inverse relations, some don't, and sometimes not for all relations, ...). So option b) - just keep everything as it is, and don't augment by generating inverse relations.

Generate negative examples

Hmm, not sure. For now, let's not generate any new negatives, and see what the %NA value looks like in the end

Naming

Looks good. We could have both per:publisher and org:publisher (similar to per/org:parents), and the same for location (fac:loc, event:loc, item:loc). gpe:head_of_gov/state is also ok.

leonhardhennig commented 2 years ago

The main problem will be that there will be many relations - too many to validate with crowd workers. The NER tagset already had to be split into 3 separate crowd validation jobs; relations will be even harder to annotate. Let's look over the final relation_types list together and see how many examples we have per relation type - maybe we need to drop long-tail relations...

phucdev commented 2 years ago

Okay, a quick follow-up on 4. Naming:

Naming

Looks good. We could have both per:publisher and org:publisher (similar to per/org:parents), and the same for location (fac:loc, event:loc, item:loc). gpe:head_of_gov/state is also ok.

In order to get the right prefix we would need the correct entity type for the subject, but some of the datasets (e.g. FewRel) do not contain NER annotation. I also suspect that further separating the relations might worsen the main problem you described.

leonhardhennig commented 2 years ago

Ok, then let's drop those relations for now.

phucdev commented 2 years ago

Update on the converted dataset

I went through the DocRED and FewRel mappings again, fixed some mistakes, added some missing ones, swapped arguments for some relations to fit the TACRED relations and added useful information to the relation mappings (entity type order). I then added a bit of logging to get some information on the data and the resulting label distribution.
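The conversion and logging described above boils down to something like the sketch below (the mapping excerpt and the `head`/`tail` field names are illustrative, not the exact code in the preprocessors):

```python
from collections import Counter

# Illustrative excerpt: Wikidata property id -> (unified label, swap head/tail?)
DOCRED_MAP = {
    "P17": ("loc:country", False),
    "P22": ("per:parents", False),   # "father" maps onto per:parents
    "P40": ("per:children", False),
}

def convert_and_count(raw_examples, mapping):
    """Map source relations onto the unified label set, swap arguments where
    the source relation is the inverse of ours, and log the label distribution."""
    converted, discarded, counts = [], 0, Counter()
    for ex in raw_examples:
        entry = mapping.get(ex["relation"])
        if entry is None:
            discarded += 1          # relation not covered by the mapping
            continue
        label, swap = entry
        if swap:
            ex["head"], ex["tail"] = ex["tail"], ex["head"]
        ex["relation"] = label
        converted.append(ex)
        counts[label] += 1
    print(f"{len(converted)} examples in converted file, "
          f"{discarded} examples were discarded during label mapping")
    return converted, counts
```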

Label distribution (updated on 30.07.2022)

[chart: label distribution of the merged dataset]

# 68124 examples in original file, 68124 examples in converted file, 0 examples were discarded during label mapping
tacrev_train = {'org:founded_by': 124, 'no_relation': 55112, 'per:employee_of': 1524, 'org:alternate_names': 808, 'per:places_of_residence': 1150, 'per:children': 211, 'per:title': 2443, 'per:siblings': 165, 'per:religion': 53, 'per:age': 390, 'org:website': 111, 'org:member_of': 122, 'org:top_members/employees': 1890, 'org:place_of_headquarters': 1079, 'org:members': 170, 'per:spouse': 258, 'org:number_of_employees/members': 75, 'org:parents': 286, 'org:subsidiaries': 296, 'per:origin': 325, 'org:political/religious_affiliation': 105, 'per:other_family': 179, 'per:place_of_birth': 131, 'org:dissolved': 23, 'per:date_of_death': 134, 'org:shareholders': 76, 'per:alternate_names': 104, 'per:parents': 152, 'per:schools_attended': 149, 'per:cause_of_death': 117, 'per:place_of_death': 136, 'org:founded': 91, 'per:date_of_birth': 63, 'per:charges': 72}
# 22631 examples in original file, 22631 examples in converted file, 0 examples were discarded during label mapping
tacrev_dev ={'per:title': 969, 'no_relation': 17331, 'per:place_of_death': 210, 'per:origin': 240, 'per:date_of_death': 188, 'org:top_members/employees': 528, 'org:place_of_headquarters': 387, 'per:religion': 62, 'per:place_of_birth': 94, 'per:employee_of': 407, 'org:website': 91, 'per:cause_of_death': 175, 'org:subsidiaries': 105, 'per:places_of_residence': 312, 'per:siblings': 29, 'org:alternate_names': 348, 'per:parents': 56, 'per:spouse': 155, 'per:age': 245, 'per:date_of_birth': 31, 'per:children': 105, 'org:parents': 81, 'per:schools_attended': 50, 'per:charges': 115, 'org:shareholders': 35, 'org:founded': 34, 'org:founded_by': 75, 'org:members': 64, 'per:alternate_names': 37, 'org:number_of_employees/members': 24, 'org:political/religious_affiliation': 14, 'per:other_family': 26, 'org:member_of': 7, 'org:dissolved': 1}
# 15509 examples in original file, 15509 examples in converted file, 0 examples were discarded during label mapping
tacrev_test = {'no_relation': 12386, 'per:title': 509, 'org:place_of_headquarters': 232, 'org:top_members/employees': 348, 'per:parents': 86, 'per:age': 201, 'per:places_of_residence': 333, 'per:children': 39, 'org:alternate_names': 245, 'per:charges': 106, 'per:origin': 111, 'org:founded_by': 76, 'per:employee_of': 252, 'per:siblings': 59, 'per:cause_of_death': 47, 'org:website': 27, 'per:place_of_death': 39, 'org:parents': 57, 'org:subsidiaries': 31, 'per:other_family': 37, 'org:number_of_employees/members': 14, 'per:religion': 43, 'per:date_of_birth': 7, 'org:shareholders': 3, 'per:spouse': 66, 'org:member_of': 4, 'per:schools_attended': 31, 'per:date_of_death': 42, 'org:political/religious_affiliation': 10, 'org:founded': 37, 'org:members': 16, 'per:place_of_birth': 13, 'org:dissolved': 1, 'per:alternate_names': 1}

# 10895 examples in converted file, 0 examples were discarded during label mapping
knet_train = {'org:subsidiaries': 432, 'per:date_of_death': 493, 'per:origin': 513, 'org:top_members/employees': 514, 'per:schools_attended': 756, 'org:founded_by': 595, 'per:spouse': 1092, 'per:employee_of': 1408, 'org:founded': 412, 'per:political_affiliation': 504, 'per:place_of_birth': 909, 'per:places_of_residence': 1187, 'per:date_of_birth': 619, 'org:place_of_headquarters': 733, 'per:children': 728}

# 11297 examples in original file, 11297 examples in converted file, 0 examples were discarded during label mapping
gids_train = {'per:place_of_death': 2088, 'per:place_of_birth': 2001, 'no_relation': 2771, 'per:schools_attended': 2652, 'per:degree': 1785}
# 1864 examples in original file, 184 examples in converted file, 0 examples were discarded during label mapping
gids_dev = {'per:place_of_death': 365, 'per:schools_attended': 439, 'no_relation': 447, 'per:place_of_birth': 323, 'per:degree': 290}
# 5663 examples in original file, 5663 examples in converted file, 0 examples were discarded during label mapping
gids_test = {'per:place_of_death': 1016, 'per:place_of_birth': 1032, 'no_relation': 1356, 'per:degree': 894, 'per:schools_attended': 1365}

# 14689 examples in converted file, 6237 examples were discarded during label mapping
docred_train = {'org:place_of_headquarters': 143, 'loc:country': 3745, 'loc:located_in': 2605, 'per:country_of_citizenship': 1405, 'per:date_of_birth': 994, 'per:place_of_birth': 356, 'org:founded': 193, 'org:dissolved': 60, 'per:member_of': 155, 'per:performer': 615, 'per:place_of_death': 113, 'per:date_of_death': 699, 'per:head_of_gov/state': 183, 'org:members': 73, 'loc:capital_of': 88, 'per:spouse': 215, 'per:parents': 208, 'per:children': 212, 'loc:country_of_origin': 223, 'per:notable_work': 127, 'per:political_affiliation': 267, 'org:location_of_formation': 43, 'event:conflict': 140, 'per:schools_attended': 146, 'org:production_company': 46, 'per:director': 152, 'per:employee_of': 115, 'org:shareholders': 98, 'per:title': 18, 'per:composer': 46, 'per:lyrics_by': 20, 'per:author': 221, 'per:producer': 59, 'per:siblings': 243, 'per:religion': 72, 'per:places_of_residence': 22, 'org:top_members/employees': 46, 'per:screenwriter': 74, 'per:creator': 77, 'org:founded_by': 57, 'org:developer': 148, 'org:product_or_technology_or_service': 72, 'org:parents': 38, 'org:subsidiaries': 52, 'loc:unemployment_rate': 1, 'loc:twinned_adm_body': 4}
# 4714 examples in converted file, 2030 examples were discarded during label mapping
docred_dev = {'org:place_of_headquarters': 47, 'loc:country': 1201, 'loc:located_in': 799, 'org:parents': 32, 'org:subsidiaries': 24, 'per:employee_of': 25, 'per:country_of_citizenship': 422, 'per:performer': 219, 'per:place_of_birth': 104, 'per:date_of_birth': 334, 'per:date_of_death': 232, 'org:product_or_technology_or_service': 32, 'org:founded': 56, 'loc:capital_of': 25, 'event:conflict': 58, 'per:political_affiliation': 81, 'per:member_of': 50, 'org:dissolved': 24, 'per:title': 5, 'per:places_of_residence': 5, 'org:shareholders': 37, 'per:spouse': 64, 'per:place_of_death': 31, 'per:composer': 17, 'loc:country_of_origin': 78, 'org:members': 15, 'per:director': 53, 'per:screenwriter': 10, 'per:producer': 13, 'org:production_company': 24, 'per:siblings': 100, 'per:children': 37, 'per:parents': 33, 'org:location_of_formation': 10, 'per:head_of_gov/state': 56, 'org:developer': 33, 'per:lyrics_by': 3, 'per:schools_attended': 38, 'org:top_members/employees': 15, 'per:author': 78, 'per:creator': 23, 'per:notable_work': 48, 'org:founded_by': 20, 'per:religion': 48, 'loc:twinned_adm_body': 2}

# 16800 examples in converted file, 28000 examples were discarded during label mapping
fewrel_train = {'per:religion': 700, 'per:head_of_gov/state': 700, 'per:country_of_citizenship': 700, 'per:performer': 700, 'per:title': 1400, 'org:location_of_formation': 700, 'loc:located_in': 1400, 'loc:country_of_origin': 700, 'per:director': 700, 'per:parents': 700, 'org:product_or_technology_or_service': 700, 'per:political_affiliation': 700, 'org:place_of_headquarters': 700, 'per:siblings': 700, 'loc:country': 700, 'per:places_of_residence': 700, 'org:subsidiaries': 700, 'org:shareholders': 700, 'per:composer': 700, 'per:screenwriter': 700, 'per:field_of_work': 700, 'per:notable_work': 700}
# 2800 examples in converted file, 8400 examples were discarded during label mapping
fewrel_dev = {'per:spouse': 700, 'per:parents': 700, 'org:members': 700, 'per:children': 700}

# 35806 examples in original file, 19213 examples in converted file, 16593 examples were discarded during label mapping
smiler_train = {'per:children': 1019, 'org:members': 590, 'per:member_of': 646, 'per:country_of_citizenship': 2772, 'per:title': 1904, 'loc:location_of': 2953, 'per:siblings': 717, 'no_relation': 1319, 'per:director': 2241, 'per:spouse': 949, 'per:place_of_birth': 1754, 'org:top_members/employees': 614, 'per:parents': 1089, 'per:origin': 193, 'org:place_of_headquarters': 146, 'org:founded_by': 307}
# 731 examples in original file, 393 examples in converted file, 338 examples were discarded during label mapping
smiler_test = {'no_relation': 27, 'per:origin': 4, 'per:director': 46, 'per:children': 21, 'org:top_members/employees': 13, 'per:place_of_birth': 36, 'per:title': 39, 'per:country_of_citizenship': 57, 'per:spouse': 19, 'org:founded_by': 6, 'per:siblings': 15, 'per:member_of': 13, 'loc:location_of': 60, 'org:members': 12, 'per:parents': 22, 'org:place_of_headquarters': 3}

# 15917 examples in original file, 15917 examples in converted file, 0 examples were discarded during label mapping
kbp37_train = {'per:employee_of': 3472, 'org:place_of_headquarters': 2790, 'org:members': 703, 'org:founded_by': 355, 'org:subsidiaries': 402, 'per:places_of_residence': 3043, 'no_relation': 1545, 'per:title': 641, 'org:top_members/employees': 576, 'org:founded': 393, 'org:alternate_names': 511, 'per:spouse': 258, 'per:place_of_birth': 355, 'org:parents': 430, 'per:alternate_names': 177, 'per:origin': 266}
# 1724 examples in original file, 1724 examples in converted file, 0 examples were discarded during label mapping
kbp37_dev = {'org:alternate_names': 63, 'org:place_of_headquarters': 341, 'per:places_of_residence': 290, 'org:parents': 54, 'per:origin': 28, 'org:subsidiaries': 49, 'org:founded': 53, 'no_relation': 210, 'per:spouse': 29, 'org:members': 82, 'per:alternate_names': 24, 'per:employee_of': 273, 'per:place_of_birth': 50, 'org:founded_by': 34, 'org:top_members/employees': 68, 'per:title': 76}
# 3405 examples in original file, 3405 examples in converted file, 0 examples were discarded during label mapping
kbp37_test = {'org:place_of_headquarters': 659, 'org:alternate_names': 125, 'per:places_of_residence': 564, 'org:members': 160, 'per:alternate_names': 46, 'org:parents': 103, 'per:spouse': 57, 'org:subsidiaries': 90, 'no_relation': 419, 'per:employee_of': 568, 'per:origin': 65, 'per:title': 137, 'org:founded': 107, 'org:top_members/employees': 136, 'org:founded_by': 80, 'per:place_of_birth': 89}

The collated dataset:

Some notes/questions

TODOs

leonhardhennig commented 2 years ago

KnowledgeNet, DocRED and FewRel do not have test splits. The FewRel splits have different relation types.

a) union of everything, create a new train/dev/test split (stratified)

  • does not allow testing on the original test splits
  • better sample size per class

b) create the missing splits for the individual datasets, then do the merge

  • allows testing on the original splits
  • more effort

Decision: use a)
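A minimal sketch of option a) with scikit-learn's stratified splitting, assuming each merged example carries a `relation` field (the actual split code may differ):

```python
from sklearn.model_selection import train_test_split

def stratified_split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Create an 80/10/10 train/dev/test split, stratified by relation label."""
    labels = [ex["relation"] for ex in examples]
    train, rest, _, rest_labels = train_test_split(
        examples, labels,
        test_size=dev_frac + test_frac,
        stratify=labels, random_state=seed,
    )
    dev, test = train_test_split(
        rest,
        test_size=test_frac / (dev_frac + test_frac),
        stratify=rest_labels, random_state=seed,
    )
    return train, dev, test
```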

Most of the datasets contain NER annotation, but FewRel does not. Do we need to create NER mappings for those that contain NER annotation?

Decision - discard all RE instances from source datasets where we don't have the exact NER type either for HEAD or for TAIL
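As a sketch, that decision is just a filter over the converted examples, assuming `subj_type`/`obj_type` fields that are missing or empty when the source dataset had no NER annotation:

```python
def has_exact_ner_types(example):
    """True only if both HEAD and TAIL carry a concrete NER type."""
    return bool(example.get("subj_type")) and bool(example.get("obj_type"))

def drop_untyped(examples):
    return [ex for ex in examples if has_exact_ner_types(ex)]
```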

leonhardhennig commented 2 years ago

We could also use KBP37, which is very similar to TACRED.
Source: https://github.com/thunlp/RE-Context-or-Names/tree/master/finetune/supervisedRE/data/kbp37 or https://github.com/zhangdongxu/kbp37
Paper: https://arxiv.org/abs/1508.01006

I also have a JSONL version

phucdev commented 2 years ago

Quick update

I added the following things:

Then I merged all of the data, shuffled the merged data, created an 80% train, 10% dev, 10% test split, and uploaded it to the cluster under /ds/text/UnionizedRelExDataset.

Potential data leakage in the unionized dataset

I noticed the following potential problem with this "merge all the data and create our own split" approach: Datasets may contain multiple examples with the same text/tokens, but different pairs of entities, which was not problematic when they were in the same split. As a result we may have some data leakage in the unionized dataset.

phucdev commented 2 years ago

Addressing the data leakage issue

I implemented option b) and added the option to move examples from the train split to the test split if their token sequence/text is also seen in the test split.
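A sketch of that step, assuming each example carries a `token` list (the real implementation may key on the raw text instead):

```python
def move_leaky_examples(train, test):
    """Move train examples whose token sequence also occurs in the test split
    over to the test split, so a sentence never appears on both sides."""
    test_texts = {tuple(ex["token"]) for ex in test}
    clean_train, moved = [], []
    for ex in train:
        (moved if tuple(ex["token"]) in test_texts else clean_train).append(ex)
    return clean_train, test + moved
```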

This resulted in a 70.3% train, 15.6% dev, 14.1% test split.

The model trained on this dataset had the following evaluation results:

The model trained on the initial 80% train, 10% dev, 10% test split using option a) had the following evaluation results:

Keep in mind that we derived the "type" field from the relation type for most of the examples that had no entity type annotation, and that all examples without a "type" field were filtered out for training and testing.
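For the derived "type" field, the heuristic presumably keys on the relation label; a hedged sketch for the subject side (the tail type would come from the entity type order recorded in the relation mappings):

```python
# Illustrative only: coarse subject type from the unified label prefix.
PREFIX_TO_SUBJ_TYPE = {
    "per": "PERSON",
    "org": "ORGANIZATION",
    "loc": "LOCATION",
}

def subject_type_from_relation(relation):
    """Guess the subject entity type from labels such as 'per:title'."""
    prefix = relation.split(":", 1)[0]
    return PREFIX_TO_SUBJ_TYPE.get(prefix)  # None -> example is filtered out
```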

phucdev commented 2 years ago

We will go with option b) (keep splits and make our own splits if the dataset does not have a dedicated split + measures against data leakage). The only thing left to check is:

Possible data leakage if datasets used the same raw data: we currently move examples (same sentence text, but different entity pair) from train to test on a per-dataset basis. SMiLER, KNET and TACRED were constructed from Wikipedia text, so it is possible that we find some overlap across datasets as well.

phucdev commented 2 years ago

After skimming through the publications I found that most of the datasets are based on Wikipedia/Wikidata/DBpedia. Only the authors of KBP37, TACRED and DocRED actually describe which Wikipedia dump was used to construct their corpus.

I manually checked for duplicates and found some sentences that were shared across datasets. To fix this I added another processing step to move those sentences from the merged train split to the merged dev/test split.
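The manual check can be approximated with a small overlap report, sketched here (the normalization is illustrative):

```python
from itertools import combinations

def normalize(tokens):
    """Light normalization so near-identical sentences compare equal."""
    return " ".join(t.lower() for t in tokens)

def cross_dataset_overlap(datasets):
    """datasets: dict mapping dataset name -> list of examples with a 'token'
    field. Returns the number of shared sentences per dataset pair."""
    sentence_sets = {name: {normalize(ex["token"]) for ex in exs}
                     for name, exs in datasets.items()}
    return {(a, b): len(sentence_sets[a] & sentence_sets[b])
            for a, b in combinations(sentence_sets, 2)}
```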

This resulted in a 70.1% train, 15.7% dev, 14.2% test split.

151087 examples in the train split
33887 examples in the dev split
30562 examples in the test split

IMO this issue can be closed for now

leonhardhennig commented 2 years ago

nice, thanks!