I uploaded the original datasets to the GPU cluster. The test split of KnowledgeNet does not contain facts, i.e. only the train split is fully annotated.
@leonhardhennig
DOCRED: filter to only include sentences? (I agree that mixing datasets with document-level vs sentence-level relations is probably not a good idea.)
Does this mean that we only include examples where the head and tail entity are in the same sentence? The evidence field should then probably contain only one sentence id. In the example in the paper we would only use the first relation example and discard the second one.
@leonhardhennig
1. Are there any guidelines for the naming of the relations? Or is there a starter set for the PLASS relation type set? I'm trying to figure out how to construct the unified relation type set and the mapping from the different datasets.
I'd start with TACRED, and the Knowledge Net mappings you already have. I checked all relations and their types here: https://www.overleaf.com/9545956456ysgmbxxrsxfv - if you can make sense of my notation, this might be a start for the mapping. E.g. for Docred, the section "covered in TACRED/KNET"
2. It probably makes sense to use the TACRED format for the "unionized" dataset, right?
Yes (either the original json or the "dfki" jsonl version, which is a bit more efficient to read).
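For reference, a single example in the TACRED-style format is roughly a dict like the sketch below (core field names only, all values invented; the "dfki" jsonl variant stores one such object per line):

```python
# Illustrative TACRED-style record; values are made up, field names are the core TACRED ones.
example = {
    "id": "e7798fb926b9403cfcd2",    # hypothetical example id
    "relation": "per:employee_of",
    "token": ["Sarah", "works", "for", "Acme", "Corp", "."],
    "subj_start": 0, "subj_end": 0,  # inclusive token span of the head entity
    "obj_start": 3, "obj_end": 4,    # inclusive token span of the tail entity
    "subj_type": "PERSON",
    "obj_type": "ORGANIZATION",
}
```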
Does this mean that we only include examples, where the head and tail entity are in the same sentence?
yes, that was the idea, if there are any examples of that type
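The sentence-level filter would then be conceptually something like this (a sketch, assuming the standard DocRED fields sents, vertexSet and labels; a fact is kept only if some head mention and some tail mention share a sentence):

```python
def intra_sentence_facts(doc):
    """Keep only DocRED facts whose head and tail have mentions in the same sentence."""
    kept = []
    for fact in doc.get("labels", []):
        head_sents = {m["sent_id"] for m in doc["vertexSet"][fact["h"]]}
        tail_sents = {m["sent_id"] for m in doc["vertexSet"][fact["t"]]}
        shared = head_sents & tail_sents
        if shared:
            # the evidence then shrinks to a single sentence id containing both arguments
            kept.append(dict(fact, evidence=[min(shared)]))
    return kept
```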
@phucdev @kikoucha one more request - if you haven't started a lot of training yet: the original TACRED dataset contains quite a few mislabeled examples, and we actually published a patch for the test set - https://github.com/DFKI-NLP/tacrev#-patch-the-original-tacred . Is there still time to integrate this? Sorry that I forgot to put this in the issue earlier. The patched version could maybe be stored on the GPU cluster as well, e.g. in koeln:/ds/text/tacred/patched_tacrev?
I'll look into it and upload the patched version to the GPU cluster
I did not have permission to store it in koeln:/ds/text/tacred/patched_tacrev, so I stored it in koeln:/ds/text/patched_tacrev
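Applying the patch conceptually just overwrites the relation label per example id; a minimal sketch (the actual patch file format and script in the tacrev repo may differ):

```python
import json

def apply_label_patch(original_path, patch_path, output_path):
    """Overwrite relation labels in a TACRED split with revised labels.
    Sketch only: assumes the patch is a mapping from example id to the new label."""
    with open(original_path) as f:
        examples = json.load(f)  # the original TACRED json is a list of example dicts
    with open(patch_path) as f:
        revised = json.load(f)   # assumed format: {example_id: revised_relation_label}
    for ex in examples:
        ex["relation"] = revised.get(ex["id"], ex["relation"])
    with open(output_path, "w") as f:
        json.dump(examples, f)
```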
A set of questions:
1. Do we merge the location relation types (city|stateorprovince|country)_of_xxx into place_of_xxx, i.e. map the more specific location relation types in TACRED to more general ones? Otherwise we would have a mix of specific and general location relation types, which would most likely confuse a model trained on that data.
2. In TACRED we have org:parents & org:subsidiaries and per:parents & per:children. Do we merge, for example, per:parents & per:children by mapping per:parents to per:children and flipping the arguments? I ask because datasets such as KNET only contain a CHILD_OF relation type (and SUBSIDIARY_OF).
So do we:
a) merge the parent/children relation types: map one to the other and flip the arguments
b) keep both relation types
c) keep both relation types and add the inverse relation type, i.e. infer per:parents from CHILD_OF in KNET
3. Some of the datasets, e.g. KNET, do not contain negative examples. Do we generate negative examples for those, or are the negative examples in TACRED, GIDS and PLASS enough?
4. I tried to follow the naming scheme in TACRED and came up with the following "unified" label set: https://github.com/DFKI-NLP/sherlock/blob/wip_add_dataset_preprocessing/sherlock/dataset_preprocessors/relation_types.py
It is not as obvious how to add an NER prefix to the following labels:
publisher, (org/per, work_of_art) P123 -> org:publisher or per:publisher
head_of_gov/state, (gpe, per) P6, P35 -> gpe:head_of_gov/state
location, (fac/event/item, loc) P276 -> ?
Do we merge the location relation types (city|stateorprovince|country)_of_xxx into place_of_xxx, i.e. map the more specific location relation types in TACRED to more general ones? Otherwise we would have a mix of specific and general location relation types, which would most likely confuse a model trained on that data.
Yes, makes sense.
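For the record, this merge then amounts to a label rewrite along the following lines (a sketch; the actual mapping is kept in relation_types.py):

```python
# Collapse the specific TACRED location relations into the general place_of_xxx labels.
LOCATION_MERGE = {
    "per:city_of_birth": "per:place_of_birth",
    "per:stateorprovince_of_birth": "per:place_of_birth",
    "per:country_of_birth": "per:place_of_birth",
    "per:city_of_death": "per:place_of_death",
    "per:stateorprovince_of_death": "per:place_of_death",
    "per:country_of_death": "per:place_of_death",
    "per:cities_of_residence": "per:places_of_residence",
    "per:stateorprovinces_of_residence": "per:places_of_residence",
    "per:countries_of_residence": "per:places_of_residence",
    "org:city_of_headquarters": "org:place_of_headquarters",
    "org:stateorprovince_of_headquarters": "org:place_of_headquarters",
    "org:country_of_headquarters": "org:place_of_headquarters",
}

def merge_location_label(label: str) -> str:
    """Map a specific location relation to its general counterpart; leave other labels untouched."""
    return LOCATION_MERGE.get(label, label)
```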
Merge child/parent relation types/ add inverse relation types
No. I know this is inconsistent, as the datasets handle this differently (some use inverse relations, some don't, sometimes not for all relations, ...). So option b): just keep everything as it is, don't add/augment by generating inverse relations.
Generate negative examples
Hmm, not sure. For now, let's not generate any new negatives and see what the %NA value looks like in the end.
Naming
Looks good. We could have both per:publisher and org:publisher (similar to per:parents/org:parents), and the same for location (fac:loc, event:loc, item:loc). gpe:head_of_gov/state is also ok.
The main problem will be that there will be many relations - too many to validate with crowd workers. The NER tagset already had to be split into 3 separate crowd validation jobs; relations will be even harder to annotate. Let's look over the final relation_types list together and see how many examples we have per relation type - maybe we need to drop long-tail relations...
Okay, a quick follow-up on 4. Naming:
Naming
Looks good. We could have both per:publisher and org:publisher (similar to per:parents/org:parents), and the same for location (fac:loc, event:loc, item:loc). gpe:head_of_gov/state is also ok.
In order to get the right prefix we would need the correct entity type for the subject, but some of the datasets (e.g. FewRel) do not contain NER annotation. I also suspect that further separating the relations might worsen the main problem you described.
OK, then let's drop those relations for now.
I went through the DocRED and FewRel mappings again, fixed some mistakes, added some missing ones, swapped arguments for some relations to fit the TACRED relations and added useful information to the relation mappings (entity type order). I then added a bit of logging to get some information on the data and the resulting label distribution.
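The distribution logging itself is essentially just a Counter over the converted examples, e.g. (assuming the converted splits are in the dfki jsonl format with one example per line and a "relation" field):

```python
import json
from collections import Counter

def log_label_distribution(jsonl_path):
    """Print the relation label distribution of a converted (jsonl) split."""
    labels = Counter()
    with open(jsonl_path) as f:
        for line in f:
            labels[json.loads(line)["relation"]] += 1
    print(f"# {sum(labels.values())} examples in converted file")
    print(dict(labels))
```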
# 68124 examples in original file, 68124 examples in converted file, 0 examples were discarded during label mapping
tacrev_train = {'org:founded_by': 124, 'no_relation': 55112, 'per:employee_of': 1524, 'org:alternate_names': 808, 'per:places_of_residence': 1150, 'per:children': 211, 'per:title': 2443, 'per:siblings': 165, 'per:religion': 53, 'per:age': 390, 'org:website': 111, 'org:member_of': 122, 'org:top_members/employees': 1890, 'org:place_of_headquarters': 1079, 'org:members': 170, 'per:spouse': 258, 'org:number_of_employees/members': 75, 'org:parents': 286, 'org:subsidiaries': 296, 'per:origin': 325, 'org:political/religious_affiliation': 105, 'per:other_family': 179, 'per:place_of_birth': 131, 'org:dissolved': 23, 'per:date_of_death': 134, 'org:shareholders': 76, 'per:alternate_names': 104, 'per:parents': 152, 'per:schools_attended': 149, 'per:cause_of_death': 117, 'per:place_of_death': 136, 'org:founded': 91, 'per:date_of_birth': 63, 'per:charges': 72}
# 22631 examples in original file, 22631 examples in converted file, 0 examples were discarded during label mapping
tacrev_dev ={'per:title': 969, 'no_relation': 17331, 'per:place_of_death': 210, 'per:origin': 240, 'per:date_of_death': 188, 'org:top_members/employees': 528, 'org:place_of_headquarters': 387, 'per:religion': 62, 'per:place_of_birth': 94, 'per:employee_of': 407, 'org:website': 91, 'per:cause_of_death': 175, 'org:subsidiaries': 105, 'per:places_of_residence': 312, 'per:siblings': 29, 'org:alternate_names': 348, 'per:parents': 56, 'per:spouse': 155, 'per:age': 245, 'per:date_of_birth': 31, 'per:children': 105, 'org:parents': 81, 'per:schools_attended': 50, 'per:charges': 115, 'org:shareholders': 35, 'org:founded': 34, 'org:founded_by': 75, 'org:members': 64, 'per:alternate_names': 37, 'org:number_of_employees/members': 24, 'org:political/religious_affiliation': 14, 'per:other_family': 26, 'org:member_of': 7, 'org:dissolved': 1}
# 15509 examples in original file, 15509 examples in converted file, 0 examples were discarded during label mapping
tacrev_test = {'no_relation': 12386, 'per:title': 509, 'org:place_of_headquarters': 232, 'org:top_members/employees': 348, 'per:parents': 86, 'per:age': 201, 'per:places_of_residence': 333, 'per:children': 39, 'org:alternate_names': 245, 'per:charges': 106, 'per:origin': 111, 'org:founded_by': 76, 'per:employee_of': 252, 'per:siblings': 59, 'per:cause_of_death': 47, 'org:website': 27, 'per:place_of_death': 39, 'org:parents': 57, 'org:subsidiaries': 31, 'per:other_family': 37, 'org:number_of_employees/members': 14, 'per:religion': 43, 'per:date_of_birth': 7, 'org:shareholders': 3, 'per:spouse': 66, 'org:member_of': 4, 'per:schools_attended': 31, 'per:date_of_death': 42, 'org:political/religious_affiliation': 10, 'org:founded': 37, 'org:members': 16, 'per:place_of_birth': 13, 'org:dissolved': 1, 'per:alternate_names': 1}
# 10895 examples in converted file, 0 examples were discarded during label mapping
knet_train = {'org:subsidiaries': 432, 'per:date_of_death': 493, 'per:origin': 513, 'org:top_members/employees': 514, 'per:schools_attended': 756, 'org:founded_by': 595, 'per:spouse': 1092, 'per:employee_of': 1408, 'org:founded': 412, 'per:political_affiliation': 504, 'per:place_of_birth': 909, 'per:places_of_residence': 1187, 'per:date_of_birth': 619, 'org:place_of_headquarters': 733, 'per:children': 728}
# 11297 examples in original file, 11297 examples in converted file, 0 examples were discarded during label mapping
gids_train = {'per:place_of_death': 2088, 'per:place_of_birth': 2001, 'no_relation': 2771, 'per:schools_attended': 2652, 'per:degree': 1785}
# 1864 examples in original file, 1864 examples in converted file, 0 examples were discarded during label mapping
gids_dev = {'per:place_of_death': 365, 'per:schools_attended': 439, 'no_relation': 447, 'per:place_of_birth': 323, 'per:degree': 290}
# 5663 examples in original file, 5663 examples in converted file, 0 examples were discarded during label mapping
gids_test = {'per:place_of_death': 1016, 'per:place_of_birth': 1032, 'no_relation': 1356, 'per:degree': 894, 'per:schools_attended': 1365}
# 14689 examples in converted file, 6237 examples were discarded during label mapping
docred_train = {'org:place_of_headquarters': 143, 'loc:country': 3745, 'loc:located_in': 2605, 'per:country_of_citizenship': 1405, 'per:date_of_birth': 994, 'per:place_of_birth': 356, 'org:founded': 193, 'org:dissolved': 60, 'per:member_of': 155, 'per:performer': 615, 'per:place_of_death': 113, 'per:date_of_death': 699, 'per:head_of_gov/state': 183, 'org:members': 73, 'loc:capital_of': 88, 'per:spouse': 215, 'per:parents': 208, 'per:children': 212, 'loc:country_of_origin': 223, 'per:notable_work': 127, 'per:political_affiliation': 267, 'org:location_of_formation': 43, 'event:conflict': 140, 'per:schools_attended': 146, 'org:production_company': 46, 'per:director': 152, 'per:employee_of': 115, 'org:shareholders': 98, 'per:title': 18, 'per:composer': 46, 'per:lyrics_by': 20, 'per:author': 221, 'per:producer': 59, 'per:siblings': 243, 'per:religion': 72, 'per:places_of_residence': 22, 'org:top_members/employees': 46, 'per:screenwriter': 74, 'per:creator': 77, 'org:founded_by': 57, 'org:developer': 148, 'org:product_or_technology_or_service': 72, 'org:parents': 38, 'org:subsidiaries': 52, 'loc:unemployment_rate': 1, 'loc:twinned_adm_body': 4}
# 4714 examples in converted file, 2030 examples were discarded during label mapping
docred_dev = {'org:place_of_headquarters': 47, 'loc:country': 1201, 'loc:located_in': 799, 'org:parents': 32, 'org:subsidiaries': 24, 'per:employee_of': 25, 'per:country_of_citizenship': 422, 'per:performer': 219, 'per:place_of_birth': 104, 'per:date_of_birth': 334, 'per:date_of_death': 232, 'org:product_or_technology_or_service': 32, 'org:founded': 56, 'loc:capital_of': 25, 'event:conflict': 58, 'per:political_affiliation': 81, 'per:member_of': 50, 'org:dissolved': 24, 'per:title': 5, 'per:places_of_residence': 5, 'org:shareholders': 37, 'per:spouse': 64, 'per:place_of_death': 31, 'per:composer': 17, 'loc:country_of_origin': 78, 'org:members': 15, 'per:director': 53, 'per:screenwriter': 10, 'per:producer': 13, 'org:production_company': 24, 'per:siblings': 100, 'per:children': 37, 'per:parents': 33, 'org:location_of_formation': 10, 'per:head_of_gov/state': 56, 'org:developer': 33, 'per:lyrics_by': 3, 'per:schools_attended': 38, 'org:top_members/employees': 15, 'per:author': 78, 'per:creator': 23, 'per:notable_work': 48, 'org:founded_by': 20, 'per:religion': 48, 'loc:twinned_adm_body': 2}
# 16800 examples in converted file, 28000 examples were discarded during label mapping
fewrel_train = {'per:religion': 700, 'per:head_of_gov/state': 700, 'per:country_of_citizenship': 700, 'per:performer': 700, 'per:title': 1400, 'org:location_of_formation': 700, 'loc:located_in': 1400, 'loc:country_of_origin': 700, 'per:director': 700, 'per:parents': 700, 'org:product_or_technology_or_service': 700, 'per:political_affiliation': 700, 'org:place_of_headquarters': 700, 'per:siblings': 700, 'loc:country': 700, 'per:places_of_residence': 700, 'org:subsidiaries': 700, 'org:shareholders': 700, 'per:composer': 700, 'per:screenwriter': 700, 'per:field_of_work': 700, 'per:notable_work': 700}
# 2800 examples in converted file, 8400 examples were discarded during label mapping
fewrel_dev = {'per:spouse': 700, 'per:parents': 700, 'org:members': 700, 'per:children': 700}
# 35806 examples in original file, 19213 examples in converted file, 16593 examples were discarded during label mapping
smiler_train = {'per:children': 1019, 'org:members': 590, 'per:member_of': 646, 'per:country_of_citizenship': 2772, 'per:title': 1904, 'loc:location_of': 2953, 'per:siblings': 717, 'no_relation': 1319, 'per:director': 2241, 'per:spouse': 949, 'per:place_of_birth': 1754, 'org:top_members/employees': 614, 'per:parents': 1089, 'per:origin': 193, 'org:place_of_headquarters': 146, 'org:founded_by': 307}
# 731 examples in original file, 393 examples in converted file, 338 examples were discarded during label mapping
smiler_test = {'no_relation': 27, 'per:origin': 4, 'per:director': 46, 'per:children': 21, 'org:top_members/employees': 13, 'per:place_of_birth': 36, 'per:title': 39, 'per:country_of_citizenship': 57, 'per:spouse': 19, 'org:founded_by': 6, 'per:siblings': 15, 'per:member_of': 13, 'loc:location_of': 60, 'org:members': 12, 'per:parents': 22, 'org:place_of_headquarters': 3}
# 15917 examples in original file, 15917 examples in converted file, 0 examples were discarded during label mapping
kbp37_train = {'per:employee_of': 3472, 'org:place_of_headquarters': 2790, 'org:members': 703, 'org:founded_by': 355, 'org:subsidiaries': 402, 'per:places_of_residence': 3043, 'no_relation': 1545, 'per:title': 641, 'org:top_members/employees': 576, 'org:founded': 393, 'org:alternate_names': 511, 'per:spouse': 258, 'per:place_of_birth': 355, 'org:parents': 430, 'per:alternate_names': 177, 'per:origin': 266}
# 1724 examples in original file, 1724 examples in converted file, 0 examples were discarded during label mapping
kbp37_dev = {'org:alternate_names': 63, 'org:place_of_headquarters': 341, 'per:places_of_residence': 290, 'org:parents': 54, 'per:origin': 28, 'org:subsidiaries': 49, 'org:founded': 53, 'no_relation': 210, 'per:spouse': 29, 'org:members': 82, 'per:alternate_names': 24, 'per:employee_of': 273, 'per:place_of_birth': 50, 'org:founded_by': 34, 'org:top_members/employees': 68, 'per:title': 76}
# 3405 examples in original file, 3405 examples in converted file, 0 examples were discarded during label mapping
kbp37_test = {'org:place_of_headquarters': 659, 'org:alternate_names': 125, 'per:places_of_residence': 564, 'org:members': 160, 'per:alternate_names': 46, 'org:parents': 103, 'per:spouse': 57, 'org:subsidiaries': 90, 'no_relation': 419, 'per:employee_of': 568, 'per:origin': 65, 'per:title': 137, 'org:founded': 107, 'org:top_members/employees': 136, 'org:founded_by': 80, 'per:place_of_birth': 89}
The collated dataset:
per:conflict -> (per/org, misc)
loc:country_of_origin -> (loc, misc/org/per), not sure about the NER prefix and the order; the original entity type order was (misc/org/per, loc)
per:developer -> (per/org, misc)
per:ethnic_group -> (per/loc, loc), with most of the examples looking something like (loc Australia, loc Australian)
org:founded (when mapped from inception) -> (loc/org/misc, time)
org:member_of -> (per/org, org)
KnowledgeNet, DocRED and FewRel do not have test splits. The FewRel splits have different relation types.
a) union of everything, create a new train/dev/test split (stratified)
- does not allow testing on original test splits
- better sample size per class
b) create the missing splits for the individual datasets, then do the merge
- allows testing on original splits
- more effort
Decision: use a)
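For option a), the stratified split can be done with scikit-learn; a rough sketch (relation labels with only a handful of examples, e.g. loc:unemployment_rate, may need special handling or have to be dropped for strict stratification to work):

```python
from sklearn.model_selection import train_test_split

def stratified_split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Split the unionized example list into train/dev/test, stratified by relation label."""
    labels = [ex["relation"] for ex in examples]
    train, rest, _, rest_labels = train_test_split(
        examples, labels,
        test_size=dev_frac + test_frac, stratify=labels, random_state=seed,
    )
    dev, test = train_test_split(
        rest,
        test_size=test_frac / (dev_frac + test_frac), stratify=rest_labels, random_state=seed,
    )
    return train, dev, test
```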
Most of the datasets contain NER annotation, but FewRel does not. Do we need to create NER mappings for those that contain NER annotation?
Follow the TACRED format, i.e. only keep 'NER' info for the 2 arguments of the relation. Either use explicit NER tags for those tokens or derive them from the relation type semantics. The NER tagset should correspond to the PLASS-NER unionized tagset. Hard cases: see below.
NER types are expected for model training in Sherlock (but not strictly necessary), so we could just do the union of all relation labels and filter non-NER-tagged train/dev/test instances later in Sherlock model training/evaluation.
Decision: discard all RE instances from source datasets where we don't have the exact NER type for either HEAD or TAIL.
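That filtering step is then trivial; a sketch below, assuming TACRED-style subj_type/obj_type fields (in the jsonl format the entity type may live in a per-argument "type" field instead):

```python
def has_argument_ner(example):
    """True if both relation arguments carry an explicit NER type."""
    return bool(example.get("subj_type")) and bool(example.get("obj_type"))

def filter_for_training(examples):
    # Discard RE instances where the exact NER type of HEAD or TAIL is unknown.
    return [ex for ex in examples if has_argument_ner(ex)]
```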
We could also use KBP37, which is very similar to TACRED.
Source: https://github.com/thunlp/RE-Context-or-Names/tree/master/finetune/supervisedRE/data/kbp37 or https://github.com/zhangdongxu/kbp37
Paper: https://arxiv.org/abs/1508.01006
I also have a JSONL version
I added the following things:
Then I merged all of the data, shuffled it, created an 80% train, 10% dev, 10% test split and uploaded it to the cluster under /ds/text/UnionizedRelExDataset.
I noticed the following potential problem with this "merge all the data and create our own split" approach: datasets may contain multiple examples with the same text/tokens but different pairs of entities, which was not problematic while those examples stayed in the same split. As a result, we may have some data leakage in the unionized dataset.
I implemented option b) and added the option to move examples from the train split to the test split if their token sequence/text is also seen in the test split.
This resulted in a 70.3% train, 15.6% dev, 14.1% test split.
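The leakage handling is conceptually something like this (a sketch; the actual implementation in the preprocessing scripts may differ, and it assumes the tokenized text is stored in a "token" field):

```python
def move_leaky_examples(train, test):
    """Move train examples whose token sequence also occurs in the test split over to the test split,
    so that no sentence text is shared between train and test."""
    test_texts = {tuple(ex["token"]) for ex in test}
    clean_train, moved = [], []
    for ex in train:
        (moved if tuple(ex["token"]) in test_texts else clean_train).append(ex)
    return clean_train, test + moved
```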
The model trained on this dataset had the following evaluation results:
The model trained on the initial 80% train, 10% dev, 10% test split using option a) had the following evaluation results:
Keep in mind that we deduced the "type" field from the relation type for most of the examples that had no entity type annotation, and that all examples without a "type" field were filtered out for training and testing.
We will go with option b) (keep splits and make our own splits if the dataset does not have a dedicated split + measures against data leakage). The only thing left to check is:
Possible data leakage if datasets used the same raw data: we currently move examples (same sentence text, but different entity pair) from train to test on a per-dataset basis. SMiLER, KNET and TACRED were constructed from Wikipedia text, so we might find some overlap across datasets as well.
After skimming through the publications I found that most of the datasets are based on Wikipedia/Wikidata/DBpedia. Only the authors of KBP37, TACRED and DocRED actually describe which Wikipedia dump was used for constructing their corpus.
I manually checked for duplicates and found some sentences that were shared across datasets. To fix this, I added another processing step that moves those sentences from the merged train split to the merged dev/test split.
This resulted in a 70.1% train, 15.7% dev, 14.2% test split.
151087 examples in the train split, 33887 examples in the dev split, 30562 examples in the test split
IMO this issue can be closed for now
nice, thanks!
Prepare a large supervised dataset for training binary RC models. The dataset should be in the dfki-tacred-jsonl format, so that we can use the corresponding reader. Merge the following datasets, creating a mapping between relations as necessary (see https://www.overleaf.com/9545956456ysgmbxxrsxfv), by merging their respective train, dev and test splits:
Exclude relation types that I listed as "not so useful" per dataset in https://www.overleaf.com/9545956456ysgmbxxrsxfv. If a dataset does not have a freely available test split (e.g. FewRel), just ignore it.
Store the original datasets on the GPU cluster in /ds/text (so that we have them available for future experiments as well). Store the "unionized" large dataset there as well, with a sensible name + readme describing/pointing to our work.