cenguix / Text2KGBench

Repo ISWC-2023 Tekgen Corpus Submission
Apache License 2.0

Only one triple per sentence in wikidata_tekgen dataset #18

Open swissarthurfreeman opened 1 week ago

swissarthurfreeman commented 1 week ago

Hello,

We're trying to re-use your dataset to further study fact extraction as part of a master's thesis at the University of Geneva.

This might be me misunderstanding something, but in the wikidata_tekgen train split under data/wikidata_tekgen/train there appears to be only a single triple per sentence, and the JSON object format is not the same as in the rest of the JSONL files: there is no `triples` key; instead each object has `sub_label`, `rel_label`, and `obj_label` fields. Certain sentences repeat themselves, though. For example,

Resident Evil: Damnation, known as Biohazard: Damnation ( , Baiohazādo: Damunēshon) in Japan, is a 2012 Japanese adult animated biopunk horror action film by Capcom and Sony Pictures Entertainment Japan, directed by Makoto Kamiya and produced by Hiroyuki Kobayashi.

This sentence appears in both ont_1_movie_train_27 and ont_1_movie_train_612, and each entry carries a single triple. Why are these two separate JSON objects? Wouldn't it make sense to fold them into one?
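To make the mismatch concrete, this is roughly the shape difference we mean. The outer keys `sub_label`, `rel_label`, `obj_label`, and `triples` are the ones visible in the files; the ids, values, and the inner `sub`/`rel`/`obj` names below are our illustration, not verbatim dataset entries:

```json
{"id": "ont_1_movie_train_27", "sent": "Resident Evil: Damnation, ...", "sub_label": "Resident Evil: Damnation", "rel_label": "director", "obj_label": "Makoto Kamiya"}
```

versus the folded shape used by the other JSONL files:

```json
{"id": "ont_1_movie_test_1", "sent": "...", "triples": [{"sub": "...", "rel": "...", "obj": "..."}, {"sub": "...", "rel": "...", "obj": "..."}]}
```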

In dbpedia_webnlg, on the other hand, the train and test JSONL files share the same format, and the training data does contain multiple triples per sentence. Is this intentional, or have the wrong files perhaps been uploaded to the repository? The implication is that a model fine-tuned on wikidata_tekgen may learn to extract only a single triple per sentence, since some train sentences have multiple valid triples under the ontology that are missing from the train files.
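For what it's worth, here is a quick way to count how widespread the repetition is; the file path and the `sent` key are guesses based on the layout described above:

```python
import json
from collections import Counter

# Count how often the same sentence text occurs across the flat
# one-triple-per-line records of one train file (path hypothetical).
with open("data/wikidata_tekgen/train/ont_1_movie_train.jsonl") as f:
    counts = Counter(json.loads(line)["sent"] for line in f)

repeated = {s: n for s, n in counts.items() if n > 1}
print(f"{len(repeated)} sentences appear more than once")
```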

Best regards,

A. Freeman

nandana commented 2 days ago

@swissarthurfreeman sorry for the delay in responding. We are quite happy that you plan to use this dataset in your thesis.

This is the notebook we used for data preparation in wikidata_tekgen. I think we overlooked the folding aspect (the same sentence repeated with multiple ground truths). We can update the dataset by checking all occurrences of repeated sentences and folding their triples. Please feel free to send a pull request with any improvements as well. If you see substantial improvements to the datasets, we would be quite happy to collaborate on releasing a refined version.
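In case it is useful in the meantime, a minimal folding sketch along the lines of what we have in mind, assuming the flat `sub_label`/`rel_label`/`obj_label` records and a `sent` field as described above (exact key names and paths may differ in the actual files):

```python
import json
from collections import defaultdict

def fold_triples(in_path: str, out_path: str) -> None:
    """Group flat one-triple-per-line records by sentence and write one
    record per sentence carrying a triples list, mirroring the format of
    the other JSONL files in the benchmark."""
    grouped = defaultdict(list)  # sentence text -> list of triples
    first_id = {}                # sentence text -> id of first occurrence
    with open(in_path) as f:
        for line in f:
            rec = json.loads(line)
            grouped[rec["sent"]].append(
                {"sub": rec["sub_label"], "rel": rec["rel_label"], "obj": rec["obj_label"]}
            )
            first_id.setdefault(rec["sent"], rec["id"])
    with open(out_path, "w") as f:
        for sent, triples in grouped.items():
            f.write(json.dumps({"id": first_id[sent], "sent": sent, "triples": triples}) + "\n")

# Usage (hypothetical paths):
# fold_triples("data/wikidata_tekgen/train/ont_1_movie_train.jsonl",
#              "ont_1_movie_train_folded.jsonl")
```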

Let us know if you need any further information. Thanks again for your feedback.