Open swissarthurfreeman opened 1 week ago
@swissarthurfreeman sorry for the delay in responding. We are quite happy that you plan to use this dataset in your thesis.
This is the notebook we used for data preparation in wikidata_tekgen. I think we have overlooked the folding aspect (same sentence, repeating with multiple ground truths). We can update the dataset by checking all occurrences of repeated sentences and folding the triples. Please feel free to send a pull request with any improvements as well. If you see substantial improvements to the datasets, we will be quite happy to collaborate to release a refined version of it.
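As a rough illustration of the folding described above, here is a minimal sketch that groups records by sentence and collects their triples under a single triples key. The sub_label, rel_label and obj_label fields come from the issue; the id and sent key names, the sub/rel/obj layout of each folded triple, and the fold_triples helper itself are assumptions made for this example, not the code from the preparation notebook.

```python
import json


def fold_triples(records, sent_key="sent", id_key="id"):
    """Merge records that share the same sentence into one record.

    Assumes each input record holds one triple in the fields
    sub_label / rel_label / obj_label (as reported in the issue).
    The key names `sent` and `id` are assumptions for this sketch.
    """
    folded = {}  # dicts keep insertion order, so first-seen order is preserved
    for rec in records:
        sent = rec[sent_key]
        triple = {
            "sub": rec["sub_label"],
            "rel": rec["rel_label"],
            "obj": rec["obj_label"],
        }
        # The first record seen for a sentence supplies the id of the folded record.
        entry = folded.setdefault(
            sent, {id_key: rec[id_key], sent_key: sent, "triples": []}
        )
        # Skip exact duplicate triples for the same sentence.
        if triple not in entry["triples"]:
            entry["triples"].append(triple)
    return list(folded.values())


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling fold_triples(read_jsonl(path)) on a train file would then yield one object per distinct sentence. Note that this matches sentences by exact string equality, so near-duplicates that differ in whitespace or encoding would not be merged.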
Let us know if you need any further information. Thanks again for your feedback.
Hello,
We're trying to re-use your dataset to further study fact extraction as part of a master's thesis at the University of Geneva.
This might be me not understanding something, but in the wikidata_tekgen train data split under data/wikidata_tekgen/train, there appears to be only a single triple per sentence, and the JSON object format is not the same as in the rest of the jsonl files: there is no triples key, but instead a sub_label, rel_label and obj_label. Certain sentences repeat themselves though, for example:
Resident Evil: Damnation, known as Biohazard: Damnation ( , Baiohaz\u00c4\u0081do: Damun\u00c4\u0093shon) in Japan, is a 2012 Japanese adult animated biopunk horror action film by Capcom and Sony Pictures Entertainment Japan, directed by Makoto Kamiya and produced by Hiroyuki Kobayashi.
This sentence appears in ont_1_movie_train_27 and ont_1_movie_train_612, and each has a single triple. Why are these two separate JSON objects? Wouldn't it make sense to fold them into one?
In dbpedia_webnlg, on the other hand, the train and test jsonl files are the same, and in the training data there are multiple triples per sentence. Is this normal, or have the wrong files perhaps been uploaded to the repository? What this implies is that a model might only extract a single triple every time on wikidata_tekgen, since certain train sentences can have multiple triples following the ontology that aren't in the train files.
Best regards,
A. Freeman