apoorvumang / kgt5

ACL 2022: Sequence-to-Sequence Knowledge Graph Completion and Question Answering (KGT5)
Apache License 2.0

KGQA data split #13

Closed ccchobits closed 2 years ago

ccchobits commented 2 years ago

Hi Apoorv,

Thank you for the nice work on KGT5.

I have a question: how can we find the data split for KGQA? I went through the link for downloading datasets, but the KGQA dataset splits mentioned in the paper are not included. I just wonder whether I missed something, or whether they are not ready for release yet?

Thanks and regards.

apoorvumang commented 2 years ago

Hi, thanks for your interest!

The data split hasn't been uploaded yet but we will upload soon. Is there any specific data that you are looking for?

ccchobits commented 2 years ago

Thank you for your reply.

We are very interested in the KGQA task in the incomplete-KG setting for research purposes. We would really appreciate it if the 50% splits of MetaQA and WQSP could be uploaded.

Btw, the complete subsets of Freebase that you created for WQSP and CWQ are also critical for us. If possible, please upload them as well. Thank you for your effort.

Best regards.

apoorvumang commented 2 years ago

I have uploaded the data here: https://storage.googleapis.com/t5-kgc-colab/data/data_kgqa.zip

I will be updating the README with instructions/details on the KGQA data very soon (the current dump may be a bit confusing).

In case you want any clarifications on the data, please feel free to continue this issue thread or raise a new issue.

ccchobits commented 2 years ago

> I have uploaded the data here: https://storage.googleapis.com/t5-kgc-colab/data/data_kgqa.zip
>
> I will be updating the README with instructions/details on the KGQA data very soon (the current dump may be a bit confusing).
>
> In case you want any clarifications on the data, please feel free to continue this issue thread or raise a new issue.

That's very nice. Thank you so much.

ccchobits commented 2 years ago

I have some confusion and might need your clarification.

Within the folder for each dataset, there are files /train.txt, /valid.txt and /qa_test.txt. Judging by the file names, I assume they are the splits for training, validation and testing. However, I found that /train.txt is a mixture of kgc and qa data (as explained in the paper), while /valid.txt and /qa_test.txt cover only kgc and qa data respectively. So I wonder: is it correct that this split should be used by training on kgc and qa together, validating on kgc, and testing on qa?

apoorvumang commented 2 years ago

Yes, you should use train.txt to train (combined kgc and qa training), and you may use valid.txt just to check that the model is not overfitting, since it is already in the processed format. However, as you pointed out, it contains only kgc lines, so */valid.txt has limited utility.

*/qa_test.txt contains test questions in the following format:

question_text<tab>answer_1|answer_2|...

This will need to be processed according to the dataset (see below notes) before passing as input to the model during inference.
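For reference, here is a minimal sketch of how such a file could be parsed (the tab/pipe format is as described above; the function name and details are just illustrative):

```python
# Minimal sketch for parsing qa_test.txt, assuming the
# question_text<tab>answer_1|answer_2|... format described above.
def load_qa_test(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            question, answers = line.split("\t", maxsplit=1)
            examples.append((question, answers.split("|")))
    return examples
```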

Notes:

  1. For ComplexWebQuestions, the KG is too big compared to the QA dataset size, hence I would not recommend combined training; you should probably first train on kgc_lines and then finetune on qa_lines. So there is no combined train.txt file for that dataset.
  2. For MetaQA and WebQuestionsSP, the entity name has been replaced in the question with 'NE', and the input to the model is in the format entity_name | question_without_entity. This scheme will have to be followed at inference (testing) time as well; see the sketch after this list. For ComplexWebQuestions there is no such entity name replacement. Please explore the train files for each dataset to see the preprocessing done.
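
To make note 2 concrete, here is a rough sketch of the input scheme (the helper name is hypothetical, and the naive exact-match substitution may differ from the actual preprocessing, so please verify against the train files):

```python
def build_qa_input(question: str, entity_name: str) -> str:
    # Replace the topic entity mention with 'NE' (naive exact-match
    # substitution; the real preprocessing may be more careful).
    question_without_entity = question.replace(entity_name, "NE")
    return f"{entity_name} | {question_without_entity}"

# Example:
# build_qa_input("what films did Tom Hanks appear in", "Tom Hanks")
# -> "Tom Hanks | what films did NE appear in"
```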

I will be adding this and more information to the README soon.

ccchobits commented 2 years ago

Thank you for your clarification.

ccchobits commented 2 years ago

In the raw text of qa_test.txt, there are a lot of non-English characters, and we need to preprocess them to get exact entity name matches with the ones you preprocessed for the other files. I have tried but failed to get exact matches for all entities containing non-English characters. Would you mind sharing the way you preprocess them? It is probably just one line of code. Thanks.

apoorvumang commented 2 years ago

As far as I can remember, I used the unidecode library: unidecode.unidecode('actual string') was applied to all text. Lowercasing was also done.
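
Something like this (the function name is just for illustration):

```python
from unidecode import unidecode

# Transliterate to ASCII and lowercase, as described above.
def normalize(text: str) -> str:
    return unidecode(text).lower()

# normalize("Björk Guðmundsdóttir") -> "bjork gudmundsdottir"
```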

screemix commented 2 years ago

Hi, @apoorvumang! Is there the same issue with the KGC split for link prediction? I think the link in the README leads to the same dataset without a split. I wasn't sure it deserves a separate issue, but if you think so, I can open one to discuss it in more detail.

apoorvumang commented 2 years ago

@screemix Sorry, I don't understand. Could you please create a new issue with more details?

apoorvumang commented 2 years ago

Closing for now, please reopen if you think this is not resolved @ccchobits