apoorvumang / kgt5

ACL 2022: Sequence-to-Sequence Knowledge Graph Completion and Question Answering (KGT5)
Apache License 2.0

Implementing the KGT5 pipeline for my own constructed KG #11

Open ankush9812 opened 2 years ago

ankush9812 commented 2 years ago

Hi Apoorv, great work with the KGT5 model.

I basically want to implement the entire KGT5 pipeline for my own constructed KG. I published a paper (Knowledge Graph – Deep Learning: A Case Study in Question Answering in Aviation Safety Domain) in LREC 2022, where I contributed the Aviation KG and showed results with a KG+DL QA system; the combined system performed better than the individually constructed DLQA and KGQA systems. Following your paper, I can handle training on the triples, but I need help with fine-tuning the model, since the code in your other branch is not clean and hard to follow. A README for that part would be great.

apoorvumang commented 2 years ago

Hi Ankush, thanks for your interest.

We will be adding clean QA fine-tuning code soon. However, if you want to do it yourself earlier (and are OK with modifying the code), you just need to follow these steps:

  1. Train the model on link prediction (use the existing code) and save it to disk. I would recommend the t5-base size with pretrained LM initialization. This can be done by changing line 139 in main_accelerate.py to model = T5ForConditionalGeneration.from_pretrained('t5-base')
  2. Make the QA training data by constructing input-output pairs of the form 'predict answer: <question>' and '<answer>'. If a question has multiple correct answers, make a QA pair for each answer. Place these as tab-separated lines in train.txt of a new dataset folder, e.g. mydataset_qa (see the sketch after this list). I would also suggest adding some lines from the link-prediction-style dataset to this file as well, maybe the same number as the number of QA lines (refer to the Sec 3.4 regularisation scheme). Then run training with that dataset. Also, make sure to specify the load_checkpoint command line arg pointing to the link prediction model, so that the pretrained model is loaded for fine-tuning.
  3. For inference, you can do top-4 beam search and select the highest-probability answer. I would recommend either writing your own code for this, or you could read eval_accelerate.py in the apoorv-dump branch for reference.
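
A minimal sketch of step 2's data construction, assuming a hypothetical list of (question, answers) pairs; the variable names and output path are illustrative only, and the directory layout should match the repo's existing datasets:

```python
# Sketch: write QA pairs in the tab-separated format described in step 2.
# `qa_data` and the output path are hypothetical placeholders.
qa_data = [
    ("what is the currency in china?", ["Yuan"]),
    ("who is the president of the United States?", ["obama"]),
]

with open("mydataset_qa/train.txt", "w") as f:
    for question, answers in qa_data:
        for answer in answers:  # one line per correct answer
            f.write(f"predict answer: {question}\t{answer}\n")
```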
ankush9812 commented 2 years ago

What is the link-prediction-style dataset? I have created the 5000.pt file for the triples of my constructed KG, and I have the train.txt file with question-answer pairs in the mydataset_qa folder. What are the next steps for fine-tuning and testing? I'm unclear about this.

apoorvumang commented 2 years ago

Could you please tell me, roughly,

  1. How many triples in your KG?
  2. How many QA pairs in your dataset?

Based on this, I can suggest the best next steps.

ankush9812 commented 2 years ago
  1. The total number of unique triples is 96k; 190k in total. Split: Train 80%, Dev 10%, Test 10%.
  2. I'm constructing QA pairs from my documents; roughly 1k pairs, but I will add more if required. Alternatively, I could use a general QA dataset such as SQuAD, since creating QA pairs from aviation accident reports is tough.

Is this information sufficient? What should be my next steps for fine-tuning?

apoorvumang commented 2 years ago

Since the number of QA pairs is relatively small, I would recommend the following:

  1. Concatenate the link prediction and QA training files, so the final train.txt will look like:
    predict tail: obama | president of\tUnited States
    predict head: United states | president of\tobama
    predict tail: ....
    .
    .
    predict answer: what is the currency in china?\tYuan
    predict answer: ...

    So the number of lines in train.txt would be: no. of train triples x 2 + no. of qa pairs

  2. Train the model using this combined train.txt as the training file (a small script for building the combined file is sketched below)
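
A small sketch of the concatenation in step 1, assuming hypothetical file names; the triples file is assumed to hold tab-separated head/relation/tail lines, which may differ from your actual format:

```python
# Sketch: build the combined train.txt (link prediction lines + QA lines).
# File names and the triples format (tab-separated head, relation, tail)
# are assumptions; adjust them to your actual data layout.
with open("mydataset/train.txt", "w") as out:
    # Each triple contributes two lines: tail prediction and head prediction.
    with open("kg_train_triples.txt") as triples:
        for line in triples:
            head, relation, tail = line.rstrip("\n").split("\t")
            out.write(f"predict tail: {head} | {relation}\t{tail}\n")
            out.write(f"predict head: {tail} | {relation}\t{head}\n")
    # Append the QA lines, already in 'predict answer: <question>\t<answer>' form.
    with open("mydataset_qa/train.txt") as qa:
        for line in qa:
            out.write(line)
```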
ankush9812 commented 2 years ago

Thank you for the reply. I have a doubt about KGT5; please correct me if I'm wrong. In my view, training the LM from scratch on a small or domain-specific KG will not work as well as fine-tuning pre-trained language models (results for my KG with BERT and GPT-3 are shown in this paper). Can we develop a pipeline in which we use a pre-trained LM, train it with triples, and then fine-tune it with QA pairs? I think there is currently a lot of work on domain-specific KG construction. Could we first train the LM by masking the text from which the KG is constructed, then use the triples to train it further, and finally fine-tune it with QA pairs? Something similar is done in QA-GNN and GreaseLM, but not exactly what I have proposed. Please give your comments on this.

apoorvumang commented 2 years ago

Yes, starting with a pretrained LM should be extremely helpful for such a dataset. As mentioned in my first reply, you can modify the code to do this. Here are the complete steps again:

  1. Change line 139 in main_accelerate.py to model = T5ForConditionalGeneration.from_pretrained('t5-base') so that whenever training is started, it uses the pretrained LM. Choose the LM size based on your capacity; if you have a smaller dataset, you can probably afford a larger model.
  2. Concatenate the training files for KG link prediction and QA into a single file. The final file should look as follows:
    predict tail: obama | president of\tUnited States
    predict head: United states | president of\tobama
    predict tail: ....
    .
    .
    predict answer: what is the currency in china?\tYuan
    predict answer: ...

    So the number of lines in train.txt would be: no. of train triples x 2 + no. of qa pairs. Place this train.txt in a new dataset folder (e.g. mydataset)

  3. Start training using the command in the README. For data-parallel training, the command could be:
    CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node 4 --use_env ./main_accelerate.py \
    --save_prefix mydataset_t5base \
    --model_size base --dataset mydataset \
    --batch_size 8 --save_steps 5000 \
    --loss_steps 500 --epochs 10

    Please choose the largest batch size that doesn't cause OOM.

  4. For inference, you can do top-4 beam search and select the highest-probability answer. I would recommend writing your own code for this, since the result is just a fine-tuned T5 LM and huggingface has lots of inference options to try, e.g. sampling, beam search, nucleus sampling, greedy decoding; see which works best for you (a minimal sketch follows below).
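
A minimal inference sketch using the Hugging Face generate API with top-4 beam search, assuming the fine-tuned model was saved with save_pretrained to a hypothetical checkpoint directory; the saving/loading details may differ from how main_accelerate.py stores checkpoints:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical path to the fine-tuned checkpoint saved via save_pretrained().
model = T5ForConditionalGeneration.from_pretrained("checkpoints/mydataset_t5base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model.eval()

def answer(question: str) -> str:
    inputs = tokenizer("predict answer: " + question, return_tensors="pt")
    # Top-4 beam search; generate() returns beams sorted by score,
    # so the first sequence is the highest-probability answer.
    outputs = model.generate(
        **inputs,
        num_beams=4,
        num_return_sequences=1,
        max_length=64,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer("what is the currency in china?"))
```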
apoorvumang commented 2 years ago

> Can we train the LM model by masking the text from which KG is constructed?

I have not addressed this in the comment above. It can probably be done, but it is out of scope for this repo. Please let me know if you are able to do it and whether it improves results.
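
For reference only, and not part of this repo: T5-style masked pretraining on raw text is usually done as span corruption with sentinel tokens. A rough sketch of the input/target format, using standard Hugging Face calls on a single hand-masked sentence (in practice the masking would be automated over your corpus, and this example sentence is purely illustrative):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# One denoising example: spans in the input are replaced by sentinel tokens,
# and the target reproduces each masked span after its sentinel.
input_text = "The pilot reported <extra_id_0> shortly after <extra_id_1> from the runway."
target_text = "<extra_id_0> an engine failure <extra_id_1> taking off <extra_id_2>"

inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(target_text, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()  # plug this into your own training loop / optimizer
```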

ankush9812 commented 2 years ago

Thank you for the answer.