NielsRogge / tapas_utils

A package containing utils for the PyTorch version of the Tapas algorithm.

Question on tapas fine-tuning on custom data #1


Sharathmk99 commented 4 years ago

Hi,

I saw your PR on TAPAS in the HuggingFace Transformers repository. I wanted to check whether I can use your code to fine-tune TAPAS on my own data. If so, could you provide a sample that I can follow?

NielsRogge commented 4 years ago

Hi,

thank you for your interest in the TAPAS algorithm; that's actually a really good question. The maintainers of Transformers are currently finishing the creation of TapasTokenizer, which prepares the data for the model. Once this is finished, I will create a full tutorial on how to fine-tune TAPAS on your own data, as well as add more documentation. Here's a preliminary draft:

1. Choose one of the 3 ways in which you can use TAPAS - or experiment

First, it's important to note that if your dataset is rather small (hundreds of training examples), it's advised to start from an already fine-tuned checkpoint of TapasForQuestionAnswering. There are 3 such checkpoints one can fine-tune further, corresponding to the different datasets on which TAPAS was fine-tuned, and you should pick one of them:

  1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example, if you first ask "what's the name of the first actor?", you can then ask a follow-up question such as "how old is he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
  2. WTQ/WikiSQL: if you're not interested in a conversational set-up, but rather in asking standalone questions about a table, which might involve aggregation, such as counting the number of rows, summing cell values or averaging cell values. For example, you could ask "what's the total number of goals Cristiano Ronaldo scored in his career?". This case is also called weak supervision, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
  3. WikiSQL-supervised: this dataset is actually the same dataset as WikiSQL, but here the model is given the ground truth aggregation operator during training. This is also called strong supervision. Here, learning the appropriate aggregation operator is much easier.
To summarize:

| Task | Example dataset(s) | Description |
|---|---|---|
| Conversational | SQA | conversational, only cell selection questions |
| Weak supervision for aggregation | WTQ, WikiSQL | questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | questions might involve aggregation, and the model must learn this given both the answer and the gold aggregation operator |

If you want to fine-tune the classification heads of TapasForQuestionAnswering from scratch, you can actually experiment, and you don't have to choose one of these 3 options. You can define whatever hyperparameters you want when initializing TapasConfig, and then create a TapasForQuestionAnswering based on that configuration. For example, if you have a dataset that contains both conversational questions and questions that might involve aggregation, you can configure the model for both, as in the sketch below.
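For illustration, a minimal sketch of that (hedged: the exact TapasConfig hyperparameter names and the "tapas-base" checkpoint name are assumptions, since the API and the model hub entries are still being finalized at the time of writing):

```python
from transformers import TapasConfig, TapasForQuestionAnswering

# Define a custom configuration: a model that both selects cells and
# predicts one of 4 aggregation operators (NONE/SUM/COUNT/AVERAGE),
# learning the operator from the answer alone (weak supervision).
config = TapasConfig(
    num_aggregation_labels=4,
    use_answer_as_supervision=True,
)

# Initialize the base model, with randomly initialized classification
# heads on top, based on this configuration.
model = TapasForQuestionAnswering.from_pretrained("tapas-base", config=config)
```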

2. Prepare your data in the SQA format

Second, based on what you picked above, you should prepare your dataset in the SQA format (no matter which of the three you picked). This format is a TSV file with the following columns:

- id: a unique id for the table-question pair
- annotator: the id of the person who annotated the question (bookkeeping only)
- position: the position of the question in a sequence of questions asked about the same table
- question: the natural language question
- table_file: the name of the CSV file containing the tabular data
- answer_coordinates: a list of (row, column) coordinates of the cells that make up the answer
- answer_text: a list of the answer cells' text values
- aggregation_label: the index of the gold aggregation operator (strong supervision only)
- answer_float: the float answer to the question (only when aggregation is involved)

If you go for the first case (conversational set-up), it should look like this: [screenshot sqa_format_bis: an example of the SQA format]

This example is from the SQA development set, so the aggregation_label and answer_float columns are not present. Also note the id and annotator columns, which indicate which person annotated a given question; these are only there for bookkeeping purposes.

Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ and WikiSQL) into the SQA format; the author explains this here. This is actually interesting, because these conversion scripts are not perfect, meaning that the WTQ and WikiSQL results could actually still be improved.
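As a hedged illustration of what producing such a file could look like (the column names follow the list above; the file and table names are made up):

```python
import pandas as pd

# One training example in the SQA format (conversational case).
# Coordinates are stored as strings, as in the SQA files themselves.
data = {
    "id": ["example-1"],
    "annotator": [0],
    "position": [0],
    "question": ["what boats were lost on may 5"],
    "table_file": ["table_csv/203-386.csv"],
    "answer_coordinates": [["(1, 1)", "(2, 1)"]],
    "answer_text": [["U-638", "U-531"]],
}

pd.DataFrame(data).to_csv("train.tsv", sep="\t", index=False)
```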

3. Convert your data into PyTorch tensors using TapasTokenizer

Third, given that you've prepared your data in this TSV format (and corresponding CSV files containing the tabular data), you can then use TapasTokenizer to convert table-question pairs into input_ids, attention_mask, token_type_ids and so on. Again, based on which of the three cases you picked above, TapasForQuestionAnswering needs different things to be fine-tuned:

| Task | Required inputs |
|---|---|
| Conversational | input_ids, attention_mask, token_type_ids, label_ids |
| Weak supervision for aggregation | input_ids, attention_mask, token_type_ids, label_ids, numeric_values, numeric_values_scale, answer_float |
| Strong supervision for aggregation | input_ids, attention_mask, token_type_ids, label_ids, aggregation_labels |

Suppose we want to do this for SQA; you can then do the following:

```python
from transformers import TapasTokenizer
import pandas as pd

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-sqa")
table = pd.read_csv("table_csv/203-386.csv")
question = "what boats were lost on may 5"
# note: the TSV stores coordinates as strings like "(1, 1)";
# they have to be parsed into actual (row, column) tuples first
answer_coordinates = [(1, 1), (2, 1)]
answer_text = ["U-638", "U-531"]

inputs = tokenizer(table, question, answer_coordinates, answer_text)
```

Note that this is currently not possible, because TapasTokenizer is still undergoing some changes (the call method is being implemented) and the checkpoints are not yet on the HuggingFace model hub. It's also possible that we will not use Pandas to provide tables to the tokenizer, but rather a HuggingFace dataset.
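To go from the TSV file of step 2 to these tokenizer arguments, you'd have to parse the stringified coordinates back into tuples. A hedged sketch (file names are made up; ast.literal_eval is one way to do the parsing):

```python
import ast
import pandas as pd

data = pd.read_csv("train.tsv", sep="\t")

for _, row in data.iterrows():
    table = pd.read_csv(row["table_file"])
    question = row["question"]
    # the TSV cell holds a stringified list of coordinate strings;
    # turn it back into a list of (row, column) tuples
    answer_coordinates = [
        ast.literal_eval(coord) for coord in ast.literal_eval(row["answer_coordinates"])
    ]
    answer_text = ast.literal_eval(row["answer_text"])
```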

4. Train (fine-tune) TapasForQuestionAnswering

You can then fine-tune TapasForQuestionAnswering using native PyTorch as follows (I hope you're familiar with PyTorch; if not, I highly recommend this tutorial):

```python
from transformers import TapasForQuestionAnswering
from torch.optim import AdamW

model = TapasForQuestionAnswering.from_pretrained("tapas-base-finetuned-sqa", return_dict=True)
optimizer = AdamW(model.parameters(), lr=5e-5)

# forward pass: the model computes the loss from the labels in `inputs`
outputs = model(**inputs)
loss = outputs.loss

# backward pass and parameter update
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In the future, you might be able to use the Trainer class to train TAPAS, which will make this much easier. This is also a WIP.
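In the meantime, a full native-PyTorch loop could look roughly like this (a hedged sketch: it assumes you've wrapped the tokenized examples in a PyTorch Dataset called train_dataset, which is not shown here):

```python
import torch
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()

for epoch in range(3):
    for batch in train_dataloader:
        # move all tensors of the batch to the GPU, if available
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```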

I will update this once there is more progress!

Sharathmk99 commented 4 years ago

Hi @NielsRogge ,

This is an amazingly detailed explanation of how to fine-tune. Thank you for taking the time to detail the steps.

Yes, I'm familiar with PyTorch. I'm waiting for the TapasTokenizer implementation so that I can start fine-tuning. In the meantime, I'll prepare the dataset in TSV format.

Please post an update once I can use your transformers repo.

shashankMadan-designEsthetics commented 3 years ago

Hi @NielsRogge, I have some code on the WTQ utils (which, if nothing else, you can use just as a reference). We discussed this here. I'd like to open a PR for it, so I created a new branch modeling_tapas_v3 and pushed, but I get this error:

```
ERROR: Permission to NielsRogge/transformers.git denied to shashankMadan-designEsthetics.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
```

NielsRogge commented 3 years ago

Thanks for your contribution. Where can I find your code? I looked on your GitHub but couldn't find it.

shashankMadan-designEsthetics commented 3 years ago

@NielsRogge I haven't uploaded it yet, because I keep getting an access-denied error on your forked repo. This is what it gives me: `ERROR: Permission to NielsRogge/transformers.git denied to shashankMadan-designEsthetics. fatal: Could not read from remote repository.`

NielsRogge commented 3 years ago

You can only open a PR if you upload your code to your own GitHub account, because a PR is just a comparison between 2 branches. The PR will then compare your branch (from your account) to my branch (from my account).
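Concretely, the usual workflow looks something like this (a hedged sketch; replace <your-username> with your own account, and the branch name is taken from the comment above):

```
# fork NielsRogge/transformers on GitHub first, then push your branch
# to your own fork rather than to the original repo
git remote add fork https://github.com/<your-username>/transformers.git
git push fork modeling_tapas_v3
# now open a PR on GitHub comparing <your-username>:modeling_tapas_v3
# against the base branch
```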

shashankMadan-designEsthetics commented 3 years ago

Understood, and sorry for the hassle. I'll fork the repo and create a PR (on your repo).

shashankMadan-designEsthetics commented 3 years ago

Hi @NielsRogge, here is the link to my repo. It's basically just an integration of the WTQ utils from the original TAPAS code. The new files added are: tapas_file_utils, tapas_task_utils_test, tapas_task_utils, tapas_text_utils, tapas_wtq_utils.

AhmedMasryKU commented 3 years ago

@NielsRogge @Sharathmk99 @shashankMadan-designEsthetics Do you guys have any script that can populate the answer coordinates in my custom dataset? I looked at the HuggingFace training example, but it says that we need to follow the logic in the original TAPAS repo to populate the answer coordinates in our custom datasets. When I looked at the original code here, it was quite hard for me to follow, especially because the outputs would be in tfrecords. However, I want to use the HuggingFace code to fine-tune the model, since it's easier to edit and manipulate. So the tfrecords created there won't be useful; I need TSV files.

Thanks,

NielsRogge commented 3 years ago

Hi @AhmedMasryKU,

I've created a script that allows you to do this. I've updated this repo, check the README for more info :)

AhmedMasryKU commented 3 years ago

Thanks so much!

AhmedMasryKU commented 3 years ago

@NielsRogge Do you have any code snippet to convert the logits and aggregation logits to an actual answer? I am using your HuggingFace code, but the model only outputs the logits and aggregation_logits, so I am quite confused about how to infer the final answer from them.

NielsRogge commented 3 years ago

Yes, I've created a function especially for this; it's called convert_logits_to_predictions, see here: https://huggingface.co/transformers/model_doc/tapas.html#transformers.TapasTokenizer.convert_logits_to_predictions
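Usage looks roughly like this (a hedged sketch: inputs is the dict returned by TapasTokenizer, outputs is the model's forward output, and the aggregation logits attribute is called logits_aggregation in the documented API, even if it may be named aggregation_logits in older branches):

```python
# convert the model's logits into predicted cell coordinates
# and predicted aggregation operator indices
predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
    inputs,
    outputs.logits.detach(),
    outputs.logits_aggregation.detach(),
)
```

The first return value is a list (one entry per example) of the (row, column) coordinates of the selected cells; the second is the predicted aggregation operator index per example, which you can map back to NONE/SUM/COUNT/AVERAGE.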

ManasiPat commented 3 years ago

@NielsRogge In the comment above you have provided an example for the strong supervision format. Can you please tell me what the format for weak supervision should be?
I have set answer_coordinates, answer_text and the aggregation label to None, as I don't have that kind of ground truth, and answer_float to the answer I have. Do I need answer_coordinates to run the code? What if I don't have those as part of my ground truth?

ManasiPat commented 3 years ago

@NielsRogge Can you please let me know how to feed data to the HuggingFace TAPAS model for aggregation-type queries, where I have a float answer that is not part of the table because it is the result of some aggregation operation. In such a case, what values should I feed for answer_coordinates (I don't know these values; it is a weak supervision setting)? Please respond to my question.