Sharathmk99 opened 4 years ago
Hi,
thank you for your interest in the TAPAS algorithm; that is actually a really good question. The maintainers of Transformers are currently finishing the creation of `TapasTokenizer`, which prepares the data for the model. Once this is finished, I will create an entire tutorial on how to fine-tune TAPAS on your own data, as well as add more documentation. Here is a preliminary draft of that:
First, it's important to note that if your dataset is rather small (hundreds of training examples), it's advised to start from an already fine-tuned checkpoint of `TapasForQuestionAnswering`. There are 3 different ways in which one can fine-tune an already fine-tuned `TapasForQuestionAnswering` checkpoint, corresponding to the different datasets on which TAPAS was fine-tuned, and you should pick one of them:
To summarize:

| Task | Example datasets | Description |
|---|---|---|
| Conversational | SQA | Conversational set-up, only cell-selection questions |
| Weak supervision for aggregation | WTQ, WikiSQL | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the answer and the gold aggregation operator |
If you want to fine-tune the classification heads of `TapasForQuestionAnswering` from scratch, then you can actually experiment, and you don't have to choose between one of these 3 options. You can define any hyperparameters you want when initializing `TapasConfig`, and then create a `TapasForQuestionAnswering` based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way.
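As a sketch of that last case (the argument names below are real `TapasConfig` parameters, but the values are illustrative, not a recommendation):

```python
from transformers import TapasConfig, TapasForQuestionAnswering

# Hybrid set-up sketch: aggregation supervised only through the answer
# (weak supervision), with 4 aggregation operators (NONE, SUM, AVERAGE,
# COUNT). Values are illustrative; tune them for your own dataset.
config = TapasConfig(
    num_aggregation_labels=4,
    use_answer_as_supervision=True,
)

# The classification heads are initialized from scratch based on this config.
model = TapasForQuestionAnswering(config)
```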
Second, regardless of which of the three options you picked above, you should prepare your dataset in the SQA format. This format is a TSV file with the following columns:
- `id`: id of the table-question pair, for bookkeeping purposes. Set to zero if you don't need this.
- `annotator`: id of the person who annotated the table-question pair, for bookkeeping purposes. Set to zero if you don't need this.
- `position`: an integer indicating whether the question is the first, second, third, ... related to the table. Only required in case of the conversational set-up (SQA). You don't need this column in case you're going for WTQ/WikiSQL/WikiSQL-supervised.
- `question`: string
- `table_file`: string, name of a CSV file containing the tabular data
- `answer_coordinates`: list of tuples (each tuple being a cell coordinate, i.e. a row, column pair that is part of the answer)
- `answer_text`: list of strings (each string being a cell value that is part of the answer)
- `aggregation_label`: only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
- `answer_float`: the float answer to the question. Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)

If you go for the first case (conversational set-up), it should look like this:
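To make the format concrete, here is a minimal sketch of writing one such row with pandas. The file name and values are made up for illustration; the list-valued columns are stored as their string representations, as in the SQA files:

```python
import pandas as pd

# Illustrative only: one conversational (SQA-style) table-question pair.
rows = [{
    "id": 0,            # bookkeeping; zero if unused
    "annotator": 0,     # bookkeeping; zero if unused
    "position": 0,      # first question about this table
    "question": "what boats were lost on may 5",
    "table_file": "table_csv/203-386.csv",
    "answer_coordinates": "['(1, 1)', '(2, 1)']",
    "answer_text": "['U-638', 'U-531']",
}]
pd.DataFrame(rows).to_csv("train.tsv", sep="\t", index=False)
```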
This is from the SQA development set, so we don't have the `aggregation_label` and `answer_float` columns here. Also note that there are `id` and `annotator` columns to indicate which person annotated each question; this is only for bookkeeping purposes.
Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ and WikiSQL) into the SQA format. The author explains this here. This is actually interesting, because these conversion scripts are not perfect, meaning that WTQ and WikiSQL results could actually be improved.
Third, given that you've prepared your data in this TSV format (and corresponding CSV files containing the tabular data), you can then use `TapasTokenizer` to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, `TapasForQuestionAnswering` needs different things to be fine-tuned:
| Task | Required inputs |
|---|---|
| Conversational | `input_ids`, `attention_mask`, `token_type_ids`, `label_ids` |
| Weak supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `label_ids`, `numeric_values`, `numeric_values_scale`, `answer_float` |
| Strong supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `label_ids`, `aggregation_labels` |
Suppose that we want to do this for SQA; you can just do the following:

```python
from transformers import TapasTokenizer
import pandas as pd

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-sqa")

table = pd.read_csv("table_csv/203-386.csv")
question = "what boats were lost on may 5"
answer_coordinates = ['(1, 1)', '(2, 1)']
answer_text = ['U-638', 'U-531']

inputs = tokenizer(table, question, answer_coordinates, answer_text)
```
Note that this is currently not possible, because `TapasTokenizer` is undergoing some changes (the call method is being implemented) and the checkpoints are not yet in the HuggingFace model hub. It's also possible that we will not use pandas to provide tables to the tokenizer, but rather a HuggingFace dataset.
You can then fine-tune `TapasForQuestionAnswering` using native PyTorch as follows (I hope you're familiar with PyTorch; if not, I highly recommend this tutorial):

```python
from transformers import TapasForQuestionAnswering

model = TapasForQuestionAnswering.from_pretrained("tapas-base-finetuned-sqa", return_dict=True)

outputs = model(**inputs)
loss = outputs.loss
loss.backward()
```
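For completeness, the forward/backward step above would normally sit inside a standard PyTorch loop with an optimizer; a minimal sketch, assuming a `train_dataloader` that yields batches of tensors with the keys listed in the table above:

```python
import torch

def train(model, train_dataloader, num_epochs=2, lr=5e-5):
    # Generic fine-tuning loop; `train_dataloader` is assumed to yield
    # dicts of tensors as produced by TapasTokenizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            optimizer.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
```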
In the future, it might be possible to use the `Trainer` class to train TAPAS, which will make it much easier. This is also WIP.
I will update this once there is more progress!
Hi @NielsRogge,

This is an amazingly detailed explanation of how to fine-tune. Thank you for taking the time to detail the steps.

Yes, I'm familiar with PyTorch. I'm waiting for the `TapasTokenizer` implementation so that I can start fine-tuning. In the meantime, I'll prepare the dataset in TSV format.

Please update once I can use your transformers repo.
Hi @NielsRogge,

I have some code on the WTQ utils (which, if possible, you can use at least as a reference). We discussed the same here.

I'd like to open a PR for this, so I created a new branch `modeling_tapas_v3` and pushed, but I get this error:

```
ERROR: Permission to NielsRogge/transformers.git denied to shashankMadan-designEsthetics.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
```
Thanks for your contribution. Where can I find your code? I looked on your GitHub but can't find it.
@NielsRogge I haven't uploaded it yet because I keep getting an access-denied error on your forked repo:

```
ERROR: Permission to NielsRogge/transformers.git denied to shashankMadan-designEsthetics.
fatal: Could not read from remote repository.
```

This is what it gives me...
You can only make a PR if you upload your code to your own GitHub, because a PR is just a comparison between 2 branches. The PR will then compare your branch (from your account) to my branch (from my account).
Understood, and sorry for the hassle. I'll fork the repo and create a PR (on your repo).
Hi @NielsRogge,

Here is the link to my repo. It's basically just an integration of the WTQ utils from the original TAPAS code.

The new files added are: `tapas_file_utils`, `tapas_task_utils_test`, `tapas_task_utils`, `tapas_text_utils`, `tapas_wtq_utils`.
Do you guys have any script that can populate the answer coordinates in my custom dataset? @NielsRogge @Sharathmk99 @shashankMadan-designEsthetics I looked at the HuggingFace training example, but it says that we need to follow the logic in the original TAPAS repo to populate the answer coordinates in our custom datasets. When I looked at the original code here, it was quite hard for me to follow, especially because the outputs would be in TFRecords. However, I want to use the HuggingFace code to fine-tune the model, since it's easier to edit and manipulate. So the TFRecords created from there won't be useful, and I need TSV files.
Thanks,
Hi @AhmedMasryKU,
I've created a script that allows you to do this. I've updated this repo, check the README for more info :)
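For intuition, a naive version of such answer-coordinate population can be done with exact string matching over the table cells. This is a simplified sketch, not the actual script from the repo (which handles normalization and more cases):

```python
def find_answer_coordinates(table_rows, answer_texts):
    # table_rows: list of rows, each a list of cell values (header excluded).
    # Returns (row, column) pairs of cells whose text exactly matches an
    # answer string (case-insensitive, whitespace-stripped).
    coordinates = []
    for answer in answer_texts:
        for row_idx, row in enumerate(table_rows):
            for col_idx, cell in enumerate(row):
                if str(cell).strip().lower() == str(answer).strip().lower():
                    coordinates.append((row_idx, col_idx))
    return coordinates
```

Note that exact matching necessarily fails for answers that are the result of aggregation (there is no matching cell), which is precisely the weak-supervision case where `answer_float` is used instead.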
Thanks so much!
@NielsRogge Do you have any code snippet to convert the logits and aggregation logits to an actual answer? I am using your HuggingFace code, but the model only outputs the logits and `logits_aggregation`, so I am quite confused about how to infer the final answer from these logits.
Yes, I've created a function especially for this; it's called `convert_logits_to_predictions`, see here: https://huggingface.co/transformers/model_doc/tapas.html#transformers.TapasTokenizer.convert_logits_to_predictions
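For reference, a small sketch of how that function might be used, assuming `inputs` comes from `TapasTokenizer` and `outputs` from `TapasForQuestionAnswering` as in the earlier snippets. The operator mapping below assumes a 4-label aggregation head (as in the WTQ set-up) and is illustrative:

```python
# Illustrative id -> operator mapping for a 4-label aggregation head.
id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}

def decode_predictions(tokenizer, inputs, outputs):
    # convert_logits_to_predictions takes the tokenizer inputs plus the
    # (detached) cell-selection and aggregation logits, and returns the
    # predicted answer coordinates and aggregation operator indices.
    coordinates, aggregation_indices = tokenizer.convert_logits_to_predictions(
        inputs,
        outputs.logits.detach(),
        outputs.logits_aggregation.detach(),
    )
    return coordinates, [id2aggregation[i] for i in aggregation_indices]
```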
@NielsRogge In the comment above you have provided an example for the strong supervision format. Can you please tell me what the format for weak supervision should be?

I have set `answer_coordinates`, `answer_text` and `aggregation_label` to None, as I don't have that kind of ground truth, and `answer_float` to the answer I have. Do I need `answer_coordinates` to run the code? What if I don't have those as part of my ground truth?
@NielsRogge Can you please let me know how to feed data to HuggingFace TAPAS for queries of the aggregation type, where I have a float answer that is not part of the table because it is the result of some aggregation operation? In such a case, what values should I feed for the `answer_coordinates` (I don't know these values; it is a weak-supervision setting)? Please respond to my question.
Hi,
I saw your PR on TAPAS in the HuggingFace Transformers repository. I wanted to check if I can use your code to fine-tune TAPAS on my own data. If yes, can you provide a sample that I can follow?