allenai / unifiedqa

UnifiedQA: Crossing Format Boundaries With a Single QA System
https://arxiv.org/abs/2005.00700
Apache License 2.0

low loss in fine tuning but generated answers are not correct #46

Closed cnut1648 closed 2 years ago

cnut1648 commented 2 years ago

Hi, I am fine-tuning on a QA dataset using the Hugging Face UnifiedQA v2 T5-large model, and sample code is below:

# training
model_inputs = self.tokenizer(
    questions,
    padding=True, truncation=True,
    max_length=self.tokenizer.model_max_length, return_tensors="pt").to(device)
with self.tokenizer.as_target_tokenizer():
    labels = self.tokenizer(
        answers,
        padding=True, truncation=True,
        max_length=self.tokenizer.model_max_length, return_tensors="pt").to(device)
# replace pad tokens in the labels with -100 so they are ignored by the loss
labels["input_ids"][labels["input_ids"] == self.tokenizer.pad_token_id] = -100
model_inputs["labels"] = labels["input_ids"]
outputs = self.model(**model_inputs)
loss = outputs.loss

# generate
model_inputs = self.tokenizer(questions, 
                padding=True, truncation=True, 
                max_length=self.tokenizer.model_max_length, return_tensors="pt").to(device)

sampled_outputs = self.model.generate(**model_inputs, 
                num_beams=4, max_length=50, early_stopping=True)

I can get a fairly low loss (0.41) after fine-tuning for around 5 epochs, yet the generated answers are mostly wrong (0.23 accuracy). According to the T5 docs, generate seems to handle prepending the pad (decoder start) token itself. Also, the generated answers do belong to one of the choices; they are just not the correct ones. I am wondering what might be the issue. Thanks!
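For context, accuracy here means exact match between the decoded generations and the gold answer strings, roughly like the sketch below (variable names are illustrative):

# decode the beam-search outputs and score by exact match (illustrative sketch)
predictions = self.tokenizer.batch_decode(sampled_outputs, skip_special_tokens=True)
accuracy = sum(p.strip().lower() == a.strip().lower()
               for p, a in zip(predictions, answers)) / len(answers)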

danyaljj commented 2 years ago

I am not sure; this is certainly not a common issue. Do you observe similar issues when you use older models (e.g., https://huggingface.co/allenai/unifiedqa-t5-large) or the vanilla T5?

cnut1648 commented 2 years ago

@danyaljj Thanks for the reply. I am currently training an older unifiedqa model and will update the results when it's ready. Also, I found that even generating answers on the training set gives pretty poor results (0.4 accuracy). My questions look like this:

What is ... \\n (A) answer A (B) answer B ... \\n context

(similar to the RACE example in the demo), and the answer is answer A. Everything is mapped to lowercase. Although the bart example has a flag that prepends a bos token to both question and answer, I chose not to prepend it since T5 has no bos token. Do you think I have made a mistake here? Thanks again!
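For reference, I build each input roughly as in the sketch below (names are just illustrative; the "\\n" separator and lowercasing follow the examples in the repo README):

def format_example(question, choices, context):
    # UnifiedQA-style input: question \n (a) ... (b) ... \n context, all lowercased
    letters = "ABCDEFGH"
    choice_str = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} \\n {choice_str} \\n {context}".lower()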

Edit: after digging into examples of how to fine-tune T5 (for example here), it seems that to fine-tune vanilla T5 we need to append </s> to both the input and the label. I am wondering whether that is still required for unifiedqa.
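For what it's worth, my understanding is that the HF T5 tokenizer already appends </s> (the eos token) automatically, so appending it by hand may not be needed; a quick check along these lines:

# sanity check (my assumption): the tokenizer adds </s> on its own
ids = self.tokenizer("which is best conductor? \\n (a) iron (b) feather")["input_ids"]
assert ids[-1] == self.tokenizer.eos_token_id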

danyaljj commented 2 years ago

I am wondering whether that is still required for unifiedqa.

I am not sure -- our models were originally trained with TensorFlow, so I am not aware of any HF-specific quirks. There might also be issues/bugs in HF, so you may want to try different versions.

One thing that I should add is that the v2 models are pretty new and might have issues that I am unaware of. So I would strongly recommend starting your experiments with the older models.

danyaljj commented 2 years ago

You can also compare the predictions of the "large" model here: https://unifiedqa.apps.allenai.org/

cnut1648 commented 2 years ago

Thanks @danyaljj! After a week of attempts I think I have solved the problem. In my case, fine-tuning for more epochs worked: previously I fine-tuned for either 5 or 10 epochs and got 0.23 accuracy, whereas fine-tuning for 50 epochs gets me 0.72 accuracy. I am wondering whether you also fine-tuned for many epochs in your paper? Thanks!!
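Roughly, the longer run looks like the sketch below (train_one_epoch and evaluate_exact_match are hypothetical helpers, just to illustrate tracking dev accuracy instead of loss):

best_acc = 0.0
for epoch in range(50):
    train_one_epoch(self.model, train_loader)            # hypothetical helper
    acc = evaluate_exact_match(self.model, dev_loader)   # hypothetical helper
    if acc > best_acc:
        best_acc = acc
        self.model.save_pretrained(f"best-epoch{epoch}")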

danyaljj commented 2 years ago

We did not track "epochs". We trained the models for several hundred "steps", but our data was extremely large (on the order of millions of examples).

cnut1648 commented 2 years ago

Oh I see. Regardless, I think the lesson I learned is that when performance is not correlated with the loss, we can give unifiedqa more training epochs/steps. Thank you for the help all the way, @danyaljj!!