NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Pix2Struct: ERROR Fine-tuning overfitting on a single image #293

Closed: arvisioncode closed this issue 1 year ago

arvisioncode commented 1 year ago

Hi

I have followed your tutorial at https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Pix2Struct to fine-tune Pix2Struct from the base model "google/pix2struct-base", changing only the dataset to FUNSD-based ones: "arvisioncode/donut-funsd" and "SotiriosKastanas/difffunsd".

The transformers version I am using is 4.30.0.dev0, which is the version installed by: !pip install -q git+https://github.com/huggingface/transformers.git
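
For reference, here is a minimal sketch of the fine-tuning setup I used, following the structure of the notebook (the "image" and "text" column names and the hyperparameters are illustrative, not exact):

import torch
from datasets import load_dataset
from transformers import AutoProcessor, Pix2StructForConditionalGeneration

# Minimal sketch of the fine-tuning loop; the "image"/"text" column names
# are assumptions about the dataset schema, and the hyperparameters are illustrative.
processor = AutoProcessor.from_pretrained("google/pix2struct-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")

dataset = load_dataset("arvisioncode/donut-funsd", split="train")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

num_epochs = 200
for epoch in range(num_epochs):
    for item in dataset:
        # encode the page image into flattened patches
        encoding = processor(images=item["image"], return_tensors="pt")
        # tokenize the Donut-style target string as the labels
        labels = processor.tokenizer(item["text"], return_tensors="pt").input_ids
        outputs = model(flattened_patches=encoding.flattened_patches,
                        attention_mask=encoding.attention_mask,
                        labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()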

The problem is that after running several fine-tunings for 200 epochs (dataset: "SotiriosKastanas/difffunsd") and 1000 epochs (dataset: "arvisioncode/donut-funsd"), I have found that inference with these models always gives the same result, regardless of the input image:

from PIL import Image
from transformers import AutoProcessor, Pix2StructForConditionalGeneration

processor = AutoProcessor.from_pretrained("arvisioncode/pix2struct-funsd-200ep_v2")
model = Pix2StructForConditionalGeneration.from_pretrained("arvisioncode/pix2struct-funsd-200ep_v2")

# TEST 1
url = "82092117.png"
# TEST 2 (uncomment to run the second test instead)
# url = "00920294.png"

# load the local test image and encode it into flattened patches
image = Image.open(url)
inputs = processor(images=image, return_tensors="pt", add_special_tokens=False)

generated_ids = model.generate(**inputs, max_new_tokens=200)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

Test 1: <s_DATE:> 8/ 13/ 93</s_DATE:><s_MANUFACTURER:> AMERICAN TOBACCO COMPANY</s_MANUFACTURER:><s_cc:> R. D. Hammer</s_cc:><s_REPORTED BY:> A. REID, DIVISION MANAGER, SAN FRANCISCO, CA</s_REPORTED BY:><s_BRAND NAME:> SPECIAL 10 s</s_BRAND NAME:><s_OTHER INFORMATION:> SEE ATTACHED COPY OF CIRCULAR NO. 4848</s_OTHER INFORMATION:>

Test 2: <s_DATE:> 8/ 13/ 93</s_DATE:><s_MANUFACTURER:> AMERICAN TOBACCO COMPANY</s_MANUFACTURER:><s_cc:> R. D. Hammer</s_cc:><s_REPORTED BY:> A. REID, DIVISION MANAGER, SAN FRANCISCO, CA</s_REPORTED BY:><s_BRAND NAME:> SPECIAL 10 s</s_BRAND NAME:><s_OTHER INFORMATION:> SEE ATTACHED COPY OF CIRCULAR NO. 4848</s_OTHER INFORMATION:>

In fact, this output does not correspond to either of the two images: the data it contains appears in neither of them. That is why I think the model has overfit to a single training example.
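
One quick way to localize the problem (a sketch reusing the processor and model loaded above, plus the two test images) is to check whether the vision encoder at least produces different outputs for different pages; if it does, the decoder must be ignoring them:

import torch
from PIL import Image

# Compare the encoder representations of the two test pages. If the model
# truly ignores the input image, either these statistics are (near-)identical,
# or the decoder is disregarding the cross-attention over them.
for path in ["82092117.png", "00920294.png"]:
    image = Image.open(path)
    enc = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        encoder_outputs = model.get_encoder()(
            flattened_patches=enc.flattened_patches,
            attention_mask=enc.attention_mask,
        )
    print(path, encoder_outputs.last_hidden_state.mean().item())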

Do you know what could be happening and how I can fix it?

NielsRogge commented 1 year ago

Hi,

Thanks for reporting; this should now be fixed.