NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Enforce JSON structure with PaliGemma/Donut #433

Open MayStepanyan opened 3 weeks ago

MayStepanyan commented 3 weeks ago

Hi @NielsRogge, thanks for the guides - they're very useful!

I'm experimenting with an image-to-JSON task where I need to extract some fields from the image. I'm using the older approach of adding the possible JSON keys to the tokenizer as special tokens; the newer approach fails for me because the decoder starts making up new keys =)
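For context, the setup I mean looks roughly like this - a minimal sketch, where the checkpoint is the base Donut model and the key names are placeholders for my actual schema:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# One opening and one closing special token per JSON key
# (key names here are placeholders, not my real fields).
new_tokens = ["<s_name>", "</s_name>", "<s_total>", "</s_total>"]
processor.tokenizer.add_tokens(new_tokens)

# Grow the decoder's embedding matrix to cover the new tokens.
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```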

My problem is that the model sometimes outputs nested JSON - something like {key: {another_key: value_of_another_key}, ...} - despite there being no such examples in my training set. Do you have any tips on how I can enforce a specific JSON structure so the model always outputs a non-nested mapping? (I.e. at the token level, enforce that a key token is never followed by another key token instead of a value.)
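To illustrate what I mean by token-level enforcement, something like this custom logits processor is what I have in mind - a rough sketch that assumes the keys were added as special tokens, and ignores edge cases:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class NoNestedKeysProcessor(LogitsProcessor):
    """Mask out every key-opening token whenever the previously generated
    token was itself a key-opening token, so a key can only be followed
    by value tokens (or a closing token), never by another key."""

    def __init__(self, key_token_ids):
        self.key_token_ids = torch.tensor(key_token_ids)

    def __call__(self, input_ids, scores):
        # input_ids: (batch, generated_len), scores: (batch, vocab_size)
        key_ids = self.key_token_ids.to(scores.device)
        last_is_key = torch.isin(input_ids[:, -1], key_ids)  # (batch,)
        forbid = torch.zeros_like(scores, dtype=torch.bool)
        forbid[:, key_ids] = True                 # columns of key tokens
        forbid &= last_is_key.unsqueeze(1)        # only rows after a key
        return scores.masked_fill(forbid, float("-inf"))

# Hypothetical usage with a Donut-style model (token names are placeholders):
# key_ids = processor.tokenizer.convert_tokens_to_ids(["<s_name>", "<s_total>"])
# outputs = model.generate(
#     pixel_values,
#     logits_processor=LogitsProcessorList([NoNestedKeysProcessor(key_ids)]),
# )
```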

I've experimented with Donut and PaliGemma so far, and both tend to have this issue. Intuitively I believe I should just add more training data and/or train for more epochs, but even models I've trained for a couple of GPU-days still have this problem, sadly. I'd appreciate any tips and tricks you can suggest!

P.S. Could you also say more about why you stopped adding the keys to the tokenizer in your latest guides? Did the new approach show better results for your use cases, or is it just for simplicity?

NielsRogge commented 2 weeks ago

Hi,

This is a good question; perhaps you can leverage a framework like Outlines to enforce a given JSON schema. This works by constraining the set of tokens that can be predicted at each time step, so the output is guaranteed to match the schema.
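A minimal sketch of what that could look like with Outlines' JSON generation API - the checkpoint and field names below are just placeholders, and note that Outlines targets text LLMs out of the box, so hooking it into Donut/PaliGemma decoding may need extra wiring (e.g. a custom logits processor):

```python
from pydantic import BaseModel
import outlines

# A flat schema: every field is a top-level string, so nested JSON
# is impossible by construction (field names are placeholders).
class Receipt(BaseModel):
    vendor: str
    date: str
    total: str

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Receipt)
result = generator("Extract vendor, date and total from: ...")
print(result)  # a validated Receipt instance
```

Because the schema only allows top-level string fields, constrained decoding can never emit a nested mapping, regardless of what the model would otherwise generate.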

MayStepanyan commented 2 days ago

Thanks for the tip @NielsRogge, I'll try it out!

As for the second question, could you please expand on why you've stopped adding the JSON keys as special tokens to the tokenizer in your latest guides? Does this work better, or is it for simplicity?

My experiments show that without adding them, the decoder tends to hallucinate additional keys. Would love to get your perspective too.

Thanks!