NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Enforce JSON structure with PaliGemma/Donut #433

Open MayStepanyan opened 3 weeks ago

MayStepanyan commented 3 weeks ago

Hi @NielsRogge, thanks for the guides - they're very useful!

I'm experimenting with an image-to-JSON task where I need to extract some fields from the image. I'm using the older approach of adding the possible JSON keys to the tokenizer as special tokens; the newer approach fails for me because the decoder starts making up new keys =)
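For context, the setup I mean looks roughly like this - a minimal sketch, where the checkpoint is the base Donut model and the key names are placeholders for my actual schema:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# One opening and one closing special token per JSON key
# (key names here are placeholders, not my real fields).
new_tokens = ["<s_name>", "</s_name>", "<s_total>", "</s_total>"]
processor.tokenizer.add_tokens(new_tokens)

# Grow the decoder's embedding matrix to cover the new tokens.
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```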

My problem is that the model sometimes outputs nested JSON - something like {key: {another_key: value_of_another_key}, ...} - despite there being no such examples in my training set. Do you have any tips on how I can enforce a specific JSON structure so the model always outputs a non-nested mapping? (I.e. at the token level, enforce that a key token is never followed by another key token instead of a value.)
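To illustrate what I mean by token-level enforcement, something like this custom logits processor is what I have in mind - a rough sketch that assumes the keys were added as special tokens, and ignores edge cases:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class NoNestedKeysProcessor(LogitsProcessor):
    """Mask out every key-opening token whenever the previously generated
    token was itself a key-opening token, so a key can only be followed
    by value tokens (or a closing token), never by another key."""

    def __init__(self, key_token_ids):
        self.key_token_ids = torch.tensor(key_token_ids)

    def __call__(self, input_ids, scores):
        # input_ids: (batch, generated_len), scores: (batch, vocab_size)
        key_ids = self.key_token_ids.to(scores.device)
        last_is_key = torch.isin(input_ids[:, -1], key_ids)  # (batch,)
        forbid = torch.zeros_like(scores, dtype=torch.bool)
        forbid[:, key_ids] = True                 # columns of key tokens
        forbid &= last_is_key.unsqueeze(1)        # only rows after a key
        return scores.masked_fill(forbid, float("-inf"))

# Hypothetical usage with a Donut-style model (token names are placeholders):
# key_ids = processor.tokenizer.convert_tokens_to_ids(["<s_name>", "<s_total>"])
# outputs = model.generate(
#     pixel_values,
#     logits_processor=LogitsProcessorList([NoNestedKeysProcessor(key_ids)]),
# )
```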

I've experimented with Donut and PaliGemma so far, and both tend to have this issue. Intuitively I believe I should just add more training data and/or train for more epochs, but even models I've trained for a couple of GPU-days still have this problem, sadly. I'd appreciate any tips and tricks you can suggest!

P.S. Could you also say more about why you stopped adding the keys to the tokenizer in your latest guides? Did the new approach show better results for your use cases, or is it just for simplicity?

NielsRogge commented 2 weeks ago

Hi,

This is a good question; perhaps you can leverage a framework like Outlines to enforce a given JSON schema. This works by constraining the set of tokens that can be predicted at each time step, so the output is guaranteed to match the schema.
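A minimal sketch of what that could look like with Outlines' JSON generation API - the checkpoint and field names below are just placeholders, and note that Outlines targets text LLMs out of the box, so hooking it into Donut/PaliGemma decoding may need extra wiring (e.g. a custom logits processor):

```python
from pydantic import BaseModel
import outlines

# A flat schema: every field is a top-level string, so nested JSON
# is impossible by construction (field names are placeholders).
class Receipt(BaseModel):
    vendor: str
    date: str
    total: str

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Receipt)
result = generator("Extract vendor, date and total from: ...")
print(result)  # a validated Receipt instance
```

Because the schema only allows top-level string fields, constrained decoding can never emit a nested mapping, regardless of what the model would otherwise generate.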

MayStepanyan commented 2 days ago

Thanks for the tip @NielsRogge, I'll try it out!

As for the second question, could you please expand on why you've stopped adding the JSON keys as special tokens to the tokenizer in your latest guides? Does this work better, or is it for simplicity?

My experiments show that without adding them, the decoder tends to hallucinate additional keys. Would love to get your perspective too.

Thanks!