clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License
5.52k stars 443 forks

Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset) #297

Open SiriusPoint opened 2 months ago

SiriusPoint commented 2 months ago

I have trained the Donut model on a custom dataset that follows the same format as the CORD-v2 dataset. Each image has multiple values per line, and each document has around 23 to 24 lines. I used "naver-clova-ix/donut-base" as the base model. I am using 149 documents, split as follows: training = 119 images, validation = 22 images, testing = 8 images.

I have created 3 metadata.jsonl files, i.e. one each for train, validation and test. Below is a sample entry from the metadata.jsonl file of the training set:

{"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"bank_stmt_entries\": [{\"TXN_DATE\": \"02-11-2023\", \"TXN_DESC\": \"SB Int: 10-2023:0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"93.00\", \"BALANCE_AMT\": \"10901.92\"}, {\"TXN_DATE\": \"09-12-2023\", \"TXN_DESC\": \"CHRGS- SMS ALERT\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"1.06\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10900.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"Debit Card AMC-2\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"295.00\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10605.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"SB Int: 01-2024: 0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"75,00\", \"BALANCE_AMT\": \"10680.86\"}]}}"}
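A quick sanity check on an entry like the one above can be sketched as follows. This is a minimal sketch, assuming each line of metadata.jsonl is a JSON object with `file_name` and a `ground_truth` string that itself decodes to an object containing `gt_parse`, as in the sample. The helper name `check_metadata_line` is hypothetical and not part of Donut, and the sample record is shortened for illustration.

```python
import json


def check_metadata_line(line: str) -> dict:
    """Parse one metadata.jsonl line and return its gt_parse dict.

    Hypothetical helper (not part of Donut): raises ValueError if the
    line does not follow the layout shown in the sample entry.
    """
    record = json.loads(line)
    if "file_name" not in record or "ground_truth" not in record:
        raise ValueError("missing 'file_name' or 'ground_truth'")
    # ground_truth is a JSON document stored as a string, so decode it again.
    ground_truth = json.loads(record["ground_truth"])
    if "gt_parse" not in ground_truth:
        raise ValueError("missing 'gt_parse' in ground_truth")
    return ground_truth["gt_parse"]


# Shortened record in the same shape as the sample entry.
sample = json.dumps({
    "file_name": "IOB_Bank_31_image_0.jpg",
    "ground_truth": json.dumps({
        "gt_parse": {"bank_stmt_entries": [
            {"TXN_DATE": "02-11-2023", "DEPOSIT_AMT": "93.00", "CHEQUE_REF_NO": None}
        ]}
    }),
})
parsed = check_metadata_line(sample)
print(parsed["bank_stmt_entries"][0])
```

Running this over every line of the train, validation and test files would at least rule out malformed entries before training.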

I trained the model for 30 epochs. The resulting values for loss and val_edit_distance are:

loss = 0.03544 val_edit_distance = 0.3443

The following are the config parameters used for the training

When I run prediction on the test dataset, I get the following output (from print statements I added at specific locations):

seq ==>: 署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43
.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.
seq after token2json ==>: {'text_sequence': '署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43
.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.'}
ground_truth after json load ==>: {'gt_parse': {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}}
ground_truth ==>: {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}
evaluator ==>: <donut.util.JSONParseEvaluator object at 0x7d697edbfc10>
score ==>: 0
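The predicted sequence above is dominated by a few short units repeated many times. A rough way to flag such output automatically is sketched below. This is only an illustrative check under assumed thresholds (a unit of 1 to 8 characters repeated at least 10 times in a row); the name `find_repeated_unit` is hypothetical and not part of Donut or transformers.

```python
import re

# A unit of 1-8 characters followed by at least 9 more copies of itself.
# Both bounds are arbitrary assumptions chosen for illustration.
_REPEAT = re.compile(r"(.{1,8}?)\1{9,}")


def find_repeated_unit(text: str):
    """Return the first short unit that repeats back to back, or None."""
    match = _REPEAT.search(text)
    return match.group(1) if match else None


# Shortened stand-in for the predicted sequence printed above.
prediction = "署" * 20 + "12-3233" + ".43" * 30
unit = find_repeated_unit(prediction)
if unit is not None:
    print("degenerate repetition detected, repeated unit:", repr(unit))
```

A non-None result on most test images suggests the decoder is not producing structured tokens at all, which points at the model or loading step rather than at individual annotations.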

I used the following notebook as a reference: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

Please help me identify and resolve the issue, and let me know if you need more information.

Thank you in advance

CarlosSerrano88 commented 2 weeks ago

@SiriusPoint any updates? I have the same problem

SiriusPoint commented 2 weeks ago

@CarlosSerrano88, Not yet. I am trying but not getting appropriate results.

banditgoose commented 1 week ago

The transformers implementation of Donut seems to have broken saving and loading at some point. Try transformers==4.26.1 and see if that works.
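To confirm which version is actually installed in the training or inference environment, something like the following can be used. It is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are hypothetical, and the pinned value is simply the version suggested in this comment.

```python
from importlib.metadata import PackageNotFoundError, version

SUGGESTED_PIN = "4.26.1"  # version suggested in this comment


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed, pinned: str) -> bool:
    """True only when the package is installed at exactly the pinned version."""
    return installed is not None and installed == pinned


current = installed_version("transformers")
if matches_pin(current, SUGGESTED_PIN):
    print("transformers matches the suggested pin:", current)
else:
    print("transformers version is", current, "but the suggested pin is", SUGGESTED_PIN)
```

If the versions differ, installing the pinned release with `pip install transformers==4.26.1` and repeating the run is the experiment suggested here.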

dreamlychina commented 1 week ago

any updates? I have the same problem

+1

CarlosSerrano88 commented 1 week ago

With transformers==4.25.1 it works perfectly!