Open SiriusPoint opened 2 months ago
@SiriusPoint any updates? I have the same problem
@CarlosSerrano88, not yet. I am still trying, but I am not getting usable results.
The transformers implementation of Donut seems to have broken saving and loading at some point. Try transformers==4.26.1 and see if that works.
any updates? I have the same problem
+1
With transformers==4.25.1 it works perfectly!
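Before pinning, it may help to confirm which version is actually installed in the training and inference environments. A minimal sketch: the `REPORTED_WORKING` tuple and both helper names are made up for illustration, and the versions listed are only the ones commenters here reported, not an official compatibility list.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions that commenters in this thread reported as working (not an official list).
REPORTED_WORKING = ("4.25.1", "4.26.1")


def installed_version(package: str = "transformers"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_reported(ver, reported=REPORTED_WORKING) -> bool:
    """True if the given version string is one of the reported-working versions."""
    return ver in reported


if __name__ == "__main__":
    ver = installed_version()
    print("transformers:", ver, "| reported working:", matches_reported(ver))
```

If training and inference run in different environments, it is worth running this in both.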
I have trained the Donut model on a custom dataset that is structured along the same lines as the CORD-v2 dataset. Each image has multiple values per line, and each document has around 23 to 24 lines. I used "naver-clova-ix/donut-base" as the base model. I am using 149 documents in total, split as follows:
training = 119 images
validation = 22 images
testing = 8 images
I have created 3 metadata.jsonl files, i.e. one each for train, validation and test. Below is a sample entry from the metadata.jsonl file of the training set:
{"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"bank_stmt_entries\": [{\"TXN_DATE\": \"02-11-2023\", \"TXN_DESC\": \"SB Int: 10-2023:0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"93.00\", \"BALANCE_AMT\": \"10901.92\"}, {\"TXN_DATE\": \"09-12-2023\", \"TXN_DESC\": \"CHRGS- SMS ALERT\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"1.06\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10900.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"Debit Card AMC-2\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"295.00\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10605.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"SB Int: 01-2024: 0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"75,00\", \"BALANCE_AMT\": \"10680.86\"}]}}"}
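A quick sanity check of the metadata.jsonl files can rule out labeling problems. The sketch below assumes each line has the shape shown above, with `ground_truth` being a JSON string that contains `gt_parse`. The function name and the rule that amounts look like `N.NN` are my assumptions, not part of Donut. With that rule, the sample's `"75,00"` would be flagged. Whether that is a typo or matches the document is for you to decide, and it may well be unrelated to the garbage output.

```python
import json
import re

# Assumption: amounts are plain decimals with a dot, e.g. "10901.92".
AMOUNT_RE = re.compile(r"^\d+\.\d{2}$")
AMOUNT_KEYS = ("WITHDRAWAL_AMT", "DEPOSIT_AMT", "BALANCE_AMT")


def check_metadata_line(line: str) -> list:
    """Return a list of human-readable problems found in one metadata.jsonl line."""
    problems = []
    record = json.loads(line)
    if "file_name" not in record:
        problems.append("missing file_name")
    # ground_truth is itself a JSON string, so it has to be decoded a second time.
    try:
        gt = json.loads(record["ground_truth"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return problems + ["ground_truth is missing or not a JSON string"]
    if not isinstance(gt, dict) or "gt_parse" not in gt:
        return problems + ["ground_truth has no gt_parse key"]
    for i, entry in enumerate(gt["gt_parse"].get("bank_stmt_entries", [])):
        for key in AMOUNT_KEYS:
            value = entry.get(key)
            # null amounts are allowed; anything else must be a string like "93.00".
            if value is not None and not (isinstance(value, str) and AMOUNT_RE.match(value)):
                problems.append(f"entry {i}: {key}={value!r} is not in N.NN form")
    return problems


if __name__ == "__main__":
    with open("metadata.jsonl", encoding="utf-8") as fh:
        for line_no, raw in enumerate(fh, start=1):
            for problem in check_metadata_line(raw):
                print(f"line {line_no}: {problem}")
```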
I trained the model for 30 epochs. The resulting values for loss and val_edit_distance are:
loss = 0.03544
val_edit_distance = 0.3443
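For context on the second number: assuming val_edit_distance is a normalized Levenshtein distance between the predicted and the ground-truth sequence (edit distance divided by the length of the longer string, which is my understanding of the referenced notebook, but please verify against your own validation step), a value of 0.3443 would mean roughly a third of the characters still differ on the validation set. A minimal pure-Python sketch of that metric, with function names chosen for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, answer: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical strings."""
    longest = max(len(pred), len(answer))
    return levenshtein(pred, answer) / longest if longest else 0.0
```

Under that assumption, a low training loss combined with a fairly high validation edit distance would be worth looking at separately from the inference problem.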
The following config parameters were used for the training:
When I run prediction on the test dataset, I get the following output (printed by print statements that I added at specific locations):
seq ==>:署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.
43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.
seq after token2json ==>: {'text_sequence': '署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43
.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.'}
ground_truth after json load ==>: {'gt_parse': {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}}
ground_truth ==>: {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}
evaluator ==>: <donut.util.JSONParseEvaluator object at 0x7d697edbfc10>
score ==>: 0
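One thing that stands out in the output above: token2json returned `{'text_sequence': ...}`. As far as I understand Donut's implementation, that is the fallback it uses when the decoded sequence contains no `<s_...>` field tags at all, so the evaluator has nothing to match against the ground-truth keys. The sketch below checks a decoded sequence for such tags and for long runs of a repeated chunk. Both helper names and the `min_repeats` threshold are hypothetical, and the tag regex is an assumption modeled on the `<s_KEY>` convention.

```python
import re

# Assumption: fields are wrapped as <s_KEY>...</s_KEY> in the decoded sequence.
FIELD_TAG_RE = re.compile(r"<s_(.*?)>", re.IGNORECASE)


def find_field_tags(sequence: str) -> list:
    """Return the names of all opening field tags found in a decoded sequence."""
    return FIELD_TAG_RE.findall(sequence)


def has_long_repeat(sequence: str, min_repeats: int = 20) -> bool:
    """True if some chunk of 1-8 characters repeats at least min_repeats times in a row."""
    pattern = re.compile(r"(.{1,8}?)\1{%d,}" % (min_repeats - 1), re.DOTALL)
    return pattern.search(sequence) is not None


if __name__ == "__main__":
    seq = "署" * 30 + "12-3233" + ".43" * 30  # shortened stand-in for the output above
    print("field tags:", find_field_tags(seq))
    print("degenerate repetition:", has_long_repeat(seq))
```

If no tags are found at inference time even though the training loss is low, that would be consistent with the saving/loading problem mentioned in the comments, for example the added special tokens not being present in the tokenizer that is loaded for prediction. That is only a guess until the loaded tokenizer is checked.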
I used the following notebook as a reference: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb
Please help me identify and resolve the issue, and let me know if you need more information.
Thank you in advance