clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

Data Extraction Fine Tuning #204

Open mattreidy opened 1 year ago

mattreidy commented 1 year ago

I am attempting to fine tune and train a model to extract the fields highlighted in yellow in the sample reports here. My results are not good when testing the model and I'm looking for some guidance on what to do to improve the model other than obtaining more training data. I currently have only about 40 images for training and 5 each for testing and validation (50 total). The ground truth JSON I have defined follows each of the sample report images. My training config is at the end here. I'm running on a single GPU local machine.

I've added the following code to the train.py program, but have no idea if it's correct or needed:

```python
if task_name == "ngs":
    model_module.model.decoder.add_special_tokens([
        "<biomarker_findings/>",
        "<genomic_findings/>",
        "<tumor_mutation_burden/>",
        "<microsatellite_status/>",
        "<form/>",
        "<scientific_publication/>",
        "<scientific_report/>",
    ])
```
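(One way to check whether those tokens actually get registered is to look them up in the decoder's tokenizer. The attribute path below is an assumption based on how the repo's decoder wrapper appears to expose its HuggingFace tokenizer; adjust if your object layout differs.)

```python
# Sanity check that the added tokens are registered (assumes the decoder exposes its
# HuggingFace tokenizer as model_module.model.decoder.tokenizer; adjust if not).
tok = model_module.model.decoder.tokenizer
for t in ["<biomarker_findings/>", "<genomic_findings/>", "<tumor_mutation_burden/>"]:
    tid = tok.convert_tokens_to_ids(t)
    print(t, tid, "registered" if tid != tok.unk_token_id else "NOT registered (still maps to <unk>)")
```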

Any help would be greatly appreciated!

Sample Report 1

17538900-FoundationOne

```json
{
  "microsatellite status": "cannot be determined",
  "tumor mutation burden": "cannot be determined",
  "genomic findings": ["BRCA2", "IDH1", "ARID1A", "TP53"],
  "variants of unknown significance": ["FH", "MAP3K1"]
}
```

Sample Report 2

57217576-FoundationOne

{ "microsatellite status": "STABLE", "tumor mutation burden": 0, "genomic findings": [ "EGFR", "NF1", "MTAP", "BCORL1", "CDKN2A/B" ], "variants of unknown significance": [ "BRCA2", "MAF", "SOX9" ] }

Training Config

```yaml
resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "./result"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from model hub or path)
dataset_name_or_paths: ["/home/mattreidy/projects/ngs/data/ngs_donut"] # loading datasets (from model hub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [960, 1600] # even multiples of 320. original=1700x2200.
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 800
num_training_samples_per_epoch: 8
max_epochs: 5
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 2
gradient_clip_val: 1.0
verbose: True
```

MaxPowerWasTaken commented 1 year ago

Hey Matt,

I'm still coming up to speed on this repo myself, but it looks to me like there's a slight problem with your approach here. This line of train.py indicates to me that it will read your task_name as ngs_donut, based on your config including dataset_name_or_paths: ["/home/mattreidy/projects/ngs/data/ngs_donut"].

So I think you probably have to update your train.py edit from: if task_name == "ngs":

to:

if task_name == "ngs_donut":

Can you let me know here if that helps?
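For reference, the logic I'm referring to is roughly the following (paraphrased from memory rather than quoted, so double-check against train.py itself):

```python
import os

# train.py takes the task name from the last path component of each dataset entry,
# so "/home/mattreidy/projects/ngs/data/ngs_donut" -> task_name == "ngs_donut".
for dataset_name_or_path in config.dataset_name_or_paths:
    task_name = os.path.basename(dataset_name_or_path)
```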

mattreidy commented 1 year ago

MaxPowerWasTaken, thank you very much for taking the time to read and respond to my post - good eye - I did actually change the code after posting, which catches the exact issue you identified. I just modified it by removing the "if" statement and adding my special tokens all the time. I'm still not at all clear about what the tokens do or how to use them... Thanks again. -Matt

MaxPowerWasTaken commented 1 year ago

Hey Matt, thanks for responding, glad to hear you were able to move past your issue. Here's how I think about what a token is, in the context of Transformer/NLP models. Hope it's helpful.

NLP transformer models learn pretty intricate statistical patterns that map either text to text or, in our case with Donut, input images to output text. So what's the atomic unit of a 'text'? That's where tokenizers (and so tokens) come in: a tokenizer splits a text into a list of atomic units of text, or tokens.

A "word tokenizer" breaks up every text into words (e.g. splitting on spaces; with maybe separate tokens for punctuation). A "character tokenizer" breaks every text into single characters (letters + punctuation). But they can be more nuanced too, like a sub-word token that might break "carrying" into ["carry", "ing"] tokens. With Donut, I believe the tokenizer by default is sub-word (might be wrong about that), but more importantly, it lets us add our own tokens as well, for whole-words or even multi-word tokens, that we want the model to also think of as atomic units of text (i.e. key terms).

So e.g. in my case with insurance data, I want a document type called "benefits_table" and a plan attribute called "OutOfPocketMax" to get their own special tokens, so that the model treats them as indivisible terms worth learning patterns about, rather than breaking them into separate ["benefits", "table", "out", "of", "pocket", "max"] tokens and reasoning about each piece on its own.
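As far as I can tell, this is also exactly what Donut does with the keys of your gt_parse JSON: each key becomes a <s_key>...</s_key> token pair, and list items are joined with a <sep/> token when the JSON is linearized into the decoder's target sequence. A paraphrased sketch of that idea (see DonutModel.json2token in donut/model.py for the real implementation):

```python
def json2token_sketch(obj) -> str:
    """Paraphrased sketch of Donut's JSON-to-token-sequence linearization
    (sort_json_key=False case); not the repo's exact code."""
    if isinstance(obj, dict):
        # each JSON key becomes a pair of field tokens: <s_key> ... </s_key>
        return "".join(f"<s_{k}>{json2token_sketch(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        # list items are separated by a <sep/> token
        return "<sep/>".join(json2token_sketch(v) for v in obj)
    return str(obj)

print(json2token_sketch({"tumor mutation burden": 0, "genomic findings": ["EGFR", "NF1"]}))
# <s_tumor mutation burden>0</s_tumor mutation burden><s_genomic findings>EGFR<sep/>NF1</s_genomic findings>
```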

Finally, auto-regressive text-generating models (like Donut) predict the next token (e.g. word or letter) based on: (1) the input they are conditioned on (for Donut, the encoded image), and (2) all of the tokens they have generated so far, one token at a time until an end token is produced.
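So at inference time you seed the decoder with the task's start token and let it generate the rest. A hedged sketch of how that looks with this repo, mirroring how I read test.py (the checkpoint path and image name are placeholders, and the prompt must match the task name, here ngs_donut):

```python
from PIL import Image
from donut import DonutModel

model = DonutModel.from_pretrained("./result/train_ngs/<exp_version>")  # placeholder path
model.eval()

image = Image.open("sample_report.png").convert("RGB")  # placeholder image
output = model.inference(image=image, prompt="<s_ngs_donut>")
print(output["predictions"][0])  # the generated tokens, parsed back into a gt_parse-style dict
```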