clovaai / bros


Fine Tuning on Custom Dataset #3

Open siamakzd opened 2 years ago

siamakzd commented 2 years ago

Thank you very much for sharing this great work! I was wondering if there are any instructions on how to prepare a custom dataset for fine-tuning BROS. I understand there is preprocessing code for FUNSD, but summarized instructions would be greatly helpful.

dhkim0225 commented 2 years ago

Input data preprocessing

https://github.com/clovaai/bros/blob/55c52d0872ed61fb7586b70618f45dcb0354f1b2/preprocess/funsd_spade/preprocess.py#L74-L86

  1. The data must have 4-point quadrangle coordinates. If you have rectangle coordinates, transform them into an (8,) shape.
  2. Tokenize each transcription (GT or OCR output) with the BERT tokenizer (a short sketch follows this list). https://github.com/clovaai/bros/blob/55c52d0872ed61fb7586b70618f45dcb0354f1b2/preprocess/funsd_spade/preprocess.py#L31
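
A minimal sketch of these two steps, assuming per-word rectangles and transcriptions as input; the variable names and the `bert-base-uncased` checkpoint are illustrative, not taken from the repo:

```python
# Minimal sketch of the two preprocessing steps above (illustrative only).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def rect_to_quad(x1, y1, x2, y2):
    # (x1, y1, x2, y2) rectangle -> 4-point quadrangle, flattened to shape (8,)
    return [x1, y1, x2, y1, x2, y2, x1, y2]

words = [{"text": "Invoice", "box": (10, 20, 90, 40)}]  # toy OCR/GT output

token_ids, quads = [], []
for w in words:
    quad = rect_to_quad(*w["box"])
    for tid in tokenizer.encode(w["text"], add_special_tokens=False):
        token_ids.append(tid)
        quads.append(quad)  # every sub-token reuses its word's quadrangle
```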

KIE task

Please refer to the code block below. https://github.com/clovaai/bros/blob/55c52d0872ed61fb7586b70618f45dcb0354f1b2/preprocess/funsd_spade/preprocess.py#L96-L116
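
For orientation, here is a rough, hypothetical sketch of what SPADE-style KIE labels conceptually look like. The names `itc_labels` (initial token classification) and `stc_labels` (subsequent token classification) and the exact encoding are assumptions on my part, so check the linked code for the real format:

```python
# Hypothetical sketch of SPADE-style labels for one document (not the repo's exact format).
entities = [
    {"label": "question", "token_ids": [0, 1]},   # tokens forming each entity, in order
    {"label": "answer",   "token_ids": [2, 3, 4]},
]
num_tokens = 5
class_to_id = {"other": 0, "question": 1, "answer": 2}

itc_labels = [0] * num_tokens    # class of each entity's first token, 0 elsewhere
stc_labels = [-1] * num_tokens   # index of the previous token in the same entity, -1 if none
for ent in entities:
    ids = ent["token_ids"]
    itc_labels[ids[0]] = class_to_id[ent["label"]]
    for prev, cur in zip(ids, ids[1:]):
        stc_labels[cur] = prev

print(itc_labels)  # [1, 0, 2, 0, 0]
print(stc_labels)  # [-1, 0, -1, 2, 3]
```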

siamakzd commented 2 years ago

Thank you!

For now I am interested in the token classification task. To clarify, let's say for each document I have the word-level transcriptions, their box coordinates, and an entity label for each word.

Which type of preprocessing should I do? For FUNSD I see there are two types, funsd and funsd_spade. I ran both preprocessing scripts and see that the parse differs between the processed files. I would appreciate it if you could explain conceptually the reason for this difference.

tghong commented 2 years ago

Simply, the funsd preprocessing prepares data for the BIO-tagging method (a sequence-labeling tag per token), while the funsd_spade preprocessing prepares data for the SPADE-style method (classifying each entity's initial token and linking its subsequent tokens).

Since the BIO-tagging approach is common, I recommend using this method first.
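
For example, a minimal, hypothetical sketch of turning word-level entity spans into BIO tags for token classification (label names are illustrative):

```python
# Hypothetical BIO-tagging sketch for token classification (illustrative labels).
words = ["Date", ":", "March", "3"]
entities = [(0, 2, "question"), (2, 4, "answer")]  # (start, end_exclusive, class)

labels = ["O"] * len(words)
for start, end, cls in entities:
    labels[start] = f"B-{cls}"            # first word of the entity
    for i in range(start + 1, end):
        labels[i] = f"I-{cls}"            # remaining words of the entity

print(list(zip(words, labels)))
# [('Date', 'B-question'), (':', 'I-question'), ('March', 'B-answer'), ('3', 'I-answer')]
```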