dhlab-epfl / dhSegment

Generic framework for historical document processing
https://dhlab-epfl.github.io/dhSegment
GNU General Public License v3.0

How to use multilabel prediction type? #25

Closed duchengyao closed 5 years ago

duchengyao commented 5 years ago

When I change prediction_type from 'CLASSIFICATION' to 'MULTILABEL', I get:

result.shape[1] > 3, "The number of columns should be greater in multi-label framework"

So how do I use multi-label?

Thanks!

solivr commented 5 years ago

Hi @duchengyao , I've added a section in the documentation for the multilabel classification. Let me know if it is still unclear.

Rami6786 commented 3 years ago

I'm also facing the above issue. It's still unclear to me; could you please provide more info?

tralfamadude commented 3 years ago

I ended up using classification instead of multilabel. If your annotations do not overlap, then use classification mode. It works for up to 7 labels (that number is hardwired into the model).

The way multilabel works is you have to deal with overlaps of annotations. Each overlap combination essentially makes a new kind of classification.
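
For reference, a sketch of a CLASSIFICATION-mode classes.txt as I understand the format from the docs: one R G B row per class with no attribution-code columns (the colors here are arbitrary examples):

```
0 0 0      # background
255 0 0    # class 1
0 255 0    # class 2
0 0 255    # class 3
```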

Rami6786 commented 3 years ago

In my case, it is overlapping. For example, I need to annotate tables, headers, and sub-headers, and the headers and sub-headers are inside the table. How can we annotate in this case? Should we use CLASSIFICATION or MULTILABEL?

tralfamadude commented 3 years ago

You definitely have multilabel then.

Rami6786 commented 3 years ago

Thank you for your response. I'm using the below 5 colors in the image annotation:

black for background, red for table, yellow for header, green for sub-header, blue for title of the page.

I've updated my classes.txt file as below. Is it correct? And to use multilabel, should we change any method?

0 0 0 0 0 0 0 0
255 0 0 1 0 0 0 0
255 255 0 0 1 0 0 0
0 255 0 0 0 1 0 0
0 0 255 0 0 0 1 0

tralfamadude commented 3 years ago

That does not look right. You should have some bits that correspond to overlap regions. I assume header and subheader always overlap with table, so the header and subheader colors need bit masks with two bits set. In your example, all your bit masks (referred to in the docs as the attribution code) have only 1 bit set.

Another point: the number of bits in the bit mask is the number of primary labels/classes (table, header, subheader). That means you need 3 mask bits, not 5. The documentation implies it is not necessary to represent all possible bit mask combinations.

Assuming that a header and subheader must always fully overlap a table, and these colors: table = red, header = green, subheader = blue.

This would work:

0 0 0 0 0 0      # background
255 0 0 1 0 0    # table
0 255 0 1 1 0    # header
0 0 255 1 0 1    # subheader

Notice the only colors are R, G, and B. The colors are arbitrary and only need to be distinct for each combination you need.

Look at demo.py and the labels plane of the output; that slice will have integers representing a predicted label for each pixel. Somehow that will decode into table/header/subheader; perhaps it will use an int in the labels plane to represent the bit mask values: 4, 6, 5 for table, header, subheader.
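
If it helps, here is a minimal sketch of how one might decode a MULTILABEL output, assuming the model exposes a per-class probability map of shape (height, width, n_classes). I'm not asserting this is exactly what demo.py returns; the array below is a hypothetical stand-in:

```python
import numpy as np

# probs: per-class probability map, shape (H, W, n_classes),
# with channels ordered (table, header, subheader) as in classes.txt.
# Hypothetical example array; in practice this comes from the model output.
probs = np.random.rand(776, 746, 3)

# In a multilabel setting each channel is thresholded independently,
# so one pixel can belong to several classes at once.
masks = probs > 0.5  # (H, W, 3) boolean per-class masks

table_mask = masks[:, :, 0]
header_mask = masks[:, :, 1]
subheader_mask = masks[:, :, 2]

# If headers always overlap tables, header-only pixels should be rare.
print(np.logical_and(header_mask, ~table_mask).sum(), "header-only pixels")
```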

Rami6786 commented 3 years ago

Got it, thank you so much for the detailed information. I have 2 queries:

  1. I have used yellow for the header, so can I use 255 255 0 1 1 0 instead of 0 255 0 1 1 0? Or would you recommend using only red, green, and blue?

  2. I also need one more color for the title of the page. What color would you recommend, and what would the mask bits be?

It would be very helpful if you could assist with this. Thank you.

Rami6786 commented 3 years ago

For reference, I've attached the sample below for headers with sub-headers (red for table, yellow for header, green for sub-header). ...

Original image: [attachment: multiple2]

Annotated image: [attachment: multiple2]

tralfamadude commented 3 years ago

Choice of color does not matter; you just need to make sure each bit mask combo has a unique color. For example, if we add title as a label, then this would work:

0 0 0 0 0 0 0        # background
255 0 0 1 0 0 0      # table
0 255 0 1 1 0 0      # header
0 0 255 1 0 1 0      # subheader
0 128 255 0 0 0 1    # title
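
A quick sanity-check sketch for a file like that, assuming whitespace-separated rows of R G B followed by the attribution code, with # comments allowed (adjust if the actual parser differs):

```python
import numpy as np

# np.loadtxt skips '#' comments by default; each row is R G B + mask bits.
rows = np.loadtxt("classes.txt")
colors = rows[:, :3].astype(int)
codes = rows[:, 3:].astype(int)

# Every color and every attribution code should be unique.
assert len(np.unique(colors, axis=0)) == len(colors), "duplicate color"
assert len(np.unique(codes, axis=0)) == len(codes), "duplicate attribution code"

for color, code in zip(colors, codes):
    print(color, "->", "".join(map(str, code)))
```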

Rami6786 commented 3 years ago

Got it, thank you. Should I make any modifications to the demo.py file? And does the annotated image above look fine?

tralfamadude commented 3 years ago

demo.py will need plenty of modification for postprocessing. You can see what I did in:

https://github.com/tralfamadude/dhSegment/blob/master/ia_predict.py https://github.com/tralfamadude/dhSegment/blob/master/ia_postprocess.py

Look at what I did for debug mode: I saved the probability maps, _rect.jpg (which has the predicted rectangles), and __boxes.jpg; those might help you see what is going on. You could OCR the predicted rectangles, for instance.

In my case, the OCR is already done and put into hOCR format, so I use the rectangle coordinates to extract text from that.
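
Roughly, that extraction looks like the sketch below, assuming standard hOCR with ocrx_word spans carrying bbox coordinates (the function name is illustrative, not from my actual code):

```python
from bs4 import BeautifulSoup

def words_in_rect(hocr_path, x0, y0, x1, y1):
    """Return hOCR words whose bounding box lies inside (x0, y0, x1, y1)."""
    with open(hocr_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    words = []
    for span in soup.find_all("span", class_="ocrx_word"):
        # hOCR stores coordinates as: title="bbox wx0 wy0 wx1 wy1; ..."
        bbox = span["title"].split(";")[0].split()[1:]
        wx0, wy0, wx1, wy1 = map(int, bbox)
        if wx0 >= x0 and wy0 >= y0 and wx1 <= x1 and wy1 <= y1:
            words.append(span.get_text())
    return " ".join(words)
```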

Rami6786 commented 3 years ago

Thank you so much @tralfamadude , I'll look into that.

Rami6786 commented 3 years ago

@tralfamadude Hi, what ratio of images should we maintain between training and evaluation?

For example, if I have 300 images and labels, can I keep 200 in the train folder and 100 for evaluation?

tralfamadude commented 3 years ago

Using 80% train is normal. What really matters is performance on withheld examples (often called the test set).

Rami6786 commented 3 years ago

Does 'test set' mean the remaining 20%?

tralfamadude commented 3 years ago

Example: train: 160, eval: 40, test: 100 (withheld)

Terminology (eval vs. test) is not consistent in the field, so I use 'withheld' for the set that is not part of the training loop. The eval set is part of the training loop even though it is not trained upon; by virtue of being the measure of training accuracy, it is possible to overfit on the combined eval+training sets. A withheld test set lets you check generalization.

The dhSegment demo.py can be used to run the withheld test set through the trained model. If you look at my fork of dhSegment, see ia_predict.py, which shows more about post-processing.
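
A minimal sketch of that kind of three-way split (the 160/40/100 counts are just the example above; the file names are hypothetical):

```python
import random

# Hypothetical list of annotated page images (300 files, as in the example).
filenames = [f"page_{i:03d}.jpg" for i in range(300)]

random.seed(0)  # reproducible split
random.shuffle(filenames)

# Withhold 100 first, then split the rest 80/20 into train/eval,
# matching the train: 160 / eval: 40 / test: 100 example above.
withheld = filenames[:100]
rest = filenames[100:]
n_train = int(0.8 * len(rest))
train, eval_set = rest[:n_train], rest[n_train:]

print(len(train), len(eval_set), len(withheld))  # 160 40 100
```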

Rami6786 commented 3 years ago

Thank you for detailed info..

Got it. It is input images that we are giving for testing, right?

tralfamadude commented 3 years ago

I used https://github.com/tralfamadude/dhSegment/blob/master/ia_predict.py in two phases: post-model training vs. production. Vision needs plenty of post-processing, and in my case I need to extract text conditional on 2 classifications being present on the same page/image. To do that, I used a post-model decision tree in a stacked approach. In post-model training, the training+eval sets are the X for page type Y, and that trains the decision tree. In production mode, the decision tree is used to direct post-processing (what actions to take for each page type).
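
Roughly, the stacked step looks like the sketch below (scikit-learn, with per-page features derived from the segmentation output; the feature names and page types here are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-page features from the segmentation output:
# [table_area_fraction, header_count, subheader_count]
X = [
    [0.60, 1, 2],   # page with a table plus headers
    [0.00, 0, 0],   # plain text page
    [0.55, 1, 0],   # table without subheaders
    [0.02, 1, 0],   # header only
]
y = ["table_page", "text_page", "table_page", "header_page"]

# Train the post-model decision tree on pages with known types.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# In production, the predicted page type directs post-processing.
page_type = clf.predict([[0.58, 1, 1]])[0]
print(page_type)
```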

tralfamadude commented 3 years ago

> input images that we are giving for testing

Yes, the U-shaped NN dhSegment uses is trained for a pixel-to-pixel mapping. Post-processing is then used to make something from that.

Rami6786 commented 3 years ago

Got it, thank you so much.

What are the best training parameters I can use to improve accuracy? I tried n_epochs = 30 and n_epochs = 60, but I'm not getting good accuracy for table headers.

Below is the config file; what other parameters can I change?

{ "training_params" : { "learning_rate": 5e-5, "batch_size": 1, "make_patches": false, "training_margin" : 0, "n_epochs": 30, "data_augmentation" : true, "data_augmentation_max_rotation" : 0.2, "data_augmentation_max_scaling" : 0.2, "data_augmentation_flip_lr": true, "data_augmentation_flip_ud": true, "data_augmentation_color": false, "evaluate_every_epoch" : 10 }, "pretrained_model_name" : "resnet50", "prediction_type": "MULTILABEL", "train_data" : "myfolder/train/", "eval_data" : "myfolder/val_a1", "classes_file" : "myfolder/train/classes.txt", "model_output_dir" : "page_model", "gpu" : "" }

tralfamadude commented 3 years ago

I have tried varying the number of epochs, but the default has been best for me. In general, if your accuracy needs improving, then get more training data.

Rami6786 commented 3 years ago

Oh OK, what about other parameters such as batch_size, data_augmentation_flip_lr, evaluate_every_epoch, etc.? Can they all be left at their defaults, so that I only focus on getting more training data to improve accuracy?

tralfamadude commented 3 years ago

I find that more training data will give the best improvement, but you can try variations of batch size, etc. in a grid search for best parameters. Let me know what you find out.
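
A sketch of what such a grid search could look like: generate one config per parameter combination and launch a training run for each. I'm assuming a sacred-style `python train.py with <config>` invocation and the config path from above; adjust to your setup:

```python
import itertools
import json
import subprocess

with open("myfolder/config.json") as f:  # the config shown above
    base = json.load(f)

# Hypothetical grid; add whatever parameters you want to vary.
grid = {
    "learning_rate": [5e-5, 1e-4],
    "batch_size": [1, 2],
}

for i, values in enumerate(itertools.product(*grid.values())):
    cfg = dict(base)
    cfg["training_params"] = dict(base["training_params"])
    cfg["training_params"].update(zip(grid.keys(), values))
    cfg["model_output_dir"] = f"page_model_grid_{i}"
    path = f"config_grid_{i}.json"
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    # One training run per parameter combination.
    subprocess.run(["python", "train.py", "with", path], check=True)
```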

Rami6786 commented 3 years ago

Thank you so much for your replies @tralfamadude; they are very useful to me.

Sure, I'll try some variations and let you know.

Rami6786 commented 3 years ago

@tralfamadude After increasing the number of images, the model only rarely trains successfully; most of the time it stops with the error below, after running for more than 2 hours.

Any idea about this issue?

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,4,776,747] vs. [1,4,776,746]
    [[node sigmoid_xentropy_loss/per_pixel_loss/mul (defined at /PDF_Backend/Dh_segment/dh_segment/estimator_fn.py:119) ]]

tralfamadude commented 3 years ago

I have not seen that error. You should post the stack trace as a new issue and tag @SeguinBe, who has been very helpful. It seems like an internal error, since the system can handle mixed image sizes.
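
One thing that might be worth checking before filing the issue (just a guess, since the shapes differ by one pixel in width): that every image and its label mask have exactly the same dimensions. A quick sketch, assuming the usual images/ and labels/ layout with matching filenames:

```python
import os
from PIL import Image

train_dir = "myfolder/train"  # path from the config above

for name in sorted(os.listdir(os.path.join(train_dir, "images"))):
    img = Image.open(os.path.join(train_dir, "images", name))
    lbl = Image.open(os.path.join(train_dir, "labels", name))
    if img.size != lbl.size:
        # A 1-pixel mismatch like 747 vs. 746 would show up here.
        print(name, img.size, lbl.size)
```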

Rami6786 commented 3 years ago

@tralfamadude Sure. Thank you for your assistance.

Rami6786 commented 3 years ago

@tralfamadude After increasing the batch_size to 2 (from 1), the above issue is solved.

tralfamadude commented 3 years ago

The "Incompatible shapes" error went away when you increased the batch size?

Rami6786 commented 3 years ago

Yes.

Rami6786 commented 3 years ago

@tralfamadude It takes more than 5 hours to train on 300+ images. How can we reduce the training time?