NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Inference with a fine-tuned HuggingFace TableTransformerForObjectDetection model #316

Open sanprit opened 1 year ago

sanprit commented 1 year ago
import torch
from PIL import Image
from transformers import (
    DetrFeatureExtractor,
    TableTransformerConfig,
    TableTransformerForObjectDetection,
)

# images_list and plot_results are assumed to be defined earlier
# (plot_results comes from the tutorial notebooks).
file_path = 'images/' + images_list[0]
image = Image.open(file_path).convert("RGB")

configuration = TableTransformerConfig('structure_config.json')
feature_extractor = DetrFeatureExtractor(config=configuration)

encoding = feature_extractor(image, return_tensors="pt")
#model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
model = TableTransformerForObjectDetection.from_pretrained("model_5.pth", config=configuration)
with torch.no_grad():
    outputs = model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]
plot_results(image, results['scores'], results['labels'], results['boxes'])

Error: it does not detect the table structure, and it also throws a warning:


- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
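
This warning generally means from_pretrained could not match the parameter names in the checkpoint to the HF architecture, so the unmatched weights are randomly initialized. A quick way to see which names a raw .pth checkpoint actually contains (a minimal sketch, assuming the file is a plain state dict saved by the original TATR training code):

import torch

# Load the raw checkpoint on CPU and print a few parameter names.
# If they follow the original repo's naming rather than the HF naming,
# from_pretrained cannot match them, which triggers the warning above.
state_dict = torch.load("model_5.pth", map_location="cpu")
for name in list(state_dict)[:10]:
    print(name)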
Ashwani-Dangwal commented 1 year ago

@sanprit Any luck using the fine-tuned model?

Prabhav55 commented 1 year ago

@NielsRogge

I am facing this issue as well. I cannot load any model weights, since the weight names in my checkpoint do not match the ones in the original Table Transformer config. Is there a way to map the weight names to the ones TATR expects?

Regards, Prabhav

NielsRogge commented 1 year ago

Can you please show the full warning? If you trained a TableTransformerForObjectDetection model, you should be able to load all weights when performing inference.
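
For reference, the usual round trip when training with the HF class looks like this (a minimal sketch; "my_finetuned_tatr" is a hypothetical output directory):

from transformers import TableTransformerForObjectDetection

# Start from the pretrained structure-recognition checkpoint, fine-tune, ...
model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
# ... then save with save_pretrained so the weights keep the HF naming:
model.save_pretrained("my_finetuned_tatr")

# Loading for inference then works without any name mismatch:
model = TableTransformerForObjectDetection.from_pretrained("my_finetuned_tatr")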

Prabhav55 commented 1 year ago

Hey,

Just to clarify -

from transformers import DetrImageProcessor, TableTransformerConfig, TableTransformerForObjectDetection

configuration = TableTransformerConfig('structure_config.json')
feature_extractor = DetrImageProcessor(config=configuration)
model_structure = TableTransformerForObjectDetection.from_pretrained("/home/ubuntu/DEV/tatr-finetuning/fintabnet-process/FinTabNet.c_Image_Structure_PASCAL_VOC/output/20230701070646/model_10.pth", config=configuration)

When I do this, the warning I get is:

Some weights of TableTransformerForObjectDetection were not initialized from the model checkpoint at /home/ubuntu/DEV/tatr-finetuning/fintabnet-process/FinTabNet.c_Image_Structure_PASCAL_VOC/output/20230701070646/model_10.pth and are newly initialized: ['decoder.layers.1.final_layer_norm.weight', 'backbone.conv_encoder.model.layer3.2.bn3.running_var', 'backbone.conv_encoder.model.layer2.0.bn1.bias', 'backbone.conv_encoder.model.layer3.0.conv1.weight', 'backbone.conv_encoder.model.layer4.1.bn3.running_mean', 'backbone.conv_encoder.model.layer2.3.bn3.running_mean', 'encoder.layers.5.fc2.weight', 'backbone.conv_encoder.model.layer1.2.bn2.bias' ........

I have truncated the output.

Regards, Prabhav Singh

NielsRogge commented 1 year ago

Hi,

To convert checkpoints from the original repo to the HF format, I'd recommend using the conversion script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/table_transformer/convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py.

For that, you need to git clone the Transformers library and then run

python src/transformers/models/table_transformer/convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py

However, you might need to tweak the script a bit to account for your new model.
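
The exact flags are best checked in the script's argparse; based on similar DETR conversion scripts, the invocation presumably looks something like:

python src/transformers/models/table_transformer/convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py \
    --checkpoint_url <url-or-path-to-your-checkpoint> \
    --pytorch_dump_folder_path ./converted_tatr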

Ashwani-Dangwal commented 1 year ago

Hi @NielsRogge, I was wondering whether you made some changes to the model while uploading it to Hugging Face, as the results from the Hugging Face model and the original model on GitHub show variation in table structure recognition on the same images.

Ashwani-Dangwal commented 1 year ago

@Prabhav55, @sanprit Can you please share what learning rate you used while fine-tuning, and what the AP was after training? Also, did you change any other hyperparameters?

Prabhav55 commented 1 year ago

Hi,

I am attaching the training parameters I used and the command I used to train (all from the original table-transformer repo):

{
    "lr":5e-5,
    "lr_backbone":1e-5,
    "batch_size":2,
    "weight_decay":1e-4,
    "epochs":20,
    "lr_drop":1,
    "lr_gamma":0.9,
    "clip_max_norm":0.1,

    "backbone":"resnet18",
    "num_classes":6,
    "dilation":false,
    "position_embedding":"sine",
    "emphasized_weights":{},

    "enc_layers":6,
    "dec_layers":6,
    "dim_feedforward":2048,
    "hidden_dim":256,
    "dropout":0.1,
    "nheads":8,
    "num_queries":125,
    "pre_norm":true,

    "masks":false,

    "aux_loss":false,

    "mask_loss_coef":1,
    "dice_loss_coef":1,
    "ce_loss_coef":1,
    "bbox_loss_coef":5,
    "giou_loss_coef":2,
    "eos_coef":0.4,

    "set_cost_class":1,
    "set_cost_bbox":5,
    "set_cost_giou":2,

    "device":"cuda",
    "seed":42,
    "start_epoch":0,
    "num_workers":1
}

Command:

python main.py --data_type structure --config_file structure_config.json --data_root_dir /path/to/structure_data

@NielsRogge I also tried the script you mentioned, convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py, but for this script the assertions in the validation part are failing.

Regards, Prabhav

Ashwani-Dangwal commented 1 year ago

@Prabhav55, thanks for the reply. Did you fine-tune the model or train it from scratch?

Prabhav55 commented 1 year ago

@Ashwani-Dangwal I am not entirely sure about this. From what I could understand, main.py (https://github.com/microsoft/table-transformer/blob/main/src/main.py) loads the base DETR model and then trains on top of it.

I assumed that if the model uploaded to HF was trained using the same script, my weights should also load into the same HF model class.

Ashwani-Dangwal commented 1 year ago

@Prabhav55 What I meant to ask was whether the model you trained was fine-tuned from the checkpoint provided by the author (pubtables1m_structure_detr_r18.pth), or whether you trained the model from scratch on your own dataset?

NielsRogge commented 12 months ago

Hi,

Note that the logits that are verified here are the ones of the pre-trained detection and table structure recognition checkpoints. You will have different logits if you trained the model yourself.

It's definitely good practice to verify that the original model and the HF model give the same results.
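
Such a check can be as simple as comparing raw logits on identical inputs. A minimal sketch, where original_model, hf_model, and pixel_values are assumed to be prepared already, and the original DETR-style model is assumed to return a dict with "pred_logits":

import torch

# Run both models on the same preprocessed image and compare class logits.
with torch.no_grad():
    original_logits = original_model(pixel_values)["pred_logits"]
    hf_logits = hf_model(pixel_values=pixel_values).logits

print(torch.allclose(original_logits, hf_logits, atol=1e-4))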

Ashwani-Dangwal commented 12 months ago

@NielsRogge Thanks for the reply. I ran both the original model and the one on Hugging Face, and the outputs are as follows.

Output of the original model: [image]

Output of the Hugging Face model: [image]

Do you have any idea why there is this much difference in table structure recognition on the same image?

NielsRogge commented 12 months ago

@Ashwani-Dangwal thanks a lot for visualizing, that seems like a bug. Are you seeing the same with the detection model?

Could you verify the logits of both the original model and the HF model on the same inputs? It could also be a difference in the postprocessing of the logits.

Ashwani-Dangwal commented 12 months ago

@NielsRogge How do I print the logits of the Hugging Face model? Also, the post-processing steps are the same for inference with both models; they are taken from the repo of Brandon Smock (author of the original model). The functions used are mainly 'objects_to_structures', 'structure_to_cells', and 'cells_to_csv'; these three functions cover all the post-processing steps in postprocess.py in the original model repo.

Also, why is it that if I comment out the max size parameter in the original inference code [image], the detection model is not able to detect the table accurately? For example, here is the detected table region when the max size parameter is commented out [image], and here it is when max_size is set [image].

However, with the Hugging Face model, whether I use feature_extractor = DetrFeatureExtractor(do_resize=True, max_size=800) or leave out the parameters entirely and just write feature_extractor = DetrFeatureExtractor(), I still get the same result, which is as follows: [image]

WalidHadri-Iron commented 12 months ago

@Ashwani-Dangwal To get the logits with the HF model: they are part of the model output, and you can access them using

model(**encoding).logits
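
For completeness, the same output object also carries the predicted boxes; a quick sanity check of the shapes (a sketch, assuming encoding was produced by the image processor as in the snippets above):

import torch

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.logits.shape)      # (batch_size, num_queries, num_labels + 1)
print(outputs.pred_boxes.shape)  # (batch_size, num_queries, 4)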

Could this problem be coming from the pre-processing of the image? @NielsRogge

Ashwani-Dangwal commented 12 months ago


Thank you!

NielsRogge commented 12 months ago

@Ashwani-Dangwal thanks for providing those, but there's no need to pollute the thread with all the values; posting the first 3 rows of both the original and HF logits suffices. Also make sure that the inputs were prepared in the same way when obtaining those logits.

Ashwani-Dangwal commented 12 months ago

@NielsRogge Sorry about that, I deleted that post. Here are the logits of the Hugging Face model:

tensor([[[-1.1852e+01, -5.1195e+00,  8.9091e+00, -7.7407e+00, -4.9734e+00, -3.5293e+00,  1.1821e+00],
         [-1.0989e+01, -6.1581e+00, -2.7933e+00, -3.7990e+00, -6.2274e+00, -5.7886e+00,  3.3452e+00],
         [-2.7404e+01, -8.7458e+00, -7.2998e+00, -1.3358e+01, -1.1446e+01, -9.4610e-01,  4.0263e+00]

Here are the logits of the original model:

tensor([[[-1.3317e+01, -6.4428e+00,  7.6415e+00, -8.4886e+00, -5.4992e+00, -3.6403e+00,  1.7129e+00],
         [-1.4038e+01, -7.8999e+00, -1.3723e+00, -4.5212e+00, -5.2498e+00, -5.5131e+00,  3.1864e+00],
         [-2.1346e+01, -9.2924e+00, -4.1696e+00, -1.0014e+01, -5.8311e+00, -1.5367e+00,  2.4998e+00]

I can confirm the inputs were prepared in the same way, with the same amount of padding added after detecting the table, and that all parameters such as max_size are the same.

NielsRogge commented 12 months ago

@Ashwani-Dangwal could you share the code snippets used to generate the above visualizations (perhaps as a Github gist)?

WalidHadri-Iron commented 12 months ago


@Prabhav55 I just used the conversion code and it's working fine. Of course, if you keep the next two assertions and the weights are not the same, they are going to fail:

assert torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4)

assert torch.allclose(outputs.pred_boxes[0, :3, :3], expected_boxes, atol=1e-4)
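
For a custom checkpoint those hard-coded expected values won't apply; a sketch of an alternative check, assuming you can run the original model on the same preprocessed image that produced outputs:

import torch

# Compute reference values from your own original model instead of the
# hard-coded expected_logits / expected_boxes for the Microsoft checkpoints.
with torch.no_grad():
    reference = original_model(pixel_values)  # DETR-style dict output

assert torch.allclose(outputs.logits[0, :3, :3], reference["pred_logits"][0, :3, :3], atol=1e-4)
assert torch.allclose(outputs.pred_boxes[0, :3, :3], reference["pred_boxes"][0, :3, :3], atol=1e-4)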

Prabhav55 commented 12 months ago


@WalidHadri-Iron Thanks a lot! I figured the same. Were you able to run inference with the Hugging Face method after loading the state dict? I was still getting a similar error after that.

WalidHadri-Iron commented 12 months ago

@Prabhav55 After doing the conversion, I loaded the model using

TableTransformerForObjectDetection.from_pretrained(model_folder_path)

where model_folder_path is the path to the folder containing the three files produced by the conversion.
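
In other words (a sketch; the exact file names depend on what the conversion script wrote, presumably a config plus the converted weights and a preprocessor config):

from transformers import DetrImageProcessor, TableTransformerForObjectDetection

model_folder_path = "./converted_tatr"  # hypothetical output folder of the conversion
model = TableTransformerForObjectDetection.from_pretrained(model_folder_path)
processor = DetrImageProcessor.from_pretrained(model_folder_path)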

Ashwani-Dangwal commented 12 months ago

@NielsRogge I have added you as a collaborator; you can check out the code for inference and visualization. Thank you.