TongkunGuan / CCD

[ICCV2023] Self-supervised Character-to-Character Distillation for Text Recognition
https://openaccess.thecvf.com/content/ICCV2023/papers/Guan_Self-Supervised_Character-to-Character_Distillation_for_Text_Recognition_ICCV_2023_paper.pdf

Can you update finetune_training for selfsupervised_kmeans? #17

Open machoangha opened 2 weeks ago

machoangha commented 2 weeks ago

Hi,

I have fine-tuned my model using supervised mode on my custom data. However, when I switch to selfsupervised_kmeans and add the mask file, I notice that the output shapes of the data from the train_data_loader_iter.next() method are inconsistent with those from the supervised mode.

Observations:

  • Supervised Mode Output:

    • First item: torch.Size([3, 32, 128])
    • Second item: torch.Size([1, 25])
  • Self-Supervised KMeans Mode Output:

    • First item: torch.Size([3, 3, 32, 128])
    • Second item: torch.Size([32, 128])
    • Third item: torch.Size([3, 3])

Context: The size of each mode is printed in the training script at this line: https://github.com/TongkunGuan/CCD/blob/543109a1e1d9acd15080abb3e4e72d68588ba493/train_finetune.py#L269.

Questions:

  1. The paper seems to describe self-supervised learning as creating 2 additional augmented views, so the first item stacks three images into torch.Size([3, 3, 32, 128]), the second item is the mask torch.Size([32, 128]), and the third is the affine matrix torch.Size([3, 3]). Therefore, I believe this output is not compatible with the current training script.
  2. Could you please provide the fine-tuning code for selfsupervised mode?

Thank you!

TongkunGuan commented 2 weeks ago

Hi,

I have fine-tuned my model using supervised mode on my custom data. However, when I switch to selfsupervised_kmeans and add the mask file, I notice that the output shapes of the data from the train_data_loader_iter.next() method are inconsistent with those from the supervised mode.

Observations:

  • Supervised Mode Output:

    • First item: torch.Size([3, 32, 128])
    • Second item: torch.Size([1, 25])
  • Self-Supervised KMeans Mode Output:

    • First item: torch.Size([3, 3, 32, 128])
    • Second item: torch.Size([32, 128])
    • Third item: torch.Size([3, 3])

Context: The size of each mode is printed in the training script at this line: https://github.com/TongkunGuan/CCD/blob/543109a1e1d9acd15080abb3e4e72d68588ba493/train_finetune.py#L269

Questions:

  1. The paper seems to describe self-supervised learning as creating 2 additional augmented views, so the first item stacks three images into torch.Size([3, 3, 32, 128]), the second item is the mask torch.Size([32, 128]), and the third is the affine matrix torch.Size([3, 3]). Therefore, I believe this output is not compatible with the current training script.
  2. Could you please provide the fine-tuning code for selfsupervised mode?

Thank you!

We used only the supervised mode in the fine-tuning file; you can modify this file to suit your needs.
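Since the fine-tuning script only expects the supervised (image, label) pair, one starting point for such a modification is to unpack the richer selfsupervised_kmeans items before the forward pass. The sketch below is a hypothetical illustration, not code from the CCD repository: `unpack_kmeans_batch` is an invented helper, and the assumption that view 0 of the stacked tensor is the base image (with views 1–2 the augmented copies) is unverified.

```python
# Hypothetical sketch (not the repository's code): unpacking a
# selfsupervised_kmeans batch whose per-sample items, as reported above,
# are images [3, 3, 32, 128], mask [32, 128], and affine matrix [3, 3].
import torch

def unpack_kmeans_batch(batch):
    # After default DataLoader collation each item gains a batch dim:
    # images [B, 3, 3, 32, 128], masks [B, 32, 128], affines [B, 3, 3].
    images, masks, affines = batch
    # Assumption: view 0 is the base image; views 1-2 are the two
    # augmented copies described in the paper.
    base = images[:, 0]                      # [B, 3, 32, 128]
    aug_a, aug_b = images[:, 1], images[:, 2]
    return base, aug_a, aug_b, masks, affines

# Toy batch of size 4 with the shapes reported in this thread
batch = (torch.randn(4, 3, 3, 32, 128),
         torch.randn(4, 32, 128),
         torch.randn(4, 3, 3))
base, aug_a, aug_b, masks, affines = unpack_kmeans_batch(batch)
print(base.shape, aug_a.shape, masks.shape, affines.shape)
```

From here the supervised loss could be computed on `base` alone, while the masks and affine matrices would only matter if you also port the distillation objective.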

machoangha commented 1 week ago


Hi,

I would like to confirm whether my understanding is correct: during the pretraining phase of CCD, the model uses the self-supervised mode, but when fine-tuning for a specific task such as text recognition, you switch to the supervised mode. Is that correct?

Thank you!