HCA97 / Mosquito-Classifiction

7th place solution of Aicrowd Mosquito Alert Competition
GNU General Public License v3.0

Use label smoothing #20

Closed: fkemeth closed this issue 8 months ago

fkemeth commented 10 months ago

https://arxiv.org/pdf/1512.00567.pdf

https://arxiv.org/abs/1906.02629

HCA97 commented 9 months ago

Luckily, PyTorch already supports label smoothing in CrossEntropyLoss.
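
For reference, it is just an argument on the loss (available since PyTorch 1.10); a minimal sketch with the eps=0.1 we discuss below:

import torch.nn as nn

# epsilon of the target probability mass is spread uniformly over all classes
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)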

fkemeth commented 9 months ago

Good call - I am running a training with label smoothing (eps=0.1) right now.

fkemeth commented 9 months ago

Training finished, but the metrics are not better than without label smoothing. Maybe I will train again with a smaller epsilon (so far I had it at 0.1).

version 1 in notebook:

'epoch=3-val_loss=0.6700520515441895-val_f1_score=0.7779023051261902-val_multiclass_accuracy=0.7659265995025635.ckpt'
'epoch=4-val_loss=0.7104724645614624-val_f1_score=0.794736385345459-val_multiclass_accuracy=0.7664641737937927.ckpt'

fkemeth commented 9 months ago

Hi @HCA97,

I had a closer look at the results (training curves attached). I think training longer might still give us F1 scores of >0.8. Also, the training loss looks quite wiggly - I wonder if the learning rate might be a bit too large. However, I am not sure I understand the LR schedule that we use (get_linear_schedule_with_warmup) and the number of warmup/training steps. I also think that since I have a smaller batch size, the number of steps should probably be different. Do you have any experience with this?

fkemeth commented 9 months ago

Also, the train accuracy and F1 scores are quite bad (training curves attached) - I assume it is either because of dropout or batch norm (batch norm behaves poorly when batches are small). I think it would make sense to replace batch norm with layer norm in the head. As far as I understand it, layer norm should be superior to batch norm. What do you think?

HCA97 commented 9 months ago

I think the train loss is very noisy and the train f1 score looks terrible because we do oversampling.

I think the train loss and train F1 score are poor because of data augmentation. When we compute the validation score we do not use any data augmentation. I wonder if reducing the dropout rate would give a higher score?

Okay, I don't think dropout will change anything, because when we compute the F1 score without data augmentation we get a very high score.

HCA97 commented 9 months ago

Also, the train accuracy and F1 scores are quite bad - I assume it is either because of dropout or batch norm (batch norm behaves poorly when batches are small). I think it would make sense to replace batch norm with layer norm in the head. As far as I understand it, layer norm should be superior to batch norm. What do you think?

Trying different normalizations makes sense; we can try layer norm and no normalization. I used batch norm just to be safe, in case the backbone outputs were too large. I don't think the batch size is too small, because when I tested different batch sizes (8, 16, 32 and 64), a batch size of 16 performed the best.
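
A minimal sketch of what swapping the normalization in the head could look like (the embedding size, class count and dropout here are placeholders, not our exact head):

import torch.nn as nn

EMBED_DIM, NUM_CLASSES, DROPOUT = 768, 6, 0.5  # placeholder values

def make_head(norm: str = "layer") -> nn.Sequential:
    # LayerNorm normalizes each sample on its own, so it is insensitive to batch size;
    # BatchNorm1d uses batch statistics and gets noisy when batches are small.
    norm_layer = {
        "batch": nn.BatchNorm1d(EMBED_DIM),
        "layer": nn.LayerNorm(EMBED_DIM),
        "none": nn.Identity(),
    }[norm]
    return nn.Sequential(norm_layer, nn.Dropout(DROPOUT), nn.Linear(EMBED_DIM, NUM_CLASSES))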

HCA97 commented 9 months ago

Training finished, but the metrics are not better than without label smoothing. Maybe I will train again with a smaller epsilon (so far I had it at 0.1).

version 1 in notebook:

'epoch=3-val_loss=0.6700520515441895-val_f1_score=0.7779023051261902-val_multiclass_accuracy=0.7659265995025635.ckpt'
'epoch=4-val_loss=0.7104724645614624-val_f1_score=0.794736385345459-val_multiclass_accuracy=0.7664641737937927.ckpt'

I think the result looks good; achieving an F1 score close to 0.80 is already satisfactory for me. Since a 0.02-point difference is not substantial, it might perform even better on the test dataset.

Additionally, I suggest we compile a set of the best hyperparameters, compute cross-validation for each hyperparameter, and then analyze their mean F1 Score.

What I observe is that the val_loss is a bit higher than usual (it is usually less than 0.4), but this might be caused by label smoothing, since the ground-truth targets are no longer exactly 0 or 1.

HCA97 commented 9 months ago

Also, the training loss looks quite wiggly - I wonder if the learning rate might be a bit too large. However, I am not sure I understand the LR schedule that we use (get_linear_schedule_with_warmup) and the number of warmup/training steps. I also think that since I have a smaller batch size, the number of steps should probably be different. Do you have any experience with this?

I am not so experienced with it either :smile: I copied the LR values from the GUIE 4th-place people. When I tried values different from theirs I got worse results.

That is true, the warmup steps are independent of the epoch size, so when we change the batch size we should in theory adjust the warmup parameters, but I didn't do it :smile: So far it is working fine :smile:

Maybe setting the warmup steps based on the batch size (the initial warmup parameters were set for bs=64) might give us better results, but I doubt it.

This is the LR scheduler we use https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup
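
For reference, a minimal sketch of how it is typically wired up (the model, learning rate and step counts here are only illustrative):

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # dummy model just for the sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,      # LR ramps linearly from 0 up to the base LR
    num_training_steps=12800,   # then decays linearly back towards 0
)

# in the training loop, call scheduler.step() once per optimizer step:
# optimizer.step(); scheduler.step()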

fkemeth commented 9 months ago

Hi @HCA97 ,

I trained again with label smoothing of 0.1, but now with

Here we get the following for the best two models:

'epoch=5-val_loss=0.7001691460609436-val_f1_score=0.8165072798728943-val_multiclass_accuracy=0.7728392481803894.ckpt'
'epoch=6-val_loss=0.7470129132270813-val_f1_score=0.8084301948547363-val_multiclass_accuracy=0.7685869336128235.ckpt'

Below are the train loss and val F1 curves (attached).

I think using learning rate schedules definitely makes sense, but I first want to understand what the best approach is.

What would be interesting to me is how well the best model performs on the test data. Also, we might want to average the weights of the best two models and see how that performs - see https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/. What do you think?
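
For the averaging itself, a minimal sketch with torch.optim.swa_utils (the checkpoint paths and the build_model helper are placeholders, and it assumes the checkpoints' state_dict keys match the model; BatchNorm statistics would still need recomputing, e.g. with swa_utils.update_bn):

import torch
from torch.optim.swa_utils import AveragedModel

model = build_model()             # hypothetical helper that builds our classifier
swa_model = AveragedModel(model)  # keeps a running (equal-weight) average of the weights

for ckpt_path in ["best_epoch_a.ckpt", "best_epoch_b.ckpt"]:  # placeholder paths
    state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
    model.load_state_dict(state)
    swa_model.update_parameters(model)

# the averaged weights live in swa_model.module
torch.save({"state_dict": swa_model.module.state_dict()}, "averaged.ckpt")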

fkemeth commented 9 months ago

Additionally, I suggest we compile a set of the best hyperparameters, compute cross-validation for each hyperparameter, and then analyze their mean F1 Score.

-> Yes, I agree. However, for me one training run of 8 epochs takes 6 hours in the Kaggle notebook. I still use your default parameters, except for the alterations I mentioned above. I pushed everything to the kaggle branch.

What I observe is that the val_loss is a bit higher than usual (it is usually less than 0.4), but this might be caused by label smoothing, since the ground-truth targets are no longer exactly 0 or 1.

-> Yes, the loss is trickier, since the model has to learn to put 0.1/6 probability on every class other than the true one, which is much more difficult than putting 0 probability on all other classes. So I think it is expected that we get a higher loss on train and val.
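
To make that concrete: with PyTorch's convention the smoothed target puts 1 - eps + eps/K on the true class and eps/K on each other class; a quick check with eps=0.1 and (assuming) K=6 classes:

eps, K = 0.1, 6
true_class = 1 - eps + eps / K   # ~0.9167 instead of 1.0
other_class = eps / K            # ~0.0167, the "0.1/6" mentioned above
print(true_class, other_class, true_class + (K - 1) * other_class)  # the row still sums to 1.0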

fkemeth commented 9 months ago

Hi @HCA97,

I also trained now with the above settings AND the linear learning rate schedule, with

  • 1000 warmup steps
  • 12800 training steps (which is more than the steps we have, but should be ok I think)

Below are the results:

'epoch=5-val_loss=0.7609413266181946-val_f1_score=0.8209875226020813-val_multiclass_accuracy=0.7888965606689453.ckpt'
'epoch=7-val_loss=0.7274188995361328-val_f1_score=0.8357369303703308-val_multiclass_accuracy=0.8175098896026611.ckpt'

I think the schedule indeed has a positive effect - in particular, I think the smaller learning rates at the end help. We might even train for a few epochs longer.

Could you submit one of the models, just to check that those results transfer to the hidden test set? This experiment is under version_4, whereas the experiment above is under version_3 in the kaggle notebook.

HCA97 commented 9 months ago

Nice results, could you link me the Kaggle notebook outputs?

Seems like there are no outputs (screenshot attached).

HCA97 commented 9 months ago

I think the schedule indeed has a positive effect - in particular, I think the smaller learning rates at the end help. We might even train for a few epochs longer.

I am a bit scared of overfitting, but as long as the validation score goes up it is fine, I guess. I think we need a ramp-up step as well - in the first few iterations there will be a large gradient coming from the head layer.

fkemeth commented 9 months ago

Have you tried editing the latest version? It should show output files there.


HCA97 commented 9 months ago

I think you need to save the version with its outputs; I don't think I can access your session.

I think you need a setting something like the one in the attached screenshot.

fkemeth commented 9 months ago

I see, sorry about that. I saved it now with output under Version 20. Please let me know if you cannot see it.

HCA97 commented 9 months ago

The outputs are still empty. I pinned Version 20 as the default, thinking that might help, but it is still empty. I think we are missing a step.

HCA97 commented 9 months ago

Interestingly, the logs are not empty (https://www.kaggle.com/code/fkemeth/pho-experimentation/log?scriptVersionId=144601644), so the notebook did in fact run.

fkemeth commented 9 months ago

I tried again - Version 21 now - can you see the output there?

Can you maybe go to notebook - edit - and run the first code cell?

HCA97 commented 9 months ago

now I can see the outputs :partying_face:

https://www.kaggle.com/code/fkemeth/pho-experimentation/output?scriptVersionId=144603243

HCA97 commented 9 months ago

I managed to submit the Version=4 Epoch=7 and achieved 0.80 on the public leaderboard :partying_face: We are in 4th place now.

I am trying to submit the other checkpoints as well, but the submissions are very unstable - they just fail for no reason (sometimes a timeout, sometimes Docker doesn't build, etc.). I think more people have started to submit their solutions, so there is more load on the servers.

I also trained now with the above settings AND the linear learning rate schedule, with

  • 1000 warmup steps
  • 12800 training steps (which is more than the steps we have, but should be ok I think)

What are your hyperparameters exactly? (LayerNorm and?) You trained with the original annotations, right? I am training the YOLO model for the hierarchical classifier. I am using the cleaned annotations; maybe that can give us a slightly better detection score.

fkemeth commented 9 months ago

Thank you for the submissions! We indeed overfit a bit. Maybe having a validation size of 0.25 might make our predictions a bit more trustworthy. I also agree it would be interesting to see the performance of the other models; hopefully the submissions run through again soon. What is interesting is that the first few places have F1 scores of 0.83x, very close together. I assume they all use the same approach/curriculum learning/model. I will again do some reading on past challenges.

The hyperparameters I used are

MODEL_NAME = "ViT-L-14"
PRETRAIN_DATASET = "datacomp_xl_s13b_b90k"
BATCH_SIZE = 16
HEAD_NUMBER = 4
AUGMENTATION = "hca"
FREEZE_BACKBONE = False
WARMUP_STEPS = 1000
EPOCHS = 8
LABEL_SMOOTHING = 0.1
DROPOUT_RATE = 0.5
USE_LINEAR_SCHEDULE = True
USE_LAYER_NORM = True

with the original annotations. I haven't tested the new annotations yet. When the submissions are through, I can remove the old models from the kaggle notebook output (limited output size) and train with the new annotations.

HCA97 commented 9 months ago

You can remove them; they are already in the git history in GitLab.

I downloaded the following models:

* version=4 epoch=7,5

* version=3 epoch=5

Is there any other model you would like to submit?

HCA97 commented 9 months ago

We indeed overfit a bit. Maybe having a validation size of 0.25 might make our predictions a bit more trustworthy. I also agree it would be interesting to see the performance of the other models; hopefully the submissions run through again soon.

Increasing it makes sense; maybe we can split the data into 3 chunks: 75% training, 15% validation, 10% final testing? But I'd say 0.03 points is not much of a difference, since the validation dataset is small.
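
A minimal sketch of such a split (the CSV path and label column name are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("phase2_train_v0_cleaned.csv")  # assumed annotation file
train_df, rest_df = train_test_split(df, test_size=0.25, stratify=df["class_label"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.4, stratify=rest_df["class_label"], random_state=42)
# 75% train, 15% validation (0.25 * 0.6), 10% final test (0.25 * 0.4)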

Yeah, I think there is overfitting going on. I found two interesting datasets that can help us - they have images for the rare classes (https://github.com/HCA97/Mosquito-Classifiction/issues/10#issuecomment-1737597997).

fkemeth commented 9 months ago

You can remove them; they are already in the git history in GitLab.

I downloaded the following models:

* version=4 epoch=7,5

* version=3 epoch=5

Is there any other model you would like to submit?

Awesome, thank you!

I had a look at the GUIE 4th-place solution again (https://www.kaggle.com/competitions/google-universal-image-embedding/discussion/359487), and also given the discussion about weight averaging (https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/), I would like to test

to check if the resulting models then generalize better. What do you think? If you agree, I can do the averaging and provide the resulting models. It should generalize better; however, I do not think this will improve our solution by a large margin. It would still be interesting to test, though.

Is the submission repo still so large? If not I can also do the submissions.

HCA97 commented 9 months ago

Yes, that makes sense; you can try a linear interpolation between the two models instead of simply averaging them, like the GUIE 4th-place people did. I did something like that in the Product Recognition challenge and it increased my score by around 0.06 points. It had a different metric than this challenge, but it was still a good improvement.

You can find the Jupyter Notebook here: https://github.com/HCA97/Product-Recognition/blob/aicrowd/experiments/weight_ensembele.ipynb.
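
A minimal sketch of that kind of interpolation (not the notebook's exact code; it assumes Lightning checkpoints whose weights sit under the "state_dict" key, alpha=0.5 reduces to plain averaging, and integer buffers such as BatchNorm's num_batches_tracked may need special handling):

import torch

def interpolate_checkpoints(path_a: str, path_b: str, alpha: float = 0.5) -> dict:
    """Blend two checkpoints key by key: alpha * A + (1 - alpha) * B."""
    sd_a = torch.load(path_a, map_location="cpu")["state_dict"]
    sd_b = torch.load(path_b, map_location="cpu")["state_dict"]
    assert sd_a.keys() == sd_b.keys(), "checkpoints must come from the same architecture"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# example: 70/30 blend of two placeholder checkpoints
# blended = interpolate_checkpoints("run_a.ckpt", "run_b.ckpt", alpha=0.7)
# torch.save({"state_dict": blended}, "blended.ckpt")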

I think averaging models from different runs makes more sense than averaging models from the same run. What do you think? We can have multiple runs with different data augmentation, like the GUIE 4th-place people, or with different learning rates.

I am not sure about the PyTorch blog post. I would say give it a try; if it takes too much time we can abandon the idea.

Can you try to clone the main branch? I think the repo can be very large because each time you create a submission you create a new tag, and each submission is around 4-6 GB. If you have problems downloading the repo, I will try to delete some stuff; maybe that will help.

HCA97 commented 9 months ago

version 4 epoch 7: https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/86

version 4 epoch 5: https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/92

version 3 epoch 5: https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/91

fkemeth commented 9 months ago

Interesting! So our val f1 score seems to always correlate with the challenge score, but still with a very large gap.

fkemeth commented 9 months ago

I managed to submit the Version=4 Epoch=7 and achieved 0.80 on the public leaderboard 🥳 We are in 4th place now.

I am trying to submit the other checkpoints as well, but the submissions are very unstable - they just fail for no reason (sometimes a timeout, sometimes Docker doesn't build, etc.). I think more people have started to submit their solutions, so there is more load on the servers.

I also trained now with the above settings AND the linear learning rate schedule, with

  • 1000 warmup steps
  • 12800 training steps (which is more than the steps we have, but should be ok I think)

What are your hyperparameters exactly? (LayerNorm and?) You trained with the original annotations, right? I am training the YOLO model for the hierarchical classifier. I am using the cleaned annotations; maybe that can give us a slightly better detection score.

How do you deal with the images that have more than one mosquito? Do you filter them out? I can imagine that this would help when training the YOLO model.

HCA97 commented 9 months ago

I used OWL-ViT to annotate the images with multiple mosquitos and fixed the poor annotations. In phase2_train_v0_cleaned.csv you can see that some images have multiple annotations. Check the YOLO and OWL-ViT Check Annotation Quality section at https://github.com/HCA97/Mosquito-Classifiction/blob/hc_arcface_cleaning/test_annotations.ipynb

Note: There might still be noisy annotations.

fkemeth commented 9 months ago

From your comment, I understood that you finetune the YOLO model for object detection. (and then also the hierarchical classifier for classification).

My understanding is the following - the Yolo model they provided as the baseline was trained using the bounding boxes they have in the training data. Is that right? Now, in their training data, they have images with more than one mosquito but just one bounding box. Now, you used the Owl-ViT to update the bounding boxes - in the annotations you will then have multiple rows for each of those images, right? With one row for each bounding box.

If you finetune the YOLO model with the corrected annotations, do you always show it just one correct bounding box? If so, I think neglecting images with more than one mosquito might help the training of the object localizer.

But maybe I am confusing something.

HCA97 commented 9 months ago

From your comment, I understood that you finetune the YOLO model for object detection. (and then also the hierarchical classifier for classification).

Yes, but keep in mind we never used the provided YOLO model, we always trained our own model.

My understanding is the following - the Yolo model they provided as the baseline was trained using the bounding boxes they have in the training data. Is that right?

Yes, it is correct.

Now, in their training data, they have images with more than one mosquito but just one bounding box. Now, you used the Owl-ViT to update the bounding boxes - in the annotations you will then have multiple rows for each of those images, right? With one row for each bounding box.

Yes, each row represents one annotation.

If you finetune the YOLO model with the corrected annotations, do you always show it just one correct bounding box? If so, I think neglecting images with more than one mosquito might help the training of the object localizer.

No, during training the model must find all the boxes - here is some example data below. But neglecting images with multiple annotations makes sense too.

Ground Truth Annotations:

0 0.31526295731707316 0.37309451219512196 0.12290396341463415 0.14303861788617886
0 0.5487804878048781 0.5407774390243902 0.18064024390243902 0.22129065040650406
0 0.10480182926829268 0.4791666666666667 0.12157012195121951 0.0899390243902439
0 0.4725609756097561 0.18838922764227642 0.14634146341463414 0.18521341463414634
0 0.4176829268292683 0.9076473577235772 0.11166158536585366 0.17098577235772358
0 0.34489329268292684 0.6101371951219512 0.10670731707317073 0.13998983739837398
0 0.9405487804878049 0.4427083333333333 0.11013719512195122 0.23958333333333334
0 0.25590701219512196 0.5593241869918699 0.13414634146341464 0.1877540650406504
0 0.7915396341463414 0.23285060975609756 0.12347560975609756 0.24771341463414634
0 0.2774390243902439 0.2033790650406504 0.13757621951219512 0.13846544715447154

Training image: (attached)
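
For context, each line above is in YOLO label format - class x_center y_center width height, normalized by the image size - with one line per box; a minimal sketch of converting one line back to pixel corner coordinates (the image size below is made up):

def yolo_to_xyxy(line: str, img_w: int, img_h: int):
    """Convert one normalized YOLO label line to (class, (x1, y1, x2, y2)) in pixels."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = float(xc) * img_w, float(yc) * img_h, float(w) * img_w, float(h) * img_h
    return int(cls), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

print(yolo_to_xyxy("0 0.5487804878048781 0.5407774390243902 0.18064024390243902 0.22129065040650406", 1280, 960))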

fkemeth commented 9 months ago

Ok got it! Thank you for the explanation!

fkemeth commented 9 months ago

Hi @HCA97,

I made a submission for the model soup, but got an Evaluation Time Out error, see https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/95. I uploaded the model soup and just replaced the CLIP model name with the name of the new checkpoint. Also, the model soup checkpoint created with your function is half the size of the other model checkpoints (screenshot attached).

Any idea where this comes from?

Do you think the evaluation timeout can occur if downloading all the files takes too long? Shall I remove the checkpoints that are not used?

fkemeth commented 9 months ago

I removed the old checkpoints. The soup of the two best models from a single run (version 4 in the Kaggle notebook) ran through, but gave worse scores than just using the best model. I am now also submitting a soup of the two best models from version 3 and version 4.

HCA97 commented 9 months ago

Any idea where this comes from?

PyTorch Lightning stores not only the model weights but also additional data (I think it stores the optimizer state as well) so that you can resume training without a problem. That is probably why the souped model is much smaller.
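
For reference, a minimal sketch of stripping that extra training state after the fact (paths are placeholders):

import torch

ckpt = torch.load("full_checkpoint.ckpt", map_location="cpu")  # placeholder path
print(list(ckpt.keys()))  # e.g. state_dict, optimizer_states, lr_schedulers, epoch, ...
torch.save({"state_dict": ckpt["state_dict"]}, "weights_only.ckpt")  # roughly half the size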

Do you think the evaluation timeout can occur if downloading all the files takes too long? Shall I remove the checkpoints that are not used?

No, downloading the files happens during the Build Packages And Env step. Deleting the unused models makes sense to make that step faster.

How did you soup the models? Just averaging or a weighted mean?

fkemeth commented 9 months ago

Any idea where this comes from?

PyTorch Lightning stores not only the model weights but also additional data (I think it stores the optimizer state as well) so that you can resume training without a problem. That is probably why the souped model is much smaller.

Do you think the evaluation timeout can occur if downloading all the files takes too long? Shall I remove the checkpoints that are not used?

No, downloading the files happens during the Build Packages And Env step. Deleting the unused models makes sense to make that step faster.

How did you soup the models? Just averaging or a weighted mean?

Yes, I changed the ModelCheckpoint callback to

from typing import List

from pytorch_lightning.callbacks import Callback, ModelCheckpoint


def _default_callbacks() -> List[Callback]:
    return [
        ModelCheckpoint(
            monitor="val_f1_score",
            mode="max",
            save_top_k=2,
            save_last=False,
            filename="{epoch}-{val_loss}-{val_f1_score}-{val_multiclass_accuracy}",
            save_weights_only=True,  # store only the model weights, no optimizer/scheduler state
        ),
    ]

that is, I use the save_weights_only=True flag, which leads to the much smaller checkpoints.

fkemeth commented 9 months ago

Any idea where this comes from?

PyTorch Lightning stores not only the model weights but also additional data (I think it stores the optimizer state as well) so that you can resume training without a problem. That is probably why the souped model is much smaller.

Do you think the evaluation timeout can occur if downloading all the files takes too long? Shall I remove the checkpoints that are not used?

No, downloading the files happens during the Build Packages And Env step. Deleting the unused models makes sense to make that step faster.

How did you soup the models? Just averaging or a weighted mean?

Just averaging the weights (I used your model soup function). Averaging models across runs yields even worse results:

repo_url: http://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit
gitlab_project_id: 8577
gitlab_issue_iid: 98
macro_f1: 0.6199411397034559
mean_iou: 0.8225544837981195
macro_f1_nofilter: 0.7113548126024812
num_iou_filtered: 265

In principle, we could tune the weighting between the different models (instead of using just 0.5), but I think it makes sense to try this once we have our final models.

HCA97 commented 9 months ago

that is, I use the save_weights_only=True flag, which leads to the much smaller checkpoints

1.8GB is much more manageable.

In principle, we could tune the weighting between the different models (instead of using just 0.5), but I think it makes sense to try this once we have our final models.

Yes, this will be the last step.

I think everyone uses a two-step architecture, because no one's detection score is changing but the classification scores keep increasing. I will try to submit the hierarchical classifier, but I think we should abandon the idea if the submission score is worse than our current best model. I will do some experiments on ConvNeXt. Or should I focus on a different task?

fkemeth commented 9 months ago

I would also be eager to see how ConvNeXts perform - do you know how fast they are?

Else, I think the localization is the biggest lever - do you have ideas on how we could improve that?

HCA97 commented 9 months ago

do you know how fast they are?

I think it depends on the model (the XL model might be as slow as ViT-L), but since they are convolution-based maybe we can use the channels-last memory format, which might give us a bit of a speedup (maybe 10%).
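
A minimal sketch of the channels-last idea (a stock torchvision ConvNeXt and random input as stand-ins; it needs a CUDA GPU, and the actual speedup depends on the backend):

import torch
import torchvision

model = torchvision.models.convnext_tiny().eval().cuda().to(memory_format=torch.channels_last)
x = torch.randn(16, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    out = model(x)  # NHWC layout lets cuDNN pick faster convolution kernels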

HCA97 commented 9 months ago

Else, I think the localization is the biggest lever - do you have ideas on how we could improve that?

Maybe train our own localization model with the CLIP models? We only need to detect a single mosquito, so it might not be so hard to implement.

fkemeth commented 9 months ago

But CLIP is too slow, isn't it? Do you think we can improve anything with the Yolo model training?

I completely agree, we just have to detect mosquitos. Maybe we can even use other insects that have bounding boxes for training.

HCA97 commented 9 months ago

But CLIP is too slow, isn't it? Do you think we can improve anything with the Yolo model training?

I completely agree, we just have to detect mosquitos. Maybe we can even use other insects that have bounding boxes for training.

Depending on the model - I think ViT-B-16 or ViT-B-32 takes less than 300 ms. But yeah, I think they will be slower than YOLO. I think the YOLO model can be improved (https://github.com/HCA97/Mosquito-Classifiction/issues/22#issue-1910148549) - look at the loss curve, it seems a bit weird; maybe it is due to the learning rate.

The reason I was proposing another CLIP model is that we only need to detect a single mosquito, which means we only need to output 4 values (the corners of the box), which should be easy and fast to train (2-3x faster than YOLO).
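
To sketch the idea (untested; the open_clip model name, pretrained tag and head size are assumptions, and it would be trained with e.g. an L1 or GIoU loss against the single ground-truth box):

import torch.nn as nn
import open_clip

class ClipBoxRegressor(nn.Module):
    """CLIP image encoder plus a small head that predicts one box as 4 normalized values."""
    def __init__(self):
        super().__init__()
        self.backbone, _, self.preprocess = open_clip.create_model_and_transforms(
            "ViT-B-32", pretrained="laion2b_s34b_b79k")  # assumed checkpoint
        self.head = nn.Sequential(nn.Linear(512, 256), nn.GELU(), nn.Linear(256, 4))

    def forward(self, images):
        feats = self.backbone.encode_image(images)       # (B, 512) for ViT-B-32
        return self.head(feats).sigmoid()                # (x1, y1, x2, y2) in [0, 1]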

fkemeth commented 9 months ago

But CLIP is too slow, isn't it? Do you think we can improve anything with the Yolo model training? I completely agree, we just have to detect mosquitos. Maybe we can even use other insects that have bounding boxes for training.

Depending on the model - I think ViT-B-16 or ViT-B-32 takes less than 300 ms. But yeah, I think they will be slower than YOLO. I think the YOLO model can be improved #22 (comment) - look at the loss curve, it seems a bit weird; maybe it is due to the learning rate.

The reason I was proposing another CLIP model is that we only need to detect a single mosquito, which means we only need to output 4 values (the corners of the box), which should be easy and fast to train (2-3x faster than YOLO).

I have no experience with training object localizers. How long does it take to finetune the YOLO model? Do you think it would make sense to adjust the learning rate? Do you know how the current model fails? Does it find the wrong objects, or does it find the right object but with a distorted bounding box? If the latter, we might just continue training the last layer only.

HCA97 commented 9 months ago

How long does it take to finetune the YOLO model? Do you think it would make sense to adjust the learning rate?

2 hours roughly, idk to be honest I am just guessing :)

Do you know how the current model fails? Does it find the wrong objects, or does it find the right object but with a distorted bounding box? If the latter, we might just continue training the last layer only.

Good idea :) I will check it.