HCA97 / Mosquito-Classifiction

7th place solution of Aicrowd Mosquito Alert Competition
GNU General Public License v3.0

Evaluate Owl-ViT on new challenge data #16

Closed fkemeth closed 1 year ago

HCA97 commented 1 year ago

Good Idea :)

Do you know what the inference time of Owl-ViT is? Is it a ViT-B model?

fkemeth commented 1 year ago

Hi @HCA97

I created the bounding boxes using the Owl-ViT model. The version I use, google/owlvit-base-patch32, uses the ViT-B/32 CLIP image encoder. I get mixed results on the training data, with a mean IoU of 0.76 and a median of 0.81. I stored the bounding boxes in the Kaggle Output folder under /kaggle/working/owl_vit_image_bboxes.csv.

Below are some examples where the IoU is below 0.1:

(five example images with IoU below 0.1)

What I learned from that analysis is that Owl-ViT produces some misclassifications. Some of them we may be able to avoid with rule-based postprocessing of the candidate boxes (e.g., don't use boxes that are as large as the entire image).
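That rule-based filter could be as simple as rejecting candidate boxes whose area is nearly the whole image; a minimal sketch, where the 0.9 area threshold is an assumption that would need tuning on the training data:

```python
def filter_candidate_boxes(boxes, img_w, img_h, max_area_frac=0.9):
    """Drop candidate boxes whose area is nearly the whole image.

    boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    max_area_frac: assumed threshold, to be tuned on the training data.
    """
    img_area = img_w * img_h
    kept = []
    for x0, y0, x1, y1 in boxes:
        area = max(0, x1 - x0) * max(0, y1 - y0)
        if area / img_area < max_area_frac:
            kept.append((x0, y0, x1, y1))
    return kept
```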

Also, there are still some images with multiple mosquitoes in the training data. We should use all cutouts for training the classifier, in particular for the under-represented classes, so we may increase our training data a bit.

They said there are no images with more than one mosquito in the test data, but it might still be worth testing a submission with the Owl-ViT bounding boxes. What do you think?

HCA97 commented 1 year ago

I am open to Owl-ViT. If it performs as well as the YOLO model, we can skip training one model, simplifying our workflow.

Could you try to make a submission in GitLab? I did so many submissions there that the GitLab repo might be too large to download now :P I think I already shared the repo with you. If not, or if you cannot, I can do the submission too; I am fine with both options. If you create a submission using the GitLab repo, could you make a branch for yourself so we don't get conflicts in the future? Pulling main takes a bit of time.

HCA97 commented 1 year ago

ViT-B/32 should run about as fast as the YOLOv8-s model, so there should be no problem with inference speed.

fkemeth commented 1 year ago

You are right, the repo is too large for me to clone.

If you want, you can make the submission - the finetuned CLIP model I mentioned earlier is in the Kaggle notebook (the 2nd cell has a download link). But I can also do the submission once I am back from traveling.

I will also check if I can optimize the Owl-ViT bounding box selection a bit; I think it can be improved with some rule-based filtering of the candidate boxes.

HCA97 commented 1 year ago

I saw your experiments. I will try to submit your results today; hopefully they improve our score.

HCA97 commented 1 year ago

I attempted to use OWL-ViT as a replacement for YOLO, but I keep encountering an Inference failed error. Initially, I suspected a runtime error caused by OWL-ViT's slower processing compared to the YOLO model: OWL-ViT takes approximately 600-700 ms per image while YOLO takes 150-300 ms.
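For measurements like these, a minimal timing harness looks like the following; the lambda is a stand-in workload, not the actual model call:

```python
import time

def time_call(fn, n_runs=10, warmup=2):
    """Average wall-clock time of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()  # warmup runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# usage with a dummy workload standing in for the model forward pass
avg_ms = time_call(lambda: sum(range(10_000)))
```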

However, even when I attempted to use ViT-B-16 for classification (to reduce the classification time), I encountered the same error.

For ViT-B-16 submissions, you can find the details in the following links:

Additionally, for ViT-L-14 submissions, you can refer to this link:

Upon reviewing the debug logs, I came across the following messages. I'm not entirely sure this is the root cause, since the validation step passes but the prediction step fails:

'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/preprocessor_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f536769a020>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: eaaf23cc-7807-476a-9b3e-ea79c26a9be0)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/preprocessor_config.json

'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367741420>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: b903e41b-2ea9-4518-8ee0-b4e900dee794)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/tokenizer_config.json

'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367698100>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: d8ac13ee-a131-40db-98d2-19ddcc584737)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/config.json

'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367741990>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 19dfa170-681f-48ef-8061-fee9dcb01156)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/config.json

It appears that Huggingface is attempting to download the model, which doesn't make sense since the model is already cached.
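If the evaluation VM has no internet access, huggingface can be told to never hit the network and rely on the local cache only; whether the starter kit lets us set these before the model is loaded is an assumption:

```python
import os

# Force huggingface_hub / transformers into offline mode so from_pretrained
# only reads the local cache (must be set before importing transformers).
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Alternatively, from_pretrained accepts local_files_only, e.g.:
# model = OwlViTForObjectDetection.from_pretrained(
#     "google/owlvit-base-patch32", local_files_only=True
# )
```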

fkemeth commented 1 year ago

@HCA97 that is weird. It seems there can be different reasons, see here:

https://stackoverflow.com/questions/75110981/sslerror-httpsconnectionpoolhost-huggingface-co-port-443-max-retries-exce

What is confusing is that you have uploaded the models, so huggingface should not try to download them. Could it be that the path is wrong? (I checked, and the cache dir looks right to me.) Could it be a conflict between huggingface versions? The Stack Overflow post says to use a different requests version, but I rather think they have a firewall on their server that prevents huggingface from downloading the model.

HCA97 commented 1 year ago

Definitely, I think there is no internet access. But what baffles me is that the validation step passes, and only half an hour into the prediction step does it fail. I feel like the issue is not caused by internet access but by how unstable the inference time is in the VMs (https://discourse.aicrowd.com/t/submissions-are-quite-unstable/9095).

The path is correct: I turned off my internet, ran the local evaluation script, and it didn't print those error messages. I will try to create an issue in the AIcrowd forum; maybe the organizers can help us. How secretive should I be?

HCA97 commented 1 year ago

I submitted another OWL-ViT submission; if it doesn't work this time, I don't know what to do.

This is submission code: https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/blob/df1a88503c7d5b2d66ba6f011e8a7a95beffbadd/my_models/owl_vit_clip_model.py

fkemeth commented 1 year ago

Maybe we get boxes out of bounds, or the threshold of 0.01 leads to no boxes being detected. Those are things we may want to tune. I will also look into improving the accuracy of the Owl-ViT model, hopefully surpassing the YOLO performance.
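Both failure modes can be guarded with a few lines: clamp every box to the image bounds, and fall back to the full image when nothing clears the score threshold. A sketch, where the 0.01 threshold is from the thread and the full-image fallback is an assumption:

```python
def clamp_box(box, img_w, img_h):
    """Clip an (x_min, y_min, x_max, y_max) box to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0), max(0, y0), min(img_w, x1), min(img_h, y1))

def select_box(boxes, scores, img_w, img_h, threshold=0.01):
    """Best-scoring box above the threshold, else the whole image."""
    best = None
    best_score = threshold
    for box, score in zip(boxes, scores):
        if score >= best_score:
            best, best_score = box, score
    if best is None:
        return (0, 0, img_w, img_h)  # fallback: treat the full image as the box
    return clamp_box(best, img_w, img_h)
```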

fkemeth commented 1 year ago

Given the response from dipam, we should downsample the images if they are too large. What do you think?
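A simple downsampling rule keeps the aspect ratio and caps the longer side; the 1024-pixel cap here is an arbitrary illustrative choice:

```python
def target_size(width, height, max_side=1024):
    """New (width, height) with the longer side capped at max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, keep as-is
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# with PIL this would be applied as, e.g.:
# img = img.resize(target_size(*img.size), Image.BILINEAR)
```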

HCA97 commented 1 year ago

I don't understand how the prediction step can be 100% slower than the validation step.

HCA97 commented 1 year ago

Do you think it is possible to change the input size of the model? Since it is a Transformer model, I don't know if we can simply change its input size. However, according to Huggingface, it seems to be possible:

https://huggingface.co/docs/transformers/v4.33.3/en/model_doc/owlvit#transformers.OwlViTVisionConfig.image_size

https://huggingface.co/docs/transformers/v4.33.3/en/model_doc/owlvit#transformers.OwlViTImageProcessor.size

fkemeth commented 1 year ago

I tested it, but changing the input size does not seem to work - at least I get an error when running the model. I assume it is not fully transformer-based but has some fully connected layers at the end.

I tried removing the padding of the text tokenizer, and I also moved the text encoding outside the loop (we only have to do it once, since we always use the same text), but to no avail; see the figure below.

(figure)
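The compute-once idea for the text side can be sketched generically. The `encode_text` stand-in below is hypothetical; in the real pipeline it would be `processor(text=prompts, return_tensors="pt")` plus the model's text encoder:

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive text-encoding step.
def encode_text(prompts):
    return [len(p) for p in prompts]

@lru_cache(maxsize=1)
def cached_text_features(prompts):
    # prompts must be a tuple (hashable) so lru_cache can key on it
    return encode_text(prompts)

PROMPTS = ("a photo of a mosquito",)
features = cached_text_features(PROMPTS)        # computed on first call
features_again = cached_text_features(PROMPTS)  # served from cache
```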

HCA97 commented 1 year ago

I did another submission with only Owl-ViT for classification, where it just outputs the same class, and it still failed. I think even if we change the size, this issue will persist. But caching the text tokens makes sense.

https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/82

HCA97 commented 1 year ago

Maybe the image needs to be divisible by 32?

fkemeth commented 1 year ago

Maybe the image needs to be divisible by 32?

No, I think it needs exactly the input size of 768, other multiples of 32 did not work for me.

fkemeth commented 1 year ago

Do you think it might be worth converting the owl-vit to ONNX?

https://www.kaggle.com/code/ivanpan/pytorch-clip-onnx-to-speed-up-inference

HCA97 commented 1 year ago

I say don't bother with it. I tried exporting the YOLO models to other frameworks (ONNX, OpenVINO, etc.) (https://github.com/HCA97/Mosquito-Classifiction/issues/5), which they claim are faster. All of them were slower than PyTorch 2.0.1, even with fp16.

HCA97 commented 1 year ago


We get weird runtime errors :(