Hi @HCA97
I created the bounding boxes using the Owl-ViT model. The version I use, google/owlvit-base-patch32, uses the ViT-B/32 CLIP image encoder. I get mixed results on the training data, with a mean IoU of 0.76 and a median of 0.81. I stored the bounding boxes in the Kaggle Output folder under /kaggle/working/owl_vit_image_bboxes.csv.
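For reference, extracting those boxes with the transformers library looks roughly like this (a minimal sketch; the file path and text prompt are assumptions, not our exact setup):

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()

image = Image.open("train/example.jpeg").convert("RGB")  # assumed path
texts = [["a photo of a mosquito"]]  # assumed prompt

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes back to the original image size and keep the best one.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
if len(results["scores"]) > 0:
    best = results["scores"].argmax()
    print(results["boxes"][best], results["scores"][best])  # xmin, ymin, xmax, ymax
```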
Below are some examples where the IoU is below 0.1:
What I learned from that analysis is that Owl-ViT produces some misclassifications. Some of these we may be able to avoid with rule-based postprocessing of the candidate boxes (e.g. don't use boxes that are as large as the image itself); see the sketch below.
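A candidate-box filter along those lines might look like this (the 0.9 area-ratio cutoff is an arbitrary placeholder, not a tuned value):

```python
def filter_candidate_boxes(boxes, scores, img_w, img_h, max_area_ratio=0.9):
    """Drop boxes that cover (almost) the whole image; boxes are (xmin, ymin, xmax, ymax)."""
    kept_boxes, kept_scores = [], []
    for box, score in zip(boxes, scores):
        xmin, ymin, xmax, ymax = box
        area_ratio = ((xmax - xmin) * (ymax - ymin)) / (img_w * img_h)
        if area_ratio <= max_area_ratio:
            kept_boxes.append(box)
            kept_scores.append(score)
    return kept_boxes, kept_scores
```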
Also, there are still some images with multiple mosquitoes in the training data. We should use all cutouts for training the classifier, in particular for the under-represented classes. That way we may increase our training data a bit.
They said there are no images with more than one mosquito in the test data, but it might still be worth testing a submission with the Owl-ViT bounding boxes. What do you think?
I am open to Owl-ViT. If it performs as well as the YOLO model, we can eliminate training one model, simplifying our workflow.
Could you try to make a submission in GitLab? I did so many submissions there that the GitLab repo might be too large to download now :P I think I already shared the repo with you; if not, or if you cannot, I can do the submission too - I am fine with both options. If you create a submission using the GitLab repo, could you make a branch for yourself so we don't get conflicts in the future? Pulling main already takes a bit of time.
ViT-B/32 should run about as fast as the YOLOv8-s model, so there should be no problem with inference speed.
You are right, the repo is too large for me to clone it.
If you want, you can make a submission - the fine-tuned CLIP model I mentioned earlier is in the Kaggle notebook (there is a download link in the 2nd cell). But I can also do the submission once I am back from travel.
I will also check if I can optimize the Owl-ViT bounding box selection a bit; I think it can be improved with some rule-based filtering of the candidate boxes.
I saw your experiments. I will try to submit your results today; hopefully they improve our score.
I attempted to use OWL-ViT as a replacement for YOLO, but I keep encountering an `Inference failed` error. Initially, I suspected a runtime error due to OWL-ViT's slower processing compared to the YOLO model: OWL-ViT takes approximately 600-700 ms per image, while YOLO takes 150-300 ms.
However, even when I attempted to use ViT-B-16 for classification (to reduce the classification time), I encountered the same error.
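For reference, per-image timings like the ones above can be measured along these lines (a sketch; `model` and `inputs` stand in for the actual pipeline):

```python
import time
import torch

def time_inference(model, inputs, n_runs=20):
    """Rough per-call latency in milliseconds, averaged over n_runs."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure pending GPU work is not billed to us
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_runs):
            model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last kernel before reading the clock
    return (time.perf_counter() - start) / n_runs * 1000
```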
For ViT-B-16 submissions, you can find the details in the following links:
Additionally, for ViT-L-14 submissions, you can refer to this link:
Upon reviewing the debug logs, I came across the following messages. I'm not entirely sure if this is the root cause, since the validation step passes but the prediction step fails:
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/preprocessor_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f536769a020>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: eaaf23cc-7807-476a-9b3e-ea79c26a9be0)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/preprocessor_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367741420>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: b903e41b-2ea9-4518-8ee0-b4e900dee794)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/tokenizer_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367698100>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: d8ac13ee-a131-40db-98d2-19ddcc584737)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /google/owlvit-base-patch32/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5367741990>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 19dfa170-681f-48ef-8061-fee9dcb01156)')' thrown while requesting HEAD https://huggingface.co/google/owlvit-base-patch32/resolve/main/config.json
It appears that Huggingface is attempting to load the model from the Hub, but that doesn't make sense, since the model is already cached.
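One thing that might be worth trying is forcing the offline code path, so transformers never issues those HEAD requests (TRANSFORMERS_OFFLINE, HF_HUB_OFFLINE, and local_files_only are standard transformers/huggingface_hub options):

```python
import os

# Must be set before transformers is imported to take full effect.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import OwlViTProcessor, OwlViTForObjectDetection

# local_files_only turns a missing cache into a hard error instead of a download attempt.
processor = OwlViTProcessor.from_pretrained(
    "google/owlvit-base-patch32", local_files_only=True
)
model = OwlViTForObjectDetection.from_pretrained(
    "google/owlvit-base-patch32", local_files_only=True
)
```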
@HCA97 that is weird. It seems there can be different reasons, see here:
What is confusing is that you have uploaded the models, so huggingface should not try to download them. Could it be that the path is wrong? (I checked, and the cache dir looks right to me.) Could it be a conflict in the huggingface version? The Stack Overflow post says to use a different requests version, but I rather think they have a firewall on their server that prevents huggingface from loading the model.
Definitely, I think there is no internet access. But what baffles me is that the validation step passes, and then after half an hour the prediction step fails. I feel like the issue is not caused by internet access, but by how unstable the inference time is in the VMs (https://discourse.aicrowd.com/t/submissions-are-quite-unstable/9095).
The path is correct, because I turned off my internet and ran the local evaluation script, and it didn't print the error messages. I will try to create an issue in the AIcrowd forum; maybe the organizers can help us. How secretive should I be?
I submitted another OWL-ViT submission; if it doesn't work this time, I don't know what to do.
This is the submission code: https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/blob/df1a88503c7d5b2d66ba6f011e8a7a95beffbadd/my_models/owl_vit_clip_model.py
Maybe we get boxes that are out of bounds, or the threshold of 0.01 leads to no boxes being detected. Those are things we may want to tune; see the sketch below. I will also look into improving the accuracy of the Owl-ViT model, hopefully surpassing the YOLO performance.
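Something like the following could cover both failure modes (a sketch; the full-image fallback is an assumption about what the evaluator accepts):

```python
def clamp_box(box, img_w, img_h):
    """Clip (xmin, ymin, xmax, ymax) to the image bounds."""
    xmin, ymin, xmax, ymax = box
    return (max(0, xmin), max(0, ymin), min(img_w, xmax), min(img_h, ymax))

def select_box(boxes, scores, img_w, img_h, threshold=0.01):
    """Best box above threshold, or the whole image as a fallback."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= threshold]
    if not candidates:
        return (0, 0, img_w, img_h)  # assumed fallback: full image
    best_score, best_box = max(candidates, key=lambda sb: sb[0])
    return clamp_box(best_box, img_w, img_h)
```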
Given the response from dipam, we should downsample the images if they are too large. What do you think?
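Something along these lines, maybe (the 1024 px cap is a placeholder, not a value dipam suggested):

```python
import cv2

def downsample_if_large(image, max_side=1024):
    """Resize so the longer side is at most max_side, keeping the aspect ratio."""
    h, w = image.shape[:2]
    scale = max_side / max(h, w)
    if scale >= 1.0:
        return image, 1.0
    resized = cv2.resize(
        image, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA
    )
    return resized, scale  # keep the scale to map boxes back to the original image
```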
I don't understand how the prediction step can be 100% slower than the validation step.
Do you think it is possible to change the input size of the model? Since it is a Transformer model, I don't know if we can simply change its input size. However, according to Huggingface, it seems like it is possible:
I tested it, but changing the input size does not seem to work - at least I get an error when running the model. I assume it is not fully transformer-based, but has some fully connected layers at the end.
I tried to remove the padding of the text tokenizer, and I also moved the text encoding outside the loop (we only have to do that once, since we always use the same text), but to no avail - see the figure below.
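For the CLIP classifier the same caching idea is straightforward, e.g. with open_clip (a sketch; the model tag and prompts are assumptions, not our fine-tuned checkpoint):

```python
import torch
import open_clip
from PIL import Image

# Model/pretrained tags are assumptions, not necessarily our checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

class_prompts = ["a photo of an albopictus mosquito",
                 "a photo of a culex mosquito"]  # assumed prompts

# Encode the class prompts ONCE, outside the per-image loop.
with torch.no_grad():
    text_features = model.encode_text(tokenizer(class_prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(image: Image.Image) -> int:
    """Per-image work is now just image encoding plus one matmul."""
    with torch.no_grad():
        image_features = model.encode_image(preprocess(image).unsqueeze(0))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        return (image_features @ text_features.T).argmax(dim=-1).item()
```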
I did another submission with only Owl-ViT (for classification it just outputs the same class), and it still failed. I think this issue will persist even if we change the size. But caching the text tokens makes sense.
https://gitlab.aicrowd.com/hca97/mosquitoalert-2023-phase2-starter-kit/-/issues/82
Maybe the image needs to be divisible by 32?
No, I think it needs exactly the input size of 768; other multiples of 32 did not work for me.
Do you think it might be worth converting Owl-ViT to ONNX?
https://www.kaggle.com/code/ivanpan/pytorch-clip-onnx-to-speed-up-inference
I say don't bother with it. I tried to export the YOLO models to other frameworks (ONNX, OpenVINO, etc.; https://github.com/HCA97/Mosquito-Classifiction/issues/5), which they claim are faster. All of them were slower than PyTorch 2.0.1, even with fp16.
We get weird runtime errors :(
Good Idea :)
Do you know what the inference time of Owl-ViT is? Is it a ViT-B model?