DunnBC22 / Vision_Audio_and_Multimodal_Projects

This repository includes all computer vision, audio, document AI, and multimodal projects.

Your Tobacco3482 dataset has 2x3482 examples (Tobacco Dataset & DiT Transformer Project_v3.ipynb) #3

Closed gordon-lim closed 4 months ago

gordon-lim commented 4 months ago

Your dataset has twice as many examples as the original Tobacco3482 dataset downloaded from Kaggle. When I downloaded the dataset from Kaggle, there was a copy of the Tobacco3482-jpg directory inside the Tobacco3482-jpg directory itself, so it's likely that you had duplicates. Since train_test_split is random, copies of the same image probably ended up in both the training and test sets, so your results are unfortunately likely biased.
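One way to guard against the nested-directory duplicates described above is to deduplicate by file content before splitting. This is only a sketch, not the notebook's actual code: `file_digest` and `unique_images` are hypothetical helper names, the `*.jpg` pattern is an assumption about the file extension, and it only catches byte-identical copies (which is what a duplicated directory would produce).

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def unique_images(root: Path, pattern: str = "*.jpg") -> list[Path]:
    """Keep one path per distinct file content found anywhere under root."""
    seen: set[str] = set()
    kept: list[Path] = []
    # rglob walks nested directories too, so an inner copy of the
    # dataset directory is visited and its files are skipped as repeats.
    for path in sorted(root.rglob(pattern)):
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

If the duplication is what I think it is, `len(unique_images(Path("Tobacco3482-jpg")))` should come out to 3482 rather than twice that, and the resulting list can be passed to train_test_split in place of the raw file listing.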

Edit: I looked at your v2 and it correctly has 3482 examples. So the train-test overlap likely explains the performance improvement in v3.
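To confirm an overlap like this on an existing split, one could compare the file contents of the two sets directly. Again, this is a rough sketch under assumptions: `shared_content` is a hypothetical helper, and it assumes the split is available as two lists of image paths.

```python
import hashlib
from pathlib import Path


def shared_content(train_paths, test_paths):
    """Return the test paths whose bytes also appear in the training set."""
    train_digests = {
        hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in train_paths
    }
    return [
        p
        for p in test_paths
        if hashlib.sha256(Path(p).read_bytes()).hexdigest() in train_digests
    ]
```

An empty result means no test image is a byte-identical copy of a training image; a non-empty one lists the leaked test examples.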

DunnBC22 commented 4 months ago

After looking it over, you are right about both the dataset size and the conclusion about potential bias. I must have just expected the number of samples in v3 of that project to be the same as v2 since (I thought) I was using the same dataset from the same data source.

Thank you for bringing that to my attention! I made the appropriate changes and removed v3 since it does introduce bias. Please use v2 instead. Sorry for the inconvenience.

Respectfully, BD