Closed gordon-lim closed 4 months ago
After looking it over, you are right about both the dataset size and the conclusion about potential bias. I must have just expected the number of samples in v3 of that project to be the same as v2 since (I thought) I was using the same dataset from the same data source.
Thank you for bringing that to my attention! I made the appropriate changes and removed v3 since it does introduce bias. Please use v2 instead. Sorry for the inconvenience.
Respectfully, BD
Your dataset has twice as many examples as the original Tobacco3482 dataset downloaded from Kaggle. When I downloaded the dataset from Kaggle, there was a copy of the `Tobacco3482-jpg` directory within the `Tobacco3482-jpg` directory itself, so it's likely that you had duplicates. Since `train_test_split` is random, it's likely that you were testing on training data, so your results are unfortunately likely biased.
Edit: I looked at your v2 and it correctly has 3482 examples. So the train-test overlap likely explains the performance improvement.
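
For anyone who hits the same nested-directory problem, here is a minimal sketch of how the duplication and the resulting train-test overlap could be checked. It is not the code from this project: the helper names (`file_digest`, `group_by_content`, `unique_files`, `split_overlap`) are hypothetical, and it assumes the duplicates are byte-identical copies, so re-encoded or resized images would not be caught.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_by_content(root):
    """Map each content digest to every file under root with that content."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return dict(groups)


def unique_files(root):
    """Keep one path per distinct file content (the first in sorted order)."""
    return [paths[0] for paths in group_by_content(root).values()]


def split_overlap(train_paths, test_paths):
    """Return the digests of file contents that appear in both splits."""
    train = {file_digest(p) for p in train_paths}
    test = {file_digest(p) for p in test_paths}
    return train & test
```

Running `unique_files` on the downloaded directory before calling `train_test_split` should leave one path per image, and a non-empty result from `split_overlap` on an existing split indicates that some test images were also seen during training.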