Explore usage of vision transformers

I researched vision transformers and implemented the code from this tutorial https://medium.com/mlearning-ai/vision-transformers-from-scratch-pytorch-a-step-by-step-guide-96c3313c2e0c, but the results were poor because the model overfit heavily. One of the papers I read says this architecture needs huge amounts of data, even more than CNNs. Our current dataset has 10k images, which clearly isn't sufficient. Should I try a larger dataset, say 100k images, or is this network overkill for what we are trying to do?
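To put rough numbers on the data question, here is a back-of-the-envelope sketch that counts the trainable parameters of a standard ViT encoder and divides by the dataset size. The function name `vit_param_count` and every hyperparameter value below (64x64 RGB input, 8x8 patches, width 128, 6 blocks, 10 classes) are hypothetical placeholders, not the tutorial's or this project's actual configuration. The layer layout assumed is the usual one: a linear patch projection, a class token, learned position embeddings, and pre-norm blocks with biased projections.

```python
def vit_param_count(image_size, patch_size, channels, dim, depth, mlp_dim, num_classes):
    """Approximate trainable parameters of a standard ViT (assumed layout)."""
    num_patches = (image_size // patch_size) ** 2
    # Linear projection of each flattened patch to the model width.
    patch_embed = (patch_size * patch_size * channels) * dim + dim
    # One class token plus a learned position embedding per token.
    cls_and_pos = dim + (num_patches + 1) * dim
    # Q, K, V and output projections, each dim x dim with a bias.
    attention = 4 * dim * dim + 4 * dim
    # Two-layer feed-forward network with biases.
    mlp = 2 * dim * mlp_dim + mlp_dim + dim
    # Two LayerNorms per block, each with a scale and a shift.
    norms = 2 * 2 * dim
    block = attention + mlp + norms
    # Final LayerNorm followed by the linear classification head.
    head = 2 * dim + dim * num_classes + num_classes
    return patch_embed + cls_and_pos + depth * block + head


# Hypothetical small configuration, chosen only for illustration.
params = vit_param_count(image_size=64, patch_size=8, channels=3,
                         dim=128, depth=6, mlp_dim=256, num_classes=10)

for num_images in (10_000, 100_000):
    print(f"{num_images} images: {params / num_images:.1f} parameters per image")
```

The ratio is only a crude indicator: it does not say whether 100k images would be enough, but it makes it easy to see how shrinking `dim` or `depth` changes the model size relative to the data we have.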

AutoMecUA / AutoMec-AD #195