LambdaLabsML / examples

Deep Learning Examples
MIT License
805 stars 103 forks

Too slow on 2xA100 SXM4 #26

cihankaradogan closed 1 year ago

cihankaradogan commented 1 year ago

Hello, I started training on 2xA100 SXM4 following your tutorial, using the pokemon.yaml file. My dataset contains 1743 images, which I load via Hugging Face. Training has been running for 13 hours and the first epoch still hasn't finished. The log folder contains neither images generated from the validation texts nor a saved checkpoint. The tutorial says training takes 6 hours on 2xA6000; wouldn't you expect similar performance from the A100s?

[Screenshot 2022-10-20 14:36:45] [Screenshot 2022-10-20 14:36:36]