Updating this in case others run into a similar issue: the dataloaders were crashing because of invalid audio files within the dataset (FMA): https://github.com/mdeff/fma/wiki. In my case, the solution was to write a small script that pre-filters the audio files and checks for some basic validity (a minimum number of samples after loading into torchaudio).
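For reference, here is a minimal sketch of the kind of pre-filter script I mean. The directory names and the `MIN_SAMPLES` threshold are placeholders, and the exact criteria you need may differ depending on your dataset and model config:

```python
import shutil
from pathlib import Path

import torchaudio

# Placeholder paths and threshold -- adjust for your own dataset layout.
DATASET_DIR = Path("fma_small")
QUARANTINE_DIR = Path("fma_invalid")
MIN_SAMPLES = 16384  # minimum number of samples required to keep a file

QUARANTINE_DIR.mkdir(exist_ok=True)

for audio_path in sorted(DATASET_DIR.rglob("*.mp3")):
    try:
        waveform, sample_rate = torchaudio.load(str(audio_path))
    except Exception as err:
        # File is corrupt or undecodable -- move it out of the training set.
        print(f"Unreadable: {audio_path} ({err})")
        shutil.move(str(audio_path), QUARANTINE_DIR / audio_path.name)
        continue

    if waveform.numel() == 0 or waveform.shape[-1] < MIN_SAMPLES:
        # Empty or too short to be useful -- quarantine it as well.
        print(f"Too short: {audio_path} ({waveform.shape[-1]} samples)")
        shutil.move(str(audio_path), QUARANTINE_DIR / audio_path.name)
```

Running something like this over the whole dataset before training removed the files that were making the workers crash.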
I'm trying to run a simple test fine-tune on the stable-audio-1.0 checkpoint. For hardware, I have 2x A100 40GB and 128GB RAM. When I start training, it generates the first 3 examples without issue and usually continues for about 100-200 steps before erroring out. Some excerpts:
I have tried batch sizes ranging from 4 to 64, with a single GPU and with both GPUs engaged, and num-workers values between 2 and 16. The error occurs consistently each time. Any ideas on how I can fix this?