Segmentation Faults on Dataloaders

I'm trying to run a simple test finetune on the stable-audio-1.0 checkpoint. For hardware I have 2x A100 40GB and 128GB RAM. When I initialize training, it generates the first 3 examples without issue, and usually continues for about 100-200 steps before erroring out. Some excerpts:

Epoch 0:  15%|██████████▌                                                           | 151/1000 [02:30<14:07,  1.00it/s, v_num=a5zk, train/loss=0.656, train/std_data=0.978, train/lr=3.9e-5, train/mse_loss=0.656]
ERROR: Unexpected segmentation fault encountered in worker.                                                                                                                                                       
RuntimeError: DataLoader worker (pid 466777) is killed by signal: Segmentation fault.
...
[rank1]:   File "/opt/conda/envs/stable/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank1]:     _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 466777) is killed by signal: Segmentation fault.

I have tried batch sizes ranging from 4 - 64, with single and both GPU's engaged, and num-workers values between 2 - 16. The error occurs consistently each time. Any ideas on how I can fix this?

Stability-AI / stable-audio-tools

Segmentation Faults on Dataloaders #117