YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Process Terminated during Finetuning #30

Open Jozdien opened 2 years ago

Jozdien commented 2 years ago

I was trying to finetune the AudioSet-pretrained model on a very small dataset as a test. At first the process was simply killed with "Out of memory" in the log, but after I moved to a larger system, it ran for longer before failing with this error:

Traceback (most recent call last):
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/home/ubuntu/ast_conv/src/traintest.py", line 220, in train
    torch.save(audio_model.state_dict(), "%s/models/audio_model.%d.pth" % (exp_dir, epoch))
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
    return
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f20ac5b47a7 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x24e10c0 (0x7f20f14190c0 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x24dc69c (0x7f20f141469c in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x9a (0x7f20f1419afa in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f20f1419d83 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x1a5 (0x7f20f141a075 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xa7ffe3 (0x7f2103160fe3 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x4ff188 (0x7f2102be0188 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50048e (0x7f2102be148e in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: python() [0x5cf938]
frame #10: python() [0x52cae8]
frame #11: python() [0x52cb32]
frame #12: python() [0x52cb32]
<omitting python frames>
frame #17: python() [0x654354]
frame #19: __libc_start_main + 0xe7 (0x7f21079dcbf7 in /lib/x86_64-linux-gnu/libc.so.6)

run.sh: line 46:  1703 Aborted                 (core dumped) CUDA_CACHE_DISABLE=1 python -W ignore ../../src/run.py --model ${model} --dataset ${dataset} --data-train ${tr_data} --data-val ${val_data} --exp-dir $exp_dir --label-csv ./data/class_labels_indices.csv --n_class 3 --lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model True --freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} --tstride $tstride --fstride $fstride --imagenet_pretrain $imagenetpretrain --audioset_pretrain $audiosetpretrain > $exp_dir/log.txt

As far as I can tell, that OSError can indicate that a file-size limit has been exceeded, not just that the disk itself is full. I haven't changed traintest.py except to add an elif branch for my finetuning dataset. Did you run into this error while finetuning, or do you have a sense of what's causing it?
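For reference, here is a minimal guard one could wrap around the `torch.save` call in traintest.py to fail early with a clear message instead of corrupting the checkpoint mid-write. This is a hypothetical helper, not part of the repo; `save_fn` would be `torch.save` here, and the 2 GiB threshold is an arbitrary assumption:

```python
import os
import shutil
import tempfile


def save_checkpoint_safely(save_fn, obj, path, min_free_bytes=2 * 1024**3):
    """Fail early if the disk is nearly full, and never leave a truncated
    file behind: write to a temp file in the same directory, then rename.

    save_fn is any callable with the signature save_fn(obj, path),
    e.g. torch.save.
    """
    out_dir = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(out_dir).free
    if free < min_free_bytes:
        raise OSError(28, f"only {free // 1024**2} MiB free in {out_dir}; "
                          "clean up old checkpoints before saving")
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Remove the partial temp file so a failed save leaves nothing behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this, a nearly-full disk raises a readable OSError before any bytes are written, rather than the `inline_container.cc` abort above.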

YuanGongND commented 2 years ago

It doesn't seem to be a problem with the code. Can you check how much space you have on your disk? In traintest.py we save the output predictions of each epoch, which can take a few GBs over the course of training, depending on how large your test set is.

Jozdien commented 2 years ago

I don't think it's disk space, because I'm testing on a very small dataset (<20 samples). Could something be writing a large amount of data to one particular file, with per-file size limits that vary between systems (very speculative)?
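The two cases can be told apart from Python. Strictly speaking, Errno 28 (ENOSPC) means the filesystem is full; exceeding the per-process file-size limit (`ulimit -f`) gives EFBIG instead, but checking both is cheap. A hedged sketch (Unix-only, since it uses the stdlib `resource` module; `diagnose_space` is a made-up name):

```python
import resource
import shutil


def diagnose_space(path="."):
    """Report the per-process file-size limit and the free disk space
    at `path`, the two things that can make a large save fail."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    limit = ("unlimited" if soft == resource.RLIM_INFINITY
             else f"{soft} bytes")
    free = shutil.disk_usage(path).free
    print(f"per-file size limit (ulimit -f): {limit}")
    print(f"free disk at {path}: {free // 1024**2} MiB")
    return soft, free
```

Running this in the experiment directory right before the crash point would show whether either limit is anywhere near the ~322 MB checkpoint size in the traceback.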

YuanGongND commented 2 years ago

I see. I would suggest running the ESC-50 recipe and seeing whether the same error occurs - it is fast and easy to run. If you still see the error, check your OS; otherwise, check whether your modification is correct.

traintest.py does save prediction files, which can be large, but since you have only 20 samples, that is unlikely to be the cause.
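Since both the per-epoch checkpoints (`audio_model.<epoch>.pth`) and the per-epoch prediction files accumulate for the whole run, one workaround on a small disk is to prune all but the newest few after each epoch. A minimal sketch (the helper name and pattern are illustrative, not from the repo):

```python
import glob
import os


def keep_latest(pattern, n_keep=1):
    """Delete all but the newest n_keep files matching `pattern`,
    returning the paths that were removed. Bounds the disk usage of
    per-epoch artifacts to roughly n_keep checkpoints' worth."""
    files = sorted(glob.glob(pattern), key=os.path.getmtime)
    removed = files[:-n_keep] if n_keep else files
    for f in removed:
        os.remove(f)
    return removed
```

Called as e.g. `keep_latest(exp_dir + "/models/audio_model.*.pth", n_keep=2)` at the end of each epoch, this caps the checkpoint footprint at two epochs' worth regardless of run length.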

Jozdien commented 2 years ago

Yep, I got the same error, although this time much later in training (ESC-50 ran for about 16 epochs before halting; on my own dataset I don't recall it making any progress). Just so I know what specifications to use, how much disk space do you recommend?