Closed denisbeslic closed 1 month ago
The error is not caused by the export. It does not happen when running with --noise-sampler False and --noise-std -1.0, so it has something to do with the non-zero masks and the noise addition.
It was caused by the new mask, `non_zero_mask = torch.nonzero(prediction)`. It will be replaced with the old mask, `non_zero_mask = prediction != 0`.
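For anyone hitting the same thing, here is a minimal sketch of why the two masks are not interchangeable. The tensor values and the way the mask is used for indexing are assumptions for illustration only, not the actual seq2squiggle code. `torch.nonzero` returns an integer tensor of coordinates, while `prediction != 0` returns a boolean mask with the same shape as `prediction`. Using the coordinate tensor directly as an index can go out of bounds, which surfaces as an `IndexError` on CPU and as a device-side assert on CUDA.

```python
import torch

# Hypothetical prediction tensor (batch of 2, length 4); values are made up.
prediction = torch.tensor([[0.0, 1.5, 0.0, 2.0],
                           [3.0, 0.0, 0.0, 4.0]])

# Coordinates of non-zero entries: int64 tensor of shape (4, 2).
idx = torch.nonzero(prediction)

# Boolean mask with the same shape as prediction: shape (2, 4).
mask = prediction != 0

# With the boolean mask, noise is added only to the non-zero entries.
noisy = prediction.clone()
noise = torch.full((int(mask.sum()),), 0.1)  # assumed constant noise for the demo
noisy[mask] = noisy[mask] + noise

# Using the coordinate tensor as an index treats every value as a row index.
# Column index 3 is out of range for dimension 0 (size 2), so this fails.
try:
    prediction[idx]
    index_error = False
except IndexError:
    index_error = True

print(idx.shape, mask.shape, index_error)
```

If the coordinates from `torch.nonzero` are really needed, `torch.nonzero(prediction, as_tuple=True)` returns one index tensor per dimension, which can be used for indexing in the same way as the boolean mask.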
The error occurred with seq2squiggle (dev-branch) in prediction mode, triggering a "device-side assert" in CUDA. This caused the process to fail, with an error message indicating that the index was out of bounds. The error stack trace suggests a problem during the model's prediction loop, where a tensor operation failed, leading to the RuntimeError.
This issue began after switching from numpy-based CPU calculations to fully using PyTorch. The error specifically occurs during the export_and_clear_results function, which is triggered at the end of an epoch. It only happens when running with at least -n 20,000.
Full output: