Closed — opened by serg06, closed 3 years ago
I think the mixed precision training isn't playing well with the LSTM call. Can you try wrapping its context using `with autocast(enabled=False):`?
@kevjshih I tried this:
```python
self.lstm.flatten_parameters()
with amp.autocast(enabled=False):
    outputs, _ = self.lstm(x)
```
and this:
```python
with amp.autocast(enabled=False):
    self.lstm.flatten_parameters()
    outputs, _ = self.lstm(x)
```
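(For reference, a minimal sketch of the pattern being discussed, assuming the failure comes from float16 activations reaching the cuDNN LSTM kernel under AMP: autocast is disabled around the call and the input is explicitly cast back to float32. The `LSTMBlock` wrapper and its dimensions are placeholders; only `self.lstm`, `flatten_parameters()`, and the `autocast(enabled=False)` context come from the snippets above.)

```python
import torch
from torch import nn
from torch.cuda import amp

class LSTMBlock(nn.Module):
    """Hypothetical wrapper showing the autocast-disabled LSTM pattern."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        self.lstm.flatten_parameters()
        with amp.autocast(enabled=False):
            # Under AMP the incoming activations may be float16, so cast them
            # back to float32 before the cuDNN LSTM call.
            outputs, _ = self.lstm(x.float())
        return outputs
```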
Make sure to update to the latest version of PyTorch. Closing due to inactivity.
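(A quick way to confirm which versions the container is actually running before retrying; these are standard PyTorch attributes, not anything specific to this repo.)

```python
import torch

# Print the PyTorch / CUDA / cuDNN versions available inside the container.
print("torch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```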
System:
Ubuntu 18.04
GTX 1070
Running with nvidia-docker
Problem:
Whenever I try to run `train.py` with `"fp16_run": true`, it fails immediately on the first epoch.

What I've tried:
Workarounds:
- `"fp16_run": false`: Works successfully. (Though it still eventually hits `CUDNN_STATUS_EXECUTION_FAILED` about 1,000 steps in.)

Debugging:
- Printed the inputs to `model()` right before the error occurs, with fp16 both enabled and disabled. Both print statements looked the same, with nothing obvious out of the ordinary.
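(A sketch of the kind of check described above, for comparing what reaches `model()` with fp16 enabled versus disabled; `model`, `batch`, and `describe_inputs` are placeholder names, not part of this repo.)

```python
import torch

def describe_inputs(tag, *tensors):
    """Print dtype, shape, and basic stats for each tensor argument so a run
    with fp16 enabled can be compared against one with it disabled."""
    for i, t in enumerate(tensors):
        if not torch.is_tensor(t):
            continue
        stats = f"min={t.min().item():.4g}, max={t.max().item():.4g}"
        if t.is_floating_point():
            stats += f", nan={torch.isnan(t).any().item()}"
        print(f"[{tag}] arg {i}: dtype={t.dtype}, shape={tuple(t.shape)}, {stats}")

# e.g. right before the call that fails:
# describe_inputs("fp16_run=true", *batch)
# outputs = model(*batch)
```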