Open nousr opened 10 months ago
@awaelchli If help is still wanted please assign this issue to me. Have a bit of time to work on it.
Of course, @nik777, please go ahead, that would be great! Not to discourage you, of course, but I think it might be a hard one to solve :)
Any progress on this? Thanks so much!
I got
`SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f22c43552b0> returned NULL without setting an error`
when setting `accumulate_grad_batches = 2`, but I see nothing helpful in the log. The error goes away when I change to `DDPStrategy(static_graph=False,)`, or set `accumulate_grad_batches` back to 1, or set `batch_size=3` (total `len(data) = 9`). I wonder if there is some conflict between `DDPStrategy.static_graph=True`, `accumulate_grad_batches` and `batch_size`.
I want to keep `static_graph=True` because I am using `.gradient_checkpointing_enable()`. Any help would be appreciated.
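To make the interaction between `len(data)`, `batch_size` and `accumulate_grad_batches` easier to reason about, here is a rough sketch of how many batches would fall into each gradient-accumulation window on one rank. It is not the reproduction code from this report and does not establish the cause of the error. The helper `accumulation_windows` and its `world_size` parameter are hypothetical, and it assumes that neither the sampler nor the dataloader drops the last partial chunk and that leftover batches at the end of an epoch form a shorter final window.

```python
import math


def accumulation_windows(num_samples, batch_size, accumulate_grad_batches, world_size=1):
    """Return the number of batches in each accumulation window for one rank.

    Hypothetical helper for illustration only; not part of Lightning or PyTorch.
    """
    # Assumption: a distributed sampler with drop_last=False gives every rank
    # ceil(num_samples / world_size) samples (padding if necessary).
    samples_per_rank = math.ceil(num_samples / world_size)
    # Assumption: the dataloader keeps a final, smaller batch (drop_last=False).
    batches_per_rank = math.ceil(samples_per_rank / batch_size)
    # Split the batches into full windows plus an optional shorter last window.
    full, leftover = divmod(batches_per_rank, accumulate_grad_batches)
    windows = [accumulate_grad_batches] * full
    if leftover:
        windows.append(leftover)
    return windows


# len(data) = 9 as in the report; the batch sizes here are illustrative.
for batch_size in (1, 2, 3):
    print(batch_size, accumulation_windows(9, batch_size, accumulate_grad_batches=2))
```

Under these assumptions, a window shorter than `accumulate_grad_batches` means the number of backward passes between optimizer steps is not the same for every step. Whether that matters for `static_graph=True` is exactly the open question in this thread.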
Minimal code to reproduce the error:
Environment:
Originally posted by @iamlockelightning in https://github.com/Lightning-AI/pytorch-lightning/discussions/18080
I'm also observing this issue in the latest version of pytorch-lightning (2.1.3)
cc @justusschock @awaelchli