ver217 opened this issue 2 years ago
Have you tried disabling flattening? Also, there is a version of FSDP in pytorch as prototype as well. Can you give that version a try?
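(For anyone landing here: a minimal sketch of how flattening is toggled when wrapping with FairScale's FSDP; the model and settings below are placeholders, not the reporter's actual setup.)

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(512, 512).cuda()

# flatten_parameters=True (the default) coalesces all params into one flat
# buffer for speed; flatten_parameters=False keeps the original tensors.
sharded = FSDP(model, flatten_parameters=False, mixed_precision=True)
```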
Hi, I tried disabling flattening and this solved my problem. However, I wonder why I can't enable flattening.
I think unfortunately there is a bug somewhere with flattening. I won't know exactly where until this issue is properly debugged.
Hi, I'm trying to debug this issue and found that activation checkpointing combined with mixed precision leads to it. I notice the comments say we get two different gradient-accumulation objects in mixed-precision mode. I also find that the backward post hook is registered but never triggered for parameters inside a checkpointed region.
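(For context, a rough sketch of the hook registration pattern in question, simplified from what FairScale does: the post-backward hook goes on the parameter's AccumulateGrad node rather than on the parameter itself, so it fires after the gradient has been written.)

```python
import torch

param = torch.nn.Parameter(torch.randn(4, 4))

# Reach the AccumulateGrad node through a throwaway autograd view of the
# param; hooks on this node fire once the gradient has been accumulated.
tmp = param.expand_as(param)
grad_acc = tmp.grad_fn.next_functions[0][0]

def post_backward_hook(*unused):
    print("post-backward, param.grad set:", param.grad is not None)

grad_acc.register_hook(post_backward_hook)

(param * 2).sum().backward()  # prints: post-backward, param.grad set: True
```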
Oh, that's cool! It is indeed an issue that we have never resolved in a good way. cc @zhaojuanmao too.
If you have a small reproducible test case, that'd be great.
Also, please try different pytorch versions. Maybe it will behave differently across different versions?
OK, I will try different pytorch versions and test more cases. However, I tried registering the hook directly on the parameter instead of on the grad accumulator object, and the hook is triggered normally. Could you tell me why you don't just register the hook on the parameter?
If I recall correctly, we want the hook to fire after the gradient is computed. If you register the hook on the parameters, does it fire at the right time, i.e. after the gradient is computed?
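(To illustrate the difference under discussion, a small self-contained sketch: `Tensor.register_hook` fires with the freshly computed gradient as its argument, but before that gradient is accumulated into `.grad`, which is why FSDP hooks the accumulator node instead.)

```python
import torch

p = torch.nn.Parameter(torch.ones(3))

def tensor_hook(grad):
    # The gradient is computed (passed in as `grad`), but it has not yet
    # been written into p.grad at this point.
    print("tensor hook, p.grad is None:", p.grad is None)
    return grad

p.register_hook(tensor_hook)
(p * 3).sum().backward()  # prints: tensor hook, p.grad is None: True
```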
Hi @ver217, would you mind sharing a bit about how you are modifying the source code of `transformers` to use `checkpoint_wrapper`?
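(Not the original poster, but one plausible way to do this, as an untested sketch: wrap each GPT2 block in place instead of editing the `transformers` source. Assumes the standard `GPT2LMHeadModel` layout where the blocks live in `model.transformer.h`.)

```python
from fairscale.nn import checkpoint_wrapper
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Swap each transformer block for an activation-checkpointed version.
for i, block in enumerate(model.transformer.h):
    model.transformer.h[i] = checkpoint_wrapper(block)
```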
@min-xu-ai Have we run into this issue before? Do you remember the context?
@ver217 Friendly reminder to share a small repro if possible to help us fix this.
I'm also experiencing this issue:

```
  File "/fsx/users/hack/fairscale/fairscale/nn/model_parallel/layers.py", line 290, in forward
    output_parallel = F.linear(input_parallel, self.weight, self.bias)
RuntimeError: setStorage: sizes [512, 512], strides [1, 512], storage offset 70686720, and itemsize 2 requiring a storage size of 141897728 are out of bounds for storage of size 0
```
Setting `flatten_parameters: False` mitigates the problem, but it would be good not to leave performance on the table :)
@min-xu-ai @anj-s A repro can be found at P521821959 (meta only)
@edward-io, have you tried the pytorch version of FSDP? It likely has better performance already in the flattened case. @zhaojuanmao
@min-xu-ai thanks for the recommendation! I've used the pytorch distributed FSDP, but haven't tried it with model parallel yet.
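(For reference, a minimal sketch of the PyTorch-native FSDP wrapping mentioned above, available in recent PyTorch releases; process-group setup omitted.)

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(512, 512).cuda()
sharded = FSDP(model)  # the native version manages flattening internally
```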
I have a similar issue:

```
setStorage: sizes [144, 1940], strides [1, 144], storage offset 1881800, and itemsize 4 requiring a storage size of 8644640 are out of bounds for storage of size 0
```
This is with the `fsdp_native` strategy. However, there is no `flatten_parameters` option for this one?
Are you using the pytorch version or the FairScale version?
This is with the PyTorch version -> `fsdp_native`.
I see. Can you please open an issue with the pytorch team if you haven't? It is better to have a small reproduction for people to debug it with.
Oops, yes, sorry.
I tried the fairscale one now and get "mat2 must be a matrix, got 1-D tensor" with `flatten_parameters: False`. Strange. Will investigate further.
No worries. The error you are getting is probably because your code tries to use params in a matmul outside of the forward function call. Outside of the forward call, the params your module originally had are flattened and are 1-D tensors.
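(A hedged sketch of a workaround for that case: FairScale's FSDP provides a `summon_full_params` context manager that temporarily materializes the original-shaped parameters outside of forward.)

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Hypothetical setup; assumes a process group is already initialized.
sharded = FSDP(torch.nn.Linear(8, 8).cuda(), flatten_parameters=True)

# Outside forward, params live in a flat 1-D buffer. Inside this context
# the original 2-D weight is materialized again, so matmuls work.
with sharded.summon_full_params():
    print(sharded.module.weight.shape)
```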
I get the exact same stack, from set_rng_state to set_state plus an illegal memory access, with a different app (Stable Diffusion). How does one disable flattening?
I'm trying to train GPT2 with FSDP. My environment is below.

- PyTorch: 1.10.0+cu113
- Fairscale: 0.4.5
- transformers: 4.16.2
- Tesla A100 x8

When I set `CUDA_LAUNCH_BLOCKING=1`, I got:

When `CUDA_LAUNCH_BLOCKING` was not set, I got:

I train my model like:

The GPT2 provided by `transformers` uses torch's checkpoint. I also tried to use fairscale's `checkpoint_wrapper` by modifying the source code of `transformers`. However, I still got the error. Could you help me figure out this problem?
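For completeness, a rough reconstruction of the kind of training setup described, not the poster's actual script: FSDP with mixed precision plus activation checkpointing on GPT2, the combination identified above as triggering the bug.

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from transformers import GPT2LMHeadModel

torch.distributed.init_process_group("nccl")
torch.cuda.set_device(torch.distributed.get_rank())

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
model.gradient_checkpointing_enable()  # torch.utils.checkpoint inside GPT2

# mixed_precision + checkpointing + flatten_parameters=True is the failing
# combination; flatten_parameters=False works around it.
model = FSDP(model, mixed_precision=True, flatten_parameters=True)

input_ids = torch.randint(0, 50257, (2, 128)).cuda()
loss = model(input_ids, labels=input_ids).loss
loss.backward()
```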