Open IvanYashchuk opened 1 day ago
@kshitij12345, could you please take a look? Is it Thunder's problem or PyTorch FX's problem?
I am seeing two separate failures in different environments (with more recent PyTorch), both of which seem to occur after the splitting phase. So I think this particular KeyError is coming from fx.split_module and has since been fixed in PyTorch core.
Env 1: In my local setup, with PyTorch version 2.5.0a0+git08b5e07, nvFuser version 0.2.13+git84e5d23, and thunder commit 30e4aa1e67005c58219d7f06b46836eedb74b27a, I am seeing:
  File "/home/kkalambarkar/lightning-thunder/thunder/core/trace_interpreter.py", line 63, in interpret_trace
    prim_func = symbol_mapper(symbol) if symbol_mapper is not None else symbol.sym
  File "/home/kkalambarkar/lightning-thunder/thunder/core/transforms.py", line 2514, in vjp_symbol_mapper
    raise NotImplementedError(f"VJP for {symbol.sym.id} is not implemented")
NotImplementedError: VJP for PrimIDs.COPY_WITH_SETITEM is not implemented
We already have an issue for this: https://github.com/Lightning-AI/lightning-thunder/issues/1240
Env 2 (with the latest versions of PyTorch, nvFuser, and thunder): Using the internal docker image with PyTorch version 2.6.0a0+gita777dea, nvFuser version 0.2.15+git7616b54, and thunder commit 30e4aa1e67005c58219d7f06b46836eedb74b27a, I am seeing an error from nvFuser. I will file an issue with nvFuser for it. EDIT: Issue filed at https://github.com/NVIDIA/Fuser/issues/3176
On the latest Thunder (dafc79d21c04769c5e9d1fb737b8cd21d0841e69) I tried the snippet from https://github.com/Lightning-AI/lightning-thunder/issues/1174#issuecomment-2383823134 and got the following error:
My PyTorch version is '2.4.0a0+git3827810'.
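Since the failures in this thread differ by environment, here is a minimal sketch for collecting the installed versions in one place when reporting or reproducing. The distribution names in `PACKAGES` are assumptions and may not match your install. For source builds, the reported string can also differ from what `torch.__version__` prints.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match your install.
PACKAGES = ("torch", "nvfuser", "lightning-thunder")


def collect_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package metadata not found in this environment.
            versions[name] = None
    return versions


def format_report(versions):
    """Render one 'name: version' line per package for pasting into an issue."""
    return "\n".join(
        f"{name}: {version or 'not installed'}"
        for name, version in versions.items()
    )


if __name__ == "__main__":
    print(format_report(collect_versions()))
```

This only reads installed package metadata, so it does not capture the thunder commit hash. That still has to be taken from the checkout itself.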