riccardofelluga opened 1 month ago
With the fix in #1136, the test does not trigger as expected. The key error needs more investigation; see #1013.
The CI error seems related to the torch nightly build.
Not sure why, but for me the early exit test also seems to succeed on main. Is that intended? Could we try to have a more representative repro? I found this while checking whether #1164 would also resolve this.
I cannot reproduce the issue with

```
pytest thunder/benchmarks/targets.py -k "test_nanogpt_cross_entropy[forward-thunder]"
```

on current master, so I'm not quite sure what we're fixing here. (Note: the `-k` expression is quoted here so the shell doesn't treat the brackets as a glob pattern.)
@t-vi thanks for testing, let me update the test script 👍
The issue still happens; it can be observed by running:

```
thunder/benchmarks/benchmark_litgpt.py --model_name Llama-3-70B --n_layers 4 --compile thunder --checkpoint_activations True
```

I'll be updating the test to reflect this.
Before submitting
- [x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
- [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/pytorch-lightning/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?
Fixes #1013 by exiting early. What happened after #899 is that sometimes the cut information included tensors that didn't need recomputation; such tensors would then not appear in the `proxy_order` dictionary, because they were never encountered by the `order_proxies` function in the sub-symbols list. This is probably due to how the cut is decided between the regions, and it requires further investigation to fully understand. This fix resolves the key error and also saves computation: since there is nothing to rematerialize, the trace does not need to be changed.

Did you have fun?
This was actually particularly interesting to debug. Thankfully we had the test to validate results and inspect the cut.