Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.

Proxy order KeyError fix #1067

Open · riccardofelluga opened 1 month ago

riccardofelluga commented 1 month ago
Before submitting
- [x] Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
- [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/pytorch-lightning/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #1013 with an early exit. After #899, the cut information sometimes included tensors that did not need recompilation; those tensors never appeared in the proxy_order dictionary because the function order_proxies never encountered them in the sub-symbols list. This is probably due to how the cut is decided between the regions, and it requires further investigation to fully understand. The early exit resolves the KeyError and also saves computation: when there is nothing to rematerialize, the trace does not need to be changed.
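For context, the shape of the change is roughly as follows. This is only a minimal sketch of the early-exit guard, not Thunder's exact code; the function name and its arguments are illustrative, assuming a rematerialization pass that looks up each cut proxy in a proxy_order dict built by order_proxies:

def apply_rematerialization(trace, cut_proxies, proxy_order):
    # Early exit: if any proxy named in the cut was never seen by
    # order_proxies, there is nothing to rematerialize for this region,
    # so return the trace unchanged instead of hitting a KeyError below.
    if any(p.name not in proxy_order for p in cut_proxies):
        return trace

    # Otherwise proceed as before, e.g. ordering the cut by the recorded
    # proxy order and continuing with the usual rematerialization.
    ordered_cut = sorted(cut_proxies, key=lambda p: proxy_order[p.name])
    # ... the rest of the pass operates on ordered_cut ...
    return trace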

Did you have fun?

This was actually particularly interesting to debug. Thankfully we had the test:

pytest thunder/benchmarks/targets.py -k test_nanogpt_cross_entropy[forward-thunder]

to validate results and inspect the cut.

IvanYashchuk commented 2 weeks ago

#1136 fixes a different problem. We still need this fix.

riccardofelluga commented 2 weeks ago

With the fix from #1136 in place, the test no longer triggers as expected. The KeyError needs more investigation (#1013).

riccardofelluga commented 1 week ago

The CI error seems to be related to the torch nightly build.

t-vi commented 1 week ago

I'm not sure why, but for me the early-exit test also seems to pass on main. Is that intended? Could we have a more representative repro? I found this while trying to see whether #1164 would also resolve this.

t-vi commented 1 week ago

I cannot reproduce the issue with

pytest thunder/benchmarks/targets.py -k test_nanogpt_cross_entropy[forward-thunder]

on current master, so I'm not quite sure what we're fixing here.

riccardofelluga commented 6 days ago

@t-vi thanks for testing, let me update the test script 👍

riccardofelluga commented 4 days ago

The issue still happens; it can be observed by running:

python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-3-70B --n_layers 4 --compile thunder --checkpoint_activations True

I'll update the test to reflect this.