Handles non-contiguous input strides

vedaanta commented 1 week ago

Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements) - [x] Did you read the [contributor guideline](https://github.com/Lightning-AI/pytorch-lightning/blob/main/.github/CONTRIBUTING.md), Pull Request section? - [x] Did you make sure to update the docs? - [x] Did you write any new necessary tests?

What does this PR do?

Earlier non-contiguous input strides were being overwritten during actual execution. With this change, correct strides from torch tensor will be passed to make cudnn graphs.

The main change in this PR is both fwd and bwd cudnn graph makers take in explicit input strides.

If in execution phase, directly use torch.Tensor.stride()
If in checker phase, assume row major strides and check cudnn support.

CudnnTensorAttributes has been removed from cudnnex. It is a vestige from earlier caching method, which can be seen in the stale cudnn_layernorm_ex.py.

Fixes #583 .

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

vedaanta commented 3 days ago

[ ] Wait for cudnn-fe v1.5.2. It will fix the pipeline failures.

tfogal commented 3 days ago

[ ] Wait for cudnn v1.5.2. It will fix the pipeline failures.

I assume you mean cudnn-fe? Is it reasonable to conditionalize the code (or just the test?) on 1.5.2? I worry about always requiring specific versions of packages, as I've seen other projects become very difficult to install due to long dependency chains with such requirements.

vedaanta commented 2 days ago

[ ] Wait for confirmation that these changes do solve #583

vedaanta commented 2 days ago

Is it reasonable to conditionalize the code (or just the test?) on 1.5.2? I worry about always requiring specific versions of packages, as I've seen other projects become very difficult to install due to long dependency chains with such requirements.

@tfogal I bumped up the minimum required cudnn-fe version to 1.5.2. cudnn-fe is meant to be forward and backward compatible with cudnn backend. Plus, cudnn-fe itself is backward compatible. So in case users have say cudnn-fe 1.3, they can easily move to 1.5.2 and it will work fine. (albeit some newer APIs will not work as expected.)

t-vi commented 2 days ago

Thank you all!

Lightning-AI / lightning-thunder