NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[PyTorch] Branching operations #1027

Closed timmoon10 closed 1 month ago

timmoon10 commented 2 months ago

Description

This PR modifies the operation-based API (https://github.com/NVIDIA/TransformerEngine/pull/707) to support some simple branching behavior: operations can now accept extra tensor inputs and generate extra tensor outputs. This enables fusions like GEMMs with beta=1:

```python
model = te.Sequential(
    MakeExtraOutput(),
    Linear(...),
    AddInPlace(),
)
y, linear_in = model(x, linear_out)  # GEMM with beta=1 into linear_out
...
loss.backward()  # dgrad GEMM with beta=1 into linear_in.grad
```

Support for multiple inputs will also be necessary for cross-attention (and SSMs?). Note that we do not plan to support more complicated graph structures, since that would take us down the road of building a general graph compiler.
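For reference, the "GEMM with beta=1" semantics the fusion targets are just accumulation into a preexisting output buffer, `C = beta * C + alpha * (A @ B)`. A minimal sketch of the math in NumPy (the variable names are illustrative, not part of the TransformerEngine API):

```python
import numpy as np

# beta=1 GEMM: the matmul result is accumulated into the existing
# output buffer rather than overwriting it.
A = np.arange(6, dtype=np.float64).reshape(2, 3)
B = np.ones((3, 2))
C = np.full((2, 2), 10.0)   # preexisting buffer, like linear_out above

C_new = 1.0 * C + A @ B     # beta=1, alpha=1
# C_new == [[13., 13.], [22., 22.]]
```

Fusing this into the GEMM kernel avoids materializing the matmul result and then running a separate elementwise add, which is what the `AddInPlace` operation above enables.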


timmoon10 commented 2 months ago

/te-ci pytorch

timmoon10 commented 2 months ago

/te-ci pytorch

timmoon10 commented 2 months ago

/te-ci pytorch

timmoon10 commented 2 months ago

/te-ci pytorch

ptrendx commented 1 month ago

Could you comment on how the change from your last commit fixed the unit test failures? The change from a list comprehension to a for loop should not change the behavior, right?
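For context on the question: a list comprehension and its explicit for-loop expansion build the same list, so in the simple case the rewrite is indeed behavior-preserving. A minimal illustration (not the PR's actual code):

```python
# A list comprehension and its for-loop expansion produce equal lists.
xs = [1, 2, 3]

comp = [x * 2 for x in xs]

loop = []
for x in xs:
    loop.append(x * 2)

assert comp == loop  # both are [2, 4, 6]
```

Differences can only arise from side effects around the rewrite, e.g. evaluation order relative to other statements, exception timing, or leaking the loop variable into the enclosing scope.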

timmoon10 commented 1 month ago

/te-ci pytorch