NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[Common/PyTorch] Grouped GEMM via multi-stream cuBLAS #853

Closed. yaox12 closed this pull request 3 months ago.

yaox12 commented 4 months ago

Description

This PR adds grouped GEMM for FP32/BF16/FP16/FP8 via multi-stream cuBLAS. It is intended for MoE training.
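For readers unfamiliar with the pattern, the sketch below illustrates the multi-stream idea at the PyTorch level: each GEMM in the group is launched on its own CUDA stream so that small per-expert matmuls can overlap on the GPU. This is only a conceptual illustration, not the code in this PR (which drives cuBLAS directly from the common C++/CUDA layer); the helper name `grouped_gemm_multistream` and the round-robin stream assignment are assumptions made for the example.

```python
import torch

def grouped_gemm_multistream(a_list, b_list, num_streams=4):
    """Conceptual sketch: run each GEMM of a group on its own CUDA stream
    so small per-expert matmuls can overlap. The real implementation calls
    cuBLAS from C++/CUDA; this only shows the scheduling pattern."""
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    outputs = [None] * len(a_list)
    current = torch.cuda.current_stream()
    for i, (a, b) in enumerate(zip(a_list, b_list)):
        s = streams[i % num_streams]
        s.wait_stream(current)        # wait until inputs are ready on the default stream
        with torch.cuda.stream(s):
            outputs[i] = a @ b        # one GEMM per group member, on its own stream
    for s in streams:
        current.wait_stream(s)        # rejoin the default stream before consumers run
    return outputs
```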


phu0ngng commented 4 months ago

/te-ci pytorch

yaox12 commented 4 months ago

Can you trigger the CI again?

yaox12 commented 4 months ago

Hi @phu0ngng, I just finished implementing the GroupedLinear layer and would like to include it in this PR. Could you please review it? Sorry for not having all the code ready for review at once.
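For context, a grouped linear layer applies a separate weight matrix to each contiguous chunk of input rows, as an MoE layer would apply one expert weight per group of routed tokens. The sketch below is a naive pure-PyTorch reference written for illustration only; it is not the TransformerEngine `GroupedLinear` API, and the class name, constructor arguments, and `m_splits` argument are assumptions made for this example.

```python
import torch
import torch.nn as nn

class NaiveGroupedLinear(nn.Module):
    """Illustrative reference (not the TE API): one weight matrix per group,
    applied to the corresponding chunk of input rows."""

    def __init__(self, num_gemms, in_features, out_features):
        super().__init__()
        self.weights = nn.Parameter(
            torch.empty(num_gemms, out_features, in_features))
        nn.init.normal_(self.weights, std=0.02)

    def forward(self, inp, m_splits):
        # m_splits[i] = number of rows (e.g. routed tokens) belonging to group i
        chunks = torch.split(inp, m_splits, dim=0)
        outs = [chunk @ w.t() for chunk, w in zip(chunks, self.weights)]
        return torch.cat(outs, dim=0)
```

The fused/multi-stream version in this PR targets the same computation but avoids launching each chunk's GEMM serially on a single stream.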

phu0ngng commented 3 months ago

/te-ci pytorch

phu0ngng commented 3 months ago

LGTM!

phu0ngng commented 3 months ago

/te-ci pytorch

phu0ngng commented 3 months ago

/te-ci pytorch