cyanguwa commented 3 weeks ago

Description

This PR reduces the THD offset tensors from four (seq_offsets_q, seq_offsets_k, seq_offsets_v, seq_offsets_o) to two (cu_seqlens_q_padded, cu_seqlens_kv_padded).

Before this PR, for the THD_THD_THD layout, users needed to calculate these four tensors:

seq_offsets_q = config.num_heads * config.head_dim * cu_seqlens_q_padded
seq_offsets_k = config.num_gqa_groups * config.head_dim * cu_seqlens_kv_padded
seq_offsets_v = config.num_gqa_groups * config.head_dim * cu_seqlens_kv_padded
seq_offsets_o = config.num_heads * config.head_dim * cu_seqlens_q_padded

With this PR, users only need to provide two tensors, cu_seqlens_q_padded and cu_seqlens_kv_padded, which are easier to understand and utilize correctly.
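For anyone migrating, here is a minimal sketch in plain Python (not the TransformerEngine API) of how the four pre-PR offset tensors relate to the two padded cu_seqlens tensors. The `Config` values and the helper name `legacy_thd_offsets` are hypothetical and chosen only for illustration; real code would operate on tensors rather than lists.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical values, for illustration only.
    num_heads: int = 16
    num_gqa_groups: int = 4
    head_dim: int = 64


def legacy_thd_offsets(config, cu_seqlens_q_padded, cu_seqlens_kv_padded):
    """Recompute the four pre-PR offset lists from the two padded cu_seqlens lists."""
    # Elements per token for Q/O and for K/V.
    q_stride = config.num_heads * config.head_dim
    kv_stride = config.num_gqa_groups * config.head_dim
    seq_offsets_q = [q_stride * t for t in cu_seqlens_q_padded]
    seq_offsets_k = [kv_stride * t for t in cu_seqlens_kv_padded]
    # V shares the K offsets and O shares the Q offsets.
    seq_offsets_v = list(seq_offsets_k)
    seq_offsets_o = list(seq_offsets_q)
    return seq_offsets_q, seq_offsets_k, seq_offsets_v, seq_offsets_o


config = Config()
offsets = legacy_thd_offsets(config, [0, 2, 4, 7, 9], [0, 2, 4, 7, 9])
```

Since Q/O and K/V each differ from the padded cu_seqlens only by a constant per-token stride, the two padded tensors carry all the information the four offset tensors did.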

To illustrate the difference between cu_seqlens and cu_seqlens_padded: the batch [a, PAD, b, b, c, PAD, PAD, d, d] contains 4 sequences, so cu_seqlens = [0, 1, 3, 4, 6] and cu_seqlens_padded = [0, 2, 4, 7, 9].
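The same example can be reproduced with a small plain-Python sketch. The helper `cu_seqlens_from_batch` is hypothetical (it is not part of TransformerEngine), and it assumes each sequence is a run of identical labels followed by the PAD tokens that belong to it.

```python
from itertools import accumulate

PAD = "PAD"


def cu_seqlens_from_batch(tokens):
    """Return (cu_seqlens, cu_seqlens_padded) for a flat list of token labels."""
    seqlens, padded_lens = [], []
    current = None
    for tok in tokens:
        if tok != PAD and tok != current:
            # A new label starts a new sequence.
            current = tok
            seqlens.append(0)
            padded_lens.append(0)
        if not padded_lens:
            raise ValueError("batch must not start with padding")
        if tok != PAD:
            seqlens[-1] += 1  # count real tokens only
        padded_lens[-1] += 1  # count real tokens and trailing padding
    # Cumulative sums with a leading zero.
    cu_seqlens = [0, *accumulate(seqlens)]
    cu_seqlens_padded = [0, *accumulate(padded_lens)]
    return cu_seqlens, cu_seqlens_padded


batch = ["a", PAD, "b", "b", "c", PAD, PAD, "d", "d"]
cu_seqlens, cu_seqlens_padded = cu_seqlens_from_batch(batch)
```

Here cu_seqlens counts only real tokens, while cu_seqlens_padded also counts each sequence's trailing padding, so it marks where each sequence starts in the padded buffer.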

Type of change

[ ] Documentation change (change only to the documentation, either a fix or a new content)
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Infra/Build change
[ ] Code refactor

Changes

Please list the changes introduced in this PR:

Reduced the THD offset tensors from four to two
This is a breaking change compared to #832

Checklist:

[x] I have read and followed the contributing guidelines
[x] The functionality is complete
[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[x] My changes generate no new warnings
[x] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes

cyanguwa commented 2 weeks ago

/te-ci pytorch

cyanguwa commented 2 weeks ago

/te-ci jax

cyanguwa commented 2 weeks ago

/te-ci pytorch

cyanguwa commented 2 weeks ago

/te-ci pytorch

cyanguwa commented 2 weeks ago

/te-ci pytorch

cyanguwa commented 2 weeks ago

/te-ci jax

cyanguwa commented 2 weeks ago

/te-ci paddle

cyanguwa commented 2 weeks ago

/te-ci paddle

zlsh80826 commented 2 weeks ago

Hi @cyanguwa, I remember that we have three API changes pending:

1. Support separate q/kv actual_seqlen, offsets for qkvpacked API
2. Support seqlens to avoid cu_seqlens -> seqlens -> cu_seqlens
3. Simplify THD format APIs (this PR)

Do you have an estimated timeline for items 1 and 2? Should we also change them in this PR?

cyanguwa commented 2 weeks ago

Hi @zlsh80826,

Yes, this PR is focused only on item 3. I wanted to get this done first so there is no API change between v1.8 and v1.9. I'm still evaluating the benefits/changes for items 1 and 2, but they will cause breaking API changes anyway. There is no urgency for them (well, not as much as for item 3, given the code freeze coming up).

Thanks for reviewing.

NVIDIA / TransformerEngine

[C/PyTorch] Simplify THD offset tensors #927
