Patch vectorization on permuted inputs for PadOp

NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Other

271 stars 53 forks source link

Patch vectorization on permuted inputs for PadOp #3439

Closed jjsjann123 closed 1 day ago

jjsjann123 commented 3 days ago

What's in this PR:

Avoid vectorization validation check on consumer of vectorized op. We believe it's safe to do so, since the allocation itself doesn't use that allocation domain.
Add a cpp test where the consumer of PadOp has an allocation domain that's not consistent with the producer.

For future reference: Alternatively, we can add a set after PadOp to mimic a cache on input. This allows us to propagate the allocation domain from input to the output of PadOp, which is the consumer of the vectorized op; while still preserving the allocation domain on the original out and propagate it to be the output from set. We decided not to pursue that, because the validation doesn't seem to be a right check.

jjsjann123 commented 3 days ago

!test

jjsjann123 commented 3 days ago

!test

jjsjann123 commented 3 days ago

!test

jjsjann123 commented 3 days ago

!test

jjsjann123 commented 3 days ago

!test

jjsjann123 commented 2 days ago

I understand that the validation analysis correctly complained about the vectorized ID not being as the innermost ID. But then why didn't the vectorization analysis give up vectorizing the pad op? Doesn't the vectorization analysis also check the allocation domains of intermediate tensors?

Vectorizatoin analysis is done on an unscheduled fusion. i.e. In the analysis we assumes that the scheduler is smart enough to apply the correct transformation to allow vectorization.

Vectorization analysis only checks the allocation domain on input/ouput tensors. We don't care about allocation domain on intermediate, because we insert cacheBefore/cacheAfter and propagate the allocation domain on the cached TVs.

naoyam commented 2 days ago

I'm wondering if this may be actually an issue of the validation. The allocation domain of the consumer tensor doesn't actually matter, so why should it be validated?

jjsjann123 commented 2 days ago

!test

jjsjann123 commented 1 day ago

Looks like failures are coming from MultiheadAttention_SP and doesn't seem to be related. (saw internal discussion on that)

@naoyam does this look like the right skip you wanted?

jjsjann123 commented 1 day ago

!build