[AIEObjectFifoStatefulTransform] Fix consumerDims array out-of-bounds access

Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.

Other

260 stars 76 forks source link

[AIEObjectFifoStatefulTransform] Fix consumerDims array out-of-bounds access #1587

Open jtuyls opened 1 week ago

jtuyls commented 1 week ago

In case of multiple ObjectFifo consumers with a single implicit fromStream BD layout attribute, the consumerDims array can contain a single element and consumerDims[consumerIndex] can be an out-of-bounds access. Weird enough, this should happen all the time, but didn't lead to an issue until I started encountering it in iree-amd-aie with bumped MLIR-AIE/IREE/LLVM versions.

AndraBisca commented 1 week ago

Just want to check if I understand the reasoning here. Is an example of this then an ObjectFifo describing a broadcast (i.e., multiple consumers) where all the consumers will receive the data in the layout described by a single fromStream attribute? In its current syntax, the fromStream attribute follows after each consumerTile to describe the data layout that that tile will receive.

jtuyls commented 1 week ago

Just want to check if I understand the reasoning here. Is an example of this then an ObjectFifo describing a broadcast (i.e., multiple consumers) where all the consumers will receive the data in the layout described by a single fromStream attribute? In its current syntax, the fromStream attribute follows after each consumerTile to describe the data layout that that tile will receive.

Yes

AndraBisca commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

AndraBisca commented 1 week ago

Ah I see a change in the PR name. I assume then that this PR fixes the former part of my previous comment, Could you please add a unit test to showcase this fix?

jtuyls commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

AndraBisca commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

What does an "implicit" transformation mean? My point is that either there is no transformations at all for any of the consumers or there is one explicitly written for each consumer that requires one.

jtuyls commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

What does an "implicit" transformation mean? My point is that either there is no transformations at all for any of the consumers or there is one explicitly written for each consumer that requires one.

Well, that's not what I am seeing. When I execute within iree-amd-aie, I see a single BDDimLayoutArrayAttr for multiple consumers. Somehow, this doesn't seem to be the case when I try within standalone MLIR-AIE..

Example:

aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2, %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>

Implicit means that there is no explicit fromStream access pattern specified, like in this test: https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

So, there is an existing test that should showcase this behaviour, but somehow, it doesn't within standalone MLIR-AIE. Looking into it..

AndraBisca commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

What does an "implicit" transformation mean? My point is that either there is no transformations at all for any of the consumers or there is one explicitly written for each consumer that requires one.

Well, that's not what I am seeing. When I execute within iree-amd-aie, I see a single BDDimLayoutArrayAttr for multiple consumers. Somehow, this doesn't seem to be the case when I try within standalone MLIR-AIE..

Example:
aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2, %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>
Implicit means that there is no explicit fromStream access pattern specified, like in this test:

https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

So, there is an existing test that should showcase this behaviour, but somehow, it doesn't within standalone MLIR-AIE. Looking into it..

Ah you meant the toStream attribute I believe then.. Not the fromStream ones. In this case there are no fromStream attributes at all. The toStream attribute will lower to the DMA of the producer while each of the fromStream attributes will lower to the DMAs of each consumerTile (if there is a fromStream attribute for that tile).

jtuyls commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

What does an "implicit" transformation mean? My point is that either there is no transformations at all for any of the consumers or there is one explicitly written for each consumer that requires one.

Well, that's not what I am seeing. When I execute within iree-amd-aie, I see a single BDDimLayoutArrayAttr for multiple consumers. Somehow, this doesn't seem to be the case when I try within standalone MLIR-AIE.. Example:
aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2, %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>
Implicit means that there is no explicit fromStream access pattern specified, like in this test: https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

So, there is an existing test that should showcase this behaviour, but somehow, it doesn't within standalone MLIR-AIE. Looking into it..
Ah you meant the toStream attribute I believe then.. Not the fromStream ones. In this case there are no fromStream attributes at all.

No, I mean fromStream.

In this case there are no fromStream attributes at all.

This is what I mean with implicit fromStream

AndraBisca commented 1 week ago

I think that doesn't make sense with the current MLIR syntax then. It would make sense to say that the consumerDims can have a size different than the number of consumers, where consumers that don't have a data layout transformation do not access the consumerDims array (which would lead to an index out of bounds). Otherwise, how would we differentiate between a broadcast where the data layout transformation needs to be broadcasted to all consumers, and one where only the tile for which the data layout transformation is provided should receive it?

I am not changing anything. This case currently comes up if there is an implicit BDDimLayoutArrayAttr for every consumer, leading to an out-of-bounds access.

What does an "implicit" transformation mean? My point is that either there is no transformations at all for any of the consumers or there is one explicitly written for each consumer that requires one.

Well, that's not what I am seeing. When I execute within iree-amd-aie, I see a single BDDimLayoutArrayAttr for multiple consumers. Somehow, this doesn't seem to be the case when I try within standalone MLIR-AIE.. Example:
aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2, %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>
Implicit means that there is no explicit fromStream access pattern specified, like in this test: https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

So, there is an existing test that should showcase this behaviour, but somehow, it doesn't within standalone MLIR-AIE. Looking into it..
Ah you meant the toStream attribute I believe then.. Not the fromStream ones. In this case there are no fromStream attributes at all.
No, I mean fromStream.

In this case there are no fromStream attributes at all.

This is what I mean with implicit fromStream

I think we need to look at some examples and define what is the legal syntax and how it will lower:

This:

aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2, %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>

describes a broadcast from %tile_0_1 to %tile_0_2 and %tile_0_3. The toStream attribute will lower to the BDDimLayoutArrayAttr in the DMABDOps of %tile_0_1's DMA. As there are no fromStream attributes, there will be no BDDimLayoutArrayAttr in the DMABDOps of the consumer tiles' DMAs.

This:

aie.objectfifo @obj2(%tile_0_1 toStream [<size = 8, stride = 4>, <size = 32, stride = 32>, <size = 4, stride = 1>], {%tile_0_2 fromStream [...], %tile_0_3}, 4 : i32) : !aie.objectfifo<memref<1024xi32, 1>>

describes a broadcast from %tile_0_1 to %tile_0_2 and %tile_0_3. The toStream attribute will lower as above. In this example there is a fromStream attribute for %tile_0_2 which will lower to the BDDimLayoutArrayAttr in the DMABDOps of %tile_0_2's DMA. As there is no fromStream attribute for %tile_0_3 there will be no BDDimLayoutArrayAttr in the DMABDOps of this tile's DMA. If that is not the case in the current lowering, then indeed it should be an error. But the fix to that error must lower as described. The fromStream of %tile_0_2 shouldn't be applied to %tile_0_3.

jtuyls commented 1 week ago

As there are no fromStream attributes, there will be no BDDimLayoutArrayAttr in the DMABDOps of the consumer tiles DMAs.

This is not true. If you run the above test, you will see that an implicit fromStream results in multiple consumer dims. Otherwise, you would see this out-of-bounds error all the time (consumerDims[consumerIndex]).

https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

AndraBisca commented 1 week ago

As there are no fromStream attributes, there will be no BDDimLayoutArrayAttr in the DMABDOps of the consumer tiles DMAs.

This is not true. If you run the above test, you will see that an implicit fromStream results in multiple consumer dims. Otherwise, you would see this out-of-bounds error all the time (consumerDims[consumerIndex]).

https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

If that is the case then the test should've failed. The expected output is this https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L99

jtuyls commented 1 week ago

As there are no fromStream attributes, there will be no BDDimLayoutArrayAttr in the DMABDOps of the consumer tiles DMAs.

This is not true. If you run the above test, you will see that an implicit fromStream results in multiple consumer dims. Otherwise, you would see this out-of-bounds error all the time (consumerDims[consumerIndex]). https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

If that is the case then the test should've failed. The expected output is this

https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L99

It's not failing, so consumerDims is not empty, it's actually the size of the number of consumers. However, when see this case within iree-amd-aie, the array actually has size 1.

AndraBisca commented 1 week ago

As there are no fromStream attributes, there will be no BDDimLayoutArrayAttr in the DMABDOps of the consumer tiles DMAs.

This is not true. If you run the above test, you will see that an implicit fromStream results in multiple consumer dims. Otherwise, you would see this out-of-bounds error all the time (consumerDims[consumerIndex]). https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L172

If that is the case then the test should've failed. The expected output is this https://github.com/Xilinx/mlir-aie/blob/4c12592be13d8f62d4eec265dca94e37a9e2dc21/test/objectFifo-stateful-transform/nd_dma_distribute_broadcast_AIE2.mlir#L99

It's not failing, so consumerDims is not empty, it's actually the size of the number of consumers. However, when see this case within iree-amd-aie, the array actually has size 1.

I will take a look at the test.

andrej commented 1 week ago

I think we are having two discussions here: (1) a bug due to an out-of-bounds access and (2) a discussion on the semantics of fromStream if it is not explicitly specified for all consumers.

As for (1): parseObjectFifoConsumerTiles() is supposed to initialize an empty array if no transformations are explicitly specified:

    // By default, create empty dimensions array for each consumer; this way,
    // we can be certain to have as many entries in the dimensions array as
    // there are customer
    BDDimLayoutArrayAttr dimAttr =
        BDDimLayoutArrayAttr::get(parser.getContext(), {});

Therefore, this access in the ObjectFifoStatefulTransform should always suceed (if no fromStream is specified, it should return an emtpy array as created above):

 ArrayRef<BDDimLayoutAttr>{consumerDims[consumerIndex]}

@jtuyls Can you share the example where you are seeing the bug/crash?

As for (2), if I believe the current semantics are supposed to be: fromStream transformations are per-consumer. If a consumer does not explicitly specify a transformation, but others do, that consumer will not have a transformation. Transformations are not supposed to be broadcast. If we are currently doing that anywhere, that would be wrong, imo.

Maybe let's fix (1) first and discuss (2) separately?

jtuyls commented 6 days ago

I think we are having two discussions here: (1) a bug due to an out-of-bounds access and (2) a discussion on the semantics of fromStream if it is not explicitly specified for all consumers.

As for (1): parseObjectFifoConsumerTiles() is supposed to initialize an empty array if no transformations are explicitly specified:
    // By default, create empty dimensions array for each consumer; this way,
    // we can be certain to have as many entries in the dimensions array as
    // there are customer
    BDDimLayoutArrayAttr dimAttr =
        BDDimLayoutArrayAttr::get(parser.getContext(), {});
Therefore, this access in the ObjectFifoStatefulTransform should always suceed (if no fromStream is specified, it should return an emtpy array as created above):
 ArrayRef<BDDimLayoutAttr>{consumerDims[consumerIndex]}

The problem is that this logic that creates empty BDDimLayoutArrayAttr objects for every consumer is part of the parser and it's possible to create in-memory aie.objectfifo operations that have any number of layout attributes. This can be fixed by either:

Supporting these cases. Imo, specifying 0 or 1 layout attributes should be valid, see discussion here: https://github.com/Xilinx/mlir-aie/pull/1587#discussion_r1661102475
Updating the verifier to make ensure nb_consumers == nb_layout_attrs

AndraBisca commented 5 days ago

I think the second solution is prefferable because it would better test the currently intended behavior. Maybe instead of having the parser insert empty dim arrays for consumers, this should be done in the verifier of the object fifo, with a check just after? What do you think @andrej ? I can't remember if there was a particular reason we decided to add it to the parser originally.

andrej commented 5 days ago

I think I put it there. It made sense to me because the parser is where we create the object. It would be weird to me to have a verifier change the object it is supposed to verify. The name verifier suggests to me it only looks at the object, does not modify it.

However, if that's the pragmatic solution, we can move it there.

As an alternative, which I think @jtuyls might have already sort of suggested, we could just add an assertion to the verifier. If someone wants to create the object in-memory, without going through the parser, they can still do so, but they must initialize consumerDims correctly, or the assertion fails. If we go this route, adding an instructive comment right next to the assertion would probably be good.