Currently, device tile ops assign the device stream based on the result tile's range ordinal. For tasks that fuse multiple ops (e.g., scale + permute) this is not appropriate: the constituent ops may launch kernels into different streams, potentially violating the sequencing of the ops. The solution is to use the ordinal-based stream assignment only if a stream has not already been assigned.
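The "assign only if not already assigned" rule can be illustrated with a minimal, device-free sketch. Everything here is hypothetical and is not the actual API: `StreamId` stands in for a real stream handle, `kNumStreams` is an assumed pool size, and `tls_stream`, `stream_from_ordinal`, `stream_for_op`, `TaskStreamScope`, and `fused_scale_permute` are illustrative names. The mapping from ordinal to stream (a simple modulo) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical stand-in for a device stream handle: an index into a stream pool.
using StreamId = std::size_t;

// Assumed size of the stream pool.
constexpr std::size_t kNumStreams = 4;

// Stream already chosen for the task running on this thread, if any.
inline thread_local std::optional<StreamId> tls_stream;

// Ordinal-based choice (assumed here to be a simple modulo over the pool).
inline StreamId stream_from_ordinal(std::size_t range_ordinal) {
  return range_ordinal % kNumStreams;
}

// Returns the stream an op should launch into. The ordinal-based choice is
// used only when no stream has been assigned to the current task yet.
inline StreamId stream_for_op(std::size_t result_range_ordinal) {
  if (!tls_stream) tls_stream = stream_from_ordinal(result_range_ordinal);
  return *tls_stream;
}

// Clears the per-task stream at the start and end of a task.
struct TaskStreamScope {
  TaskStreamScope() { tls_stream.reset(); }
  ~TaskStreamScope() { tls_stream.reset(); }
};

// A fused task: two constituent ops that would pick different streams if
// each one independently derived its stream from its own ordinal.
inline std::pair<StreamId, StreamId> fused_scale_permute(
    std::size_t scale_ordinal, std::size_t permute_ordinal) {
  TaskStreamScope scope;
  const StreamId scale_stream = stream_for_op(scale_ordinal);      // assigns
  const StreamId permute_stream = stream_for_op(permute_ordinal);  // reuses
  assert(scale_stream == permute_stream);  // both ops share one stream
  return {scale_stream, permute_stream};
}
```

With this scheme the first op in a fused task fixes the stream and every later op in the same task reuses it, so their kernels are ordered by the stream itself; standalone ops still get the ordinal-based assignment.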