Kernels read keys and values out of bounds

natevm commented 11 months ago

Hello,

I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.

See these lines below: https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L133-L137

No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.

Also an issue here:

https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L348-L360

Later on, the number of keys is checked, but by that point it's too late: https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L369-L371

I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.

Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.

jlacroixAMD commented 11 months ago

Thank you for reporting this. I'll file a ticket internally and we'll get it fixed in the next release.

jlacroixAMD commented 11 months ago

As I just realized this was reported on the old Parallel Sort sample, a fix to address this will be pushed with the next version of the FidelityFX SDK (which is how we are pushing out most updates to our older features now - https://github.com/GPUOpen-LibrariesAndSDKs/FidelityFX-SDK).

Also, in order to keep the GPU code as fast as possible, the fix will likely be done as a check on the NumKeys value at CPU time with an error code returned in the data setup stage.

natevm commented 11 months ago

Fwiw this sort implementation has been very helpful.

Small feature request, it would be great to also have a separate dedicated prefix sum/scan and a parallel device selection / compaction. Both of these are used internally by the radix sorter, but would also be helpful standalone.

jlacroixAMD commented 11 months ago

I'll add it to the list of planned improvements to existing samples. Cheers!

GPUOpen-Effects / FidelityFX-ParallelSort

Kernels read keys and values out of bounds #4