daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
67 stars 62 forks source link

OneHot kernel does not check bounds #687

Closed AlexRTer closed 7 months ago

AlexRTer commented 7 months ago

The OneHot kernel at src/runtime/local/kernels/OneHot.h currently does not check for bounds when encoding a row (l. 96).
Given a big enough value in arg this leads to overwriting memory outside of the DenseMatrix values array. An example of this is

infoMat = [2];
argMat = [10];

print(infoMat);
print(argMat);
print(oneHot(argMat, infoMat));

The index this writes to is 10 whereas the result matrix has shape 1x2. This often crashes during execution too.

Should an added bounds check simply ignore the value and log an error or throw an error instead?

pdamme commented 7 months ago

Thanks for reporting this issue, @AlexRTer! Bounds checks are always a trade-off between efficiency and safety/correctness. However, I think it would be better to insert a check here. If the runtime of oneHot() should ever become a problem, we can still think of ways to make it more efficient.

More precisely, we should check if valuesArg[carg] is in bounds, i.e., valueArg[carg] >= 0 && valuesArg[carg] < numDistinct]. If this condition is not satisfied, we should better throw an exception.

Writing beyond the end of the output buffer is just one problem that can occur without a bounds check. Besides that, the kernel could also write to some later position in the buffer that should actually stay a zero.

AlexRTer commented 7 months ago

I noticed that the documentation doc/DaphneDSL/Builtins.md allows for the info parameter to be -1 or any integer greater or equal to 0. However, the kernel itself l. 66-72 implies info should be a positive (non-zero) integer.

This is probably a typo in the documentation but as info determines the length of the vector for the OneHot encoding in a column, setting info to 0 could also be added as an option to ignore the corresponding arg column completely in res instead of copying or encoding the data. I am not sure how intuitive this would be though.

pdamme commented 7 months ago

Hi @AlexRTer, the kernel itself allows that the info of a column is -1. However, the kernel does not allow 0, while the docs say >= 0 is allowed. The simplest solution would be to correct the docs (>= 0 to > 0). However, I actually like the idea of using a 0 info to express that the corresponding column should be omited in the output. This provides a nice way to get only the one-hot-encoded columns, if desired. Feel free to implement this variant and update the docs.