Closed AlexRTer closed 7 months ago
Thanks for reporting this issue, @AlexRTer! Bounds checks are always a trade-off between efficiency and safety/correctness. However, I think it would be better to insert a check here. If the runtime of oneHot()
should ever become a problem, we can still think of ways to make it more efficient.
More precisely, we should check if valuesArg[carg]
is in bounds, i.e., valueArg[carg] >= 0 && valuesArg[carg] < numDistinct]
. If this condition is not satisfied, we should better throw an exception.
Writing beyond the end of the output buffer is just one problem that can occur without a bounds check. Besides that, the kernel could also write to some later position in the buffer that should actually stay a zero.
I noticed that the documentation doc/DaphneDSL/Builtins.md allows for the info
parameter to be -1
or any integer greater or equal to 0. However, the kernel itself l. 66-72 implies info
should be a positive (non-zero) integer.
This is probably a typo in the documentation but as info
determines the length of the vector for the OneHot encoding in a column, setting info
to 0 could also be added as an option to ignore the corresponding arg
column completely in res
instead of copying or encoding the data. I am not sure how intuitive this would be though.
Hi @AlexRTer, the kernel itself allows that the info
of a column is -1
. However, the kernel does not allow 0
, while the docs say >= 0
is allowed. The simplest solution would be to correct the docs (>= 0
to > 0
). However, I actually like the idea of using a 0
info to express that the corresponding column should be omited in the output. This provides a nice way to get only the one-hot-encoded columns, if desired. Feel free to implement this variant and update the docs.
The OneHot kernel at src/runtime/local/kernels/OneHot.h currently does not check for bounds when encoding a row (l. 96).
Given a big enough value in
arg
this leads to overwriting memory outside of the DenseMatrix values array. An example of this isThe index this writes to is
10
whereas the result matrix has shape1x2
. This often crashes during execution too.Should an added bounds check simply ignore the value and log an error or throw an error instead?