Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.
Julia has a function called argmax which does this.
argmax(itr)
Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.
Unfortunately, the results I was getting back were incorrect – I called argmax on an integer column and got back an answer that was a valid index, but not the index of the maximum value.
After some digging, I was able to reduce the issue to a small example, which you can find below.
Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements argmax. It returns the maximum value, rather than its index.
The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.
This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect the column before calling argmax, then correct value is returned. If I load the file in with CSV.read(..., ntasks=1) I also see the correct results.
I saw that SentinelArrays has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.
Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.
Julia has a function called argmax which does this.
Unfortunately, the results I was getting back were incorrect – I called
argmax
on an integer column and got back an answer that was a valid index, but not the index of the maximum value.After some digging, I was able to reduce the issue to a small example, which you can find below.
Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements
argmax
. It returns the maximum value, rather than its index.The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.
This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I
collect
the column before callingargmax
, then correct value is returned. If I load the file in withCSV.read(..., ntasks=1)
I also see the correct results.I saw that
SentinelArrays
has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.