JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/
Other
459 stars 141 forks source link

Incorrect results for `argmax` with multithreaded parsing #1128

Open yurivish opened 4 months ago

yurivish commented 4 months ago

Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.

Julia has a function called argmax which does this.

argmax(itr)

Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.

Unfortunately, the results I was getting back were incorrect – I called argmax on an integer column and got back an answer that was a valid index, but not the index of the maximum value.

After some digging, I was able to reduce the issue to a small example, which you can find below.

julia> using SentinelArrays

julia> arrays = [
         [18, 70, 92, 15, 65],
         [25, 14, 95, 54, 57]
       ];

julia> cv = ChainedVector(arrays);

julia> argmax(cv)
95

julia> argmax(collect(cv))
8

Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements argmax. It returns the maximum value, rather than its index.

The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.

This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect the column before calling argmax, then correct value is returned. If I load the file in with CSV.read(..., ntasks=1) I also see the correct results.

I saw that SentinelArrays has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.