JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

[Bug] CSV.read randomly changes eltype of column #1089

Closed hungpham3112 closed 1 year ago

hungpham3112 commented 1 year ago

Steps to reproduce:

I tested the CSV file in Python, and the first column always has the same data type (Float64), so the problem is not with the CSV file. I then tried the above snippet in both Jupyter Notebook and Pluto, and both show the same bug, so the problem lies with CSV.read and CSV.File.

Vid:

https://github.com/JuliaData/CSV.jl/assets/75968004/dbcbe99e-85fa-4091-bddf-7a2cd1aa8e01

https://github.com/JuliaData/CSV.jl/assets/75968004/727fc383-d01b-4fa9-be24-d885b4c6024a

Versioninfo:

Julia Version 1.9.0
Commit 8e63055292 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_DEPOT_PATH = C:\Users\sofia\.julia;C:\Users\sofia\.julia\juliaup\julia-1.9.0+0.x64.w64.mingw32\local\share\julia;C:\Users\sofia\.julia\juliaup\julia-1.9.0+0.x64.w64.mingw32\share\julia
  JULIA_LOAD_PATH = C:\Users\sofia\AppData\Local\Temp\jl_MjE6XO;@;@v#.#;@stdlib
  JULIA_NUM_THREADS = 8
  JULIA_PROJECT = C:\Users\sofia\JuliaProjects\MachineLearning\LinearRegression\Project.toml
  JULIA_REVISE_WORKER_ONLY = 1

Liozou commented 1 year ago

Hi, and thank you for the bug report! Would you mind testing whether this still occurs after updating CSV.jl? Version 0.10.11 (tagged yesterday) includes https://github.com/JuliaData/CSV.jl/pull/1073, which is intended to fix this kind of issue.

hungpham3112 commented 1 year ago

I tested it: the data race happens less often, but the problem is still there. Moreover, this package now sometimes causes Pluto to hang for about 5 minutes, I think because of the data race.

https://github.com/JuliaData/CSV.jl/assets/75968004/f6a5b0f0-fb1c-4b85-94a3-692f6212171a

My thought: if I run the code a single time, meaning I run it and wait until it finishes before continuing, there is no problem with the type. But if I run it many times in quick succession, as I do in the video, a data race happens across multiple cores (8 cores in my example). I don't know whether this is right, so please explain it to me.
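
One way to probe this hypothesis, sketched here under assumptions, is to compare a parse restricted to a single task with one where the column type is pinned explicitly. The `ntasks` and `types` keyword arguments are part of CSV.jl; the data and column names below are hypothetical stand-ins, because the original file and snippet are not shown in this thread. A file this small would likely not be split across tasks anyway, so this only illustrates the calls.

```julia
using CSV, DataFrames

# Hypothetical stand-in data: the real file from the report is not available here.
data = "1.5,a\n2.5,b\n3.5,c\n"
headers = ["x", "label"]  # assumed column names, for illustration only

# ntasks=1 asks CSV.jl to parse on a single task instead of splitting the input into chunks.
df_single = CSV.read(IOBuffer(data), DataFrame; header=headers, ntasks=1)

# types pins the element type of column 1, so that column's type is not inferred.
df_typed = CSV.read(IOBuffer(data), DataFrame; header=headers, types=Dict(1 => Float64))

println(eltype(df_single[!, 1]))
println(eltype(df_typed[!, 1]))
```

If the first column's element type stays stable with `ntasks=1` but varies with the default setting, that would point towards the multithreaded parsing path rather than the file itself.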

Liozou commented 1 year ago

Ah, that's unfortunate and unexpected. It seems I cannot reproduce the issue: I tried running a Pluto notebook with the same environment (JULIA_NUM_THREADS=8 JULIA_REVISE_WORKER_ONLY=1 ~/julia-1.9.0/bin/julia --startup-file=no -e "using Pluto; Pluto.run()") and put the code from your initial message in it, one line per cell. Then I did as in your video, re-running the df definition cell repeatedly, even just holding Shift+Enter down for a while, but I never saw the type of the first column change. I also tried the following to automate things a bit:

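using CSV, DataFrames, HTTP

# `filename` and `headers` are the ones defined in the snippet from the initial message.
body = HTTP.get(filename).body
for _ in 1:10000
    # Re-parse the same bytes and stop as soon as the first column's eltype differs.
    df2 = CSV.read(body, DataFrame, header=headers)
    if eltype(df2[!,1]) != Int64
        error("Encountered: $(eltype(df2[!,1]))")
    end
end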

but no error occurs.

Just to check if it can be something else in the configuration, can you please check the output of Base.Threads.nthreads() in one cell of your Pluto notebook, as well as that of import Pkg; Pkg.status()? Mine yields respectively 8 and

Status `/tmp/jl_pNSR9l/Project.toml`
  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0
  [cd3eb016] HTTP v1.9.6
  [44cfe95a] Pkg v1.9.0
  [10745b16] Statistics v1.9.0

hungpham3112 commented 1 year ago

Here is the output: image

I can reproduce the error with your setup, so maybe your OS is different from mine. I'm testing on Windows 11 with PowerShell 7.2.

https://github.com/JuliaData/CSV.jl/assets/75968004/fd088aa2-00bc-489c-8cbe-50075e98e442

Liozou commented 1 year ago

Thanks for checking: apparently you are still using CSV v0.10.10, but the bugfix I mentioned was only released starting with CSV v0.10.11, which explains why you are still seeing this bug. Would you mind updating the package and letting us know whether the bug still occurs afterwards? To update, run Pkg.update("CSV") from a cell of your notebook (or simply Pkg.update() to update all packages in your environment): somewhere in the output you should see a line stating

  [336ed68f] ↑ CSV v0.10.10 ⇒ v0.10.11
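
A minimal sketch for confirming which CSV.jl version the running session actually loaded, assuming Julia 1.9 or later (where `pkgversion` is available). The helper name `has_fix` is hypothetical, and the v0.10.11 threshold comes from the message above. A session that already loaded CSV before the update may keep running the old code until it is restarted, so it can be worth checking this after a restart.

```julia
using CSV

# Hypothetical helper: true when the version is at least the release said to contain the fix.
has_fix(v::VersionNumber) = v >= v"0.10.11"

# pkgversion reports the version of the module loaded in this session.
loaded = pkgversion(CSV)
println("Loaded CSV version: ", loaded)

if !has_fix(loaded)
    @warn "CSV.jl is older than v0.10.11; try Pkg.update(\"CSV\") and restart the session."
end
```
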

hungpham3112 commented 1 year ago

I realized that I had only updated my local environment, not Pluto's. Sorry about that. The first time I checked, the data race still occurred, but the second and third times everything was OK. There's something weird here, or maybe a problem with multithreading. We need more people to validate this behavior. Thanks.

hungpham3112 commented 1 year ago

Hi, today I came back to the problem and the data race no longer occurs. My guess is that the last time, after I updated CSV from v0.10.10 to v0.10.11, a temporary file still existed on my local machine, so the bug still occurred. #1073 definitely fixes this issue. Thanks for the hard work. I will close this issue.