JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

Feature request: Auto-detect for repeated space delimited file #853

Open rashidrafeek opened 3 years ago

rashidrafeek commented 3 years ago

A lot of the files I work with are delimited by repeated spaces (i.e., delim=' ', ignorerepeated=true). It would be great if the default delimiter auto-detection could handle this case, even if only for spaces. Such files are read correctly by the DelimitedFiles stdlib and also by Pandas.

quinnj commented 3 years ago

I'm not sure what you mean by such files being handled by DelimitedFiles or pandas. Last I checked, neither does any kind of auto-detection of delimiters, let alone repeated delimiters. Can you share an example of that?

In terms of adding this new feature, I think it's possible, but it might be a little tricky. I guess we'd want to track when we encounter a space, and how many in a row we find, and then... what? If there isn't a consistent delimiter, we check whether there are a bunch of sequences of spaces? If we can come up with an algorithm, then it shouldn't be too hard to add to the detection.jl file.
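As a rough illustration of the "track when we encounter a space, and how many in a row" idea, here is a minimal sketch in plain Julia. The helper name `space_run_lengths` is hypothetical and is not part of CSV.jl or its detection.jl; it only shows how the run lengths could be collected for one line.

```julia
# Hypothetical helper (not part of CSV.jl): return the length of every run of
# consecutive spaces in `line`, in order of appearance.
function space_run_lengths(line::AbstractString)
    runs = Int[]
    current = 0
    for c in line
        if c == ' '
            current += 1              # still inside a run of spaces
        elseif current > 0
            push!(runs, current)      # a non-space character ends the run
            current = 0
        end
    end
    current > 0 && push!(runs, current)  # record a trailing run, if any
    return runs
end

# A right-aligned line built with the same widths (10, 15, 15) as a typical
# fixed-width data row.
line = lpad("1", 10) * lpad("0.50000", 15) * lpad("31.000", 15)
runs = space_run_lengths(line)

# Runs longer than one space hint at a repeated-space delimiter.
has_repeated_spaces = any(>(1), runs)
```

A line with no spaces at all, such as `"a,b,c"`, yields an empty vector, so this check alone would not interfere with a file that already has a consistent delimiter.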

rashidrafeek commented 3 years ago

Yeah. Here's an example:

julia> using Printf
       text = @sprintf  "%10s%15s%15s\n" "Column1" "Column2" "Column3"
       for i in 1:5
           text *= @sprintf "%10d%15.5f%15.3f\n" i i^2/2 sqrt(i)+30
       end
       open("test.txt","w") do f
           write(f,text)
       end;

shell> cat test.txt
   Column1        Column2        Column3
         1        0.50000         31.000
         2        2.00000         31.414
         3        4.50000         31.732
         4        8.00000         32.000
         5       12.50000         32.236

DelimitedFiles:

julia> using DelimitedFiles
       readdlm("test.txt")
6×3 Matrix{Any}:
  "Column1"    "Column2"    "Column3"
 1            0.5         31.0
 2            2.0         31.414
 3            4.5         31.732
 4            8.0         32.0
 5           12.5         32.236

Pandas has a custom function to read fixed width files, read_fwf():

julia> using PyCall
       pd = pyimport("pandas")
       pd.read_fwf("test.txt")
PyObject    Column1  Column2  Column3
0        1      0.5   31.000
1        2      2.0   31.414
2        3      4.5   31.732
3        4      8.0   32.000
4        5     12.5   32.236

But CSV.jl reads every line as a single String column:

julia> using CSV, DataFrames
       a= CSV.read("test.txt",DataFrame)
5×1 DataFrame
 Row │    Column1        Column2        Column3 
     │ String                                   
─────┼──────────────────────────────────────────
   1 │          1        0.50000         31.000
   2 │          2        2.00000         31.414
   3 │          3        4.50000         31.732
   4 │          4        8.00000         32.000
   5 │          5       12.50000         32.236

rashidrafeek commented 3 years ago

I tried understanding the logic in detection.jl. If it's okay, I'll try to come up with a PR.

I think we could count the number of columns whenever we encounter repeated spaces, and then confirm that the number of columns is the same in the rest of the lines we check (by default that is 10 lines, I think).
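The heuristic described above could be sketched roughly as follows. This is only an illustration in plain Julia, not CSV.jl's detection code: the function name `detect_repeated_space_columns` and the `nlines` keyword are hypothetical, and the default of 10 lines is taken from the guess in the comment rather than from the library.

```julia
# Hypothetical sketch (not CSV.jl's detection.jl): decide whether the first
# `nlines` non-blank lines look like a repeated-space-delimited table.
# Returns the column count if every checked line splits into the same number
# of fields and at least one line contains a run of two or more spaces;
# otherwise returns `nothing`.
function detect_repeated_space_columns(io::IO; nlines::Integer = 10)
    ncols = 0
    saw_repeat = false
    checked = 0
    for line in eachline(io)
        checked >= nlines && break
        stripped = strip(line)
        isempty(stripped) && continue
        # Splitting with keepempty=false collapses runs of spaces and drops
        # the leading padding of right-aligned columns.
        n = length(split(line, ' '; keepempty = false))
        if checked == 0
            ncols = n
        elseif n != ncols
            return nothing            # inconsistent column count
        end
        saw_repeat |= occursin("  ", stripped)
        checked += 1
    end
    return (saw_repeat && ncols > 1) ? ncols : nothing
end

# Sample with the same layout as a fixed-width file with three columns.
sample = join([
    lpad("Column1", 10) * lpad("Column2", 15) * lpad("Column3", 15),
    lpad("1", 10) * lpad("0.50000", 15) * lpad("31.000", 15),
    lpad("2", 10) * lpad("2.00000", 15) * lpad("31.414", 15),
], "\n")

ncols = detect_repeated_space_columns(IOBuffer(sample))
```

This sketch ignores quoting, so a quoted field that contains spaces would be miscounted; a real implementation would presumably need to handle that alongside the existing delimiter candidates.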