Open rashidrafeek opened 3 years ago
I'm not sure what you mean by "such files are handled by DelimitedFiles" or pandas. Last I checked, neither does any kind of auto-detection of delimiters, let alone repeated delimiters. Can you share an example?
In terms of adding this new feature, I think it's possible, but it might be a little tricky. I guess we'd want to track when we encounter a space and how many we find in a row, and then... what? If there isn't a consistent delimiter, we check whether there are a bunch of sequences of spaces? If we can come up with an algorithm, it shouldn't be too hard to add to the detection.jl file.
Yeah. Here's an example:
julia> using Printf
text = @sprintf "%10s%15s%15s\n" "Column1" "Column2" "Column3"
for i in 1:5
    text *= @sprintf "%10d%15.5f%15.3f\n" i i^2/2 sqrt(i)+30
end
open("test.txt", "w") do f
    write(f, text)
end;
shell> cat test.txt
   Column1        Column2        Column3
         1        0.50000         31.000
         2        2.00000         31.414
         3        4.50000         31.732
         4        8.00000         32.000
         5       12.50000         32.236
DelimitedFiles:
julia> using DelimitedFiles
readdlm("test.txt")
6×3 Matrix{Any}:
"Column1" "Column2" "Column3"
1 0.5 31.0
2 2.0 31.414
3 4.5 31.732
4 8.0 32.0
5 12.5 32.236
Pandas has a dedicated function for reading fixed-width files, read_fwf():
julia> using PyCall
pd = pyimport("pandas")
pd.read_fwf("test.txt")
PyObject Column1 Column2 Column3
0 1 0.5 31.000
1 2 2.0 31.414
2 3 4.5 31.732
3 4 8.0 32.000
4 5 12.5 32.236
But CSV reads each line as a single String column, so the whole file ends up as one column:
julia> using CSV, DataFrames
a = CSV.read("test.txt", DataFrame)
5×1 DataFrame
Row │ Column1 Column2 Column3
│ String
─────┼──────────────────────────────────────────
1 │ 1 0.50000 31.000
2 │ 2 2.00000 31.414
3 │ 3 4.50000 31.732
4 │ 4 8.00000 32.000
5 │ 5 12.50000 32.236
I tried to understand the logic in detection.jl. If it's okay, I'll try to come up with a PR.
I think we could count the number of columns when we encounter repeated spaces, and then confirm that the number of columns is the same in the rest of the lines we check (10 lines by default, I think).
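Here is a rough sketch of that check, just to make the idea concrete. It is not CSV.jl code and does not touch detection.jl: the function name detect_repeated_space and the nlines keyword are hypothetical, and the default of 10 sampled lines is only my guess from above. It uses nothing beyond Base.

```julia
# Hypothetical helper, not part of CSV.jl.
# Returns the column count if the first `nlines` non-empty lines all split
# into the same number of space-separated fields and at least one line
# contains a run of two or more spaces; returns `nothing` otherwise.
function detect_repeated_space(io::IO; nlines::Int = 10)
    ncols = 0          # field count of the first sampled line
    saw_run = false    # set once any sampled line contains "  "
    checked = 0        # number of non-empty lines sampled so far
    for line in eachline(io)
        isempty(strip(line)) && continue
        checked += 1
        # keepempty=false collapses runs of spaces and ignores leading padding
        n = length(split(line, ' '; keepempty = false))
        saw_run |= occursin("  ", line)
        if checked == 1
            ncols = n
        elseif n != ncols
            return nothing   # inconsistent column count: not this format
        end
        checked >= nlines && break
    end
    return (saw_run && ncols > 1) ? ncols : nothing
end

# Padded, right-aligned columns like the test.txt example
padded = "   Column1        Column2\n         1        0.50000\n         2        2.00000\n"
detect_repeated_space(IOBuffer(padded))          # 2

# An ordinary comma-delimited file is rejected
detect_repeated_space(IOBuffer("a,b\n1,2\n"))    # nothing
```

This only covers the simple case. Column names or string values that themselves contain spaces would change the field count and make the check fail, so a real implementation would probably need to take quoting into account.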
A lot of the files I work with are delimited by repeated spaces (i.e., delim=' ', ignorerepeated=true). It would be great if the default auto-detection could handle this case, for spaces only. Such files are read correctly by the stdlib DelimitedFiles and also by Pandas.