JuliaData / JuliaDB.jl

Parallel analytical database in pure Julia
http://juliadb.org/

Difficulties for reading a file #122

Open jmathiasHts opened 6 years ago

jmathiasHts commented 6 years ago

Hello, I am a beginner in Julia. I am trying to import the large file of French establishments (the "sirene" open-data file: http://files.data.gouv.fr/sirene/sirene_201711_L_M.zip).

I used this code, or variations of it:

addprocs()
using JuliaDB
path = "F:/BD/labo/labo/siren.csv"
sirene = loadtable(path)

And I get errors. At first, I thought the file was too badly formed to be imported via loadtable:

- The encoding was WIN-1252.
- Strings were sometimes enclosed in quotes and sometimes not.
- The separator was ";" and not ",".
- The separator could appear inside quoted strings.
- Maybe the file was too big (...?).
- Maybe consecutive separators marking a missing field were misinterpreted, so I replaced ",," with ",NULL,".
- Maybe the non-response values were badly recognized, especially in numeric fields, so in the numeric variables I replaced ",NR," with ",NULL,".

So I applied a set of transformations to the initial file using Perl regular expressions and iconv. I then extracted the first 2,500 lines into a small file.
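One pitfall with the ",," replacement described above is worth illustrating: a single global substitution cannot match overlapping pairs, so a run of three separators is only partly filled. The sketch below is in plain Julia 1.x syntax (not the 0.6 syntax used in this thread); `naive_fill` and `fill_empty_fields` are hypothetical helper names, and the split-based variant is deliberately not quote-aware, so it is only safe on lines without quoted separators.

```julia
# Naive global replacement: matches cannot overlap, so in a run of three
# separators only the first pair is rewritten.
naive_fill(line) = replace(line, ",," => ",NULL,")

# Split-based variant (hypothetical helper): every empty field is filled,
# including runs of empty fields and empty fields at the start or end of
# the line. NOT quote-aware: a separator inside quotes is still split on.
function fill_empty_fields(line::AbstractString; sep::Char=',',
                           token::AbstractString="NULL")
    fields = split(line, sep)   # split keeps empty fields by default
    return join((isempty(f) ? token : f for f in fields), sep)
end

println(naive_fill("a,,,b"))         # prints a,NULL,,b
println(fill_empty_fields("a,,,b"))  # prints a,NULL,NULL,b
```

Whether this matters for the file in question is not known from the thread; it only shows why a regex-based fill may leave some empty fields behind.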

This extract can be downloaded here: https://www.justbeamit.com/u42je

I did not notice any major flaw when examining this extract in Excel; in particular, the number of fields in each line is the same and equal to 100.

With sirene=loadtable(path)

julia> sirene=loadtable(chemin)
Error parsing F:\BD\labo\labo\test.csv
ERROR: On worker 2:
previous rows had 98 fields but row 2 has 100
guesscolparsers at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:507

_csvread_internal#35 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:194

_csvread_internal at .\:0

32 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:92

open at .\iostream.jl:152

_csvread_f at .\:0

csvread#34 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:103

csvread at .\:0

_loadtable_serial#2 at C:\Users\jerom.julia\v0.6\JuliaDB\src\util.jl:88

_loadtable_serial at .\:0

217 at C:\Users\jerom.julia\v0.6\JuliaDB\src\io.jl:131

do_task at C:\Users\jerom.julia\v0.6\Dagger\src\compute.jl:319

106 at .\distributed\process_messages.jl:268 [inlined]

run_work_thunk at .\distributed\process_messages.jl:56

macro expansion at .\distributed\process_messages.jl:268 [inlined]

105 at .\event.jl:73

With sirene=loadtable(path,type_detect_rows=2500)

julia> sirene=loadtable(path,type_detect_rows=2500)
Error parsing F:\BD\labo\labo\test.csv
ERROR: On worker 2:
previous rows had 98 fields but row 2 has 100
guesscolparsers at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:507

_csvread_internal#35 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:194

_csvread_internal at .\:0

32 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:92

open at .\iostream.jl:152

_csvread_f at .\:0

csvread#34 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:103

csvread at .\:0

_loadtable_serial#2 at C:\Users\jerom.julia\v0.6\JuliaDB\src\util.jl:88

_loadtable_serial at .\:0

217 at C:\Users\jerom.julia\v0.6\JuliaDB\src\io.jl:131

do_task at C:\Users\jerom.julia\v0.6\Dagger\src\compute.jl:319

106 at .\distributed\process_messages.jl:268 [inlined]

run_work_thunk at .\distributed\process_messages.jl:56

macro expansion at .\distributed\process_messages.jl:268 [inlined]

105 at .\event.jl:73

Do you have any idea how to load this file correctly? Am I doing something wrong? I chose JuliaDB because of the size of the file to load (~8 GB, 11,000,000 lines and 100 variables).

Best regards

shashi commented 6 years ago

I think your file has 98 column headers, but 100 columns?
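One way to check this hypothesis without going through loadtable is to count the fields in the header line and in the data rows directly. The following is a minimal sketch in plain Julia 1.x (no packages); `count_fields` and `field_count_report` are hypothetical helper names, not part of JuliaDB or TextParse. It assumes "," as the separator, double quotes as the quote character, and no quoted field spanning several lines.

```julia
# Count the fields in one CSV line, ignoring separators inside double quotes.
function count_fields(line::AbstractString; sep::Char=',', quotechar::Char='"')
    n = 1
    inquote = false
    for c in line
        if c == quotechar
            # An escaped quote ("") toggles twice, so the state is preserved.
            inquote = !inquote
        elseif c == sep && !inquote
            n += 1
        end
    end
    return n
end

# Return the field count of the header line, plus a Dict mapping each
# distinct field count seen in the data rows to the number of such rows.
function field_count_report(io::IO; sep::Char=',', maxrows::Int=typemax(Int))
    header = count_fields(readline(io); sep=sep)
    counts = Dict{Int,Int}()
    for (i, line) in enumerate(eachline(io))
        i > maxrows && break
        n = count_fields(line; sep=sep)
        counts[n] = get(counts, n, 0) + 1
    end
    return header, counts
end

# Small in-memory example: a 2-field header followed by two 3-field rows.
header, counts = field_count_report(IOBuffer("a,b\n1,\"x,y\",3\n4,5,6\n"))
println("header fields: ", header)
println("row field counts: ", counts)
```

On a real file, the same report could be obtained with `open(io -> field_count_report(io), path)`. A header count that differs from the row counts would support the explanation above, while several distinct row counts would point to malformed rows instead.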

kirui93 commented 4 years ago

Hey @JeromeM75, how did you manage to load this data? Did you find a solution?