Closed wynnw closed 5 months ago
+1
I know hardly any Rust, but from my primitive attempts to troubleshoot, it seems like the `for batch in reader`
loop never iterates over anything, implying something is wrong with the reader initialization or usage. The schema generation loops over all the data fine, but it uses a different implementation there based on serde_json.
Ah, looks like we don't rewind the reader after reading the schema. I will fix it.
Works now.
$ csv2parquet data/simple.csv simple.parquet && duckdb -c "from simple.parquet"
┌───────┬─────────┐
│   a   │    b    │
│ int64 │ boolean │
├───────┼─────────┤
│    42 │ true    │
│    12 │ false   │
│     7 │ true    │
└───────┴─────────┘
Thanks for the fix. I can confirm I can generate parquet files using csv2parquet now.
For both the latest release (v0.18.0), installed via cargo, and a build from latest master (last commit June 1, 2024), I can't get csv2parquet to generate data in the parquet file. Here's a trivial example:
simple.csv:
a,b
1,a
2,b
3,c
4,d
Run:
csv2parquet simple.csv simple.parquet
results in an output file that has the schema, but no data. Running csv2parquet -n simple.csv simple.parquet
does autodetect a schema and print it out correctly. Using pqrs (installed via cargo) to inspect the file shows the schema with pqrs schema simple.parquet,
but there is no actual data in the parquet file. This same pattern happens with our real data, the large CSV files we were experimenting on. What am I doing wrong? Or is this a real bug in the release? I see the same thing when I check out the v0.17.0 release commit hash. This is all on amazonlinux2 (which I know is an older platform that is hard to support).
I also tried this with ubuntu 24.04 LTS and got the same behavior.