domoritz / arrow-tools

A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet
Apache License 2.0

csv2parquet failure #89

Closed wynnw closed 5 months ago

wynnw commented 5 months ago

With the latest release (v0.18.0) installed via cargo, and also when building from the latest master (last commit June 1, 2024), I can't get csv2parquet to generate data in the parquet file. Here's a trivial example:

simple.csv:

a,b
1,a
2,b
3,c
4,d

Running csv2parquet simple.csv simple.parquet produces an output file that has the schema but no data. Running csv2parquet -n simple.csv simple.parquet does autodetect a schema and prints it correctly. Inspecting the file with pqrs (installed via cargo) shows the schema via pqrs schema simple.parquet, but the parquet file contains no actual data. The same thing happens with the large real-world CSV files we were experimenting with.

What am I doing wrong? Or is this a real bug in the release? I see the same thing when I check out the v0.17.0 release commit hash. This is all on amazonlinux2 (which I know is an older platform that is hard to support).

I also tried this with ubuntu 24.04 LTS and got the same behavior.

chrisgeno commented 5 months ago

+1

wynnw commented 5 months ago

I know hardly any Rust, but from my primitive attempts to troubleshoot, it seems like the for batch in reader loop never iterates over anything, which implies something is wrong with how the reader is initialized or used. The schema generation does loop over all the data, but it has a different implementation that uses serde_json.

domoritz commented 5 months ago

Ah, looks like we don't rewind the reader after reading the schema. I will fix it.
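
To illustrate the failure mode described above, here is a minimal sketch using only the Rust standard library. It is not the arrow-tools code: infer_columns, count_rows, convert_without_rewind, and convert_with_rewind are hypothetical names, and an in-memory Cursor stands in for the input file. The first pass consumes the reader to the end, so a second pass over the same reader sees no data unless the reader is rewound first.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Seek};

/// Hypothetical stand-in for schema inference: scans the whole input and
/// returns the column names found in the header line.
fn infer_columns<R: Read>(input: &mut R) -> io::Result<Vec<String>> {
    let mut lines = BufReader::new(input).lines();
    let header = lines.next().transpose()?.unwrap_or_default();
    // Drain the remaining lines, as a type-inference pass over the data would.
    for line in lines {
        line?;
    }
    Ok(header.split(',').map(str::to_owned).collect())
}

/// Hypothetical stand-in for the conversion pass: counts the non-empty
/// data rows that follow the header.
fn count_rows<R: Read>(input: &mut R) -> io::Result<usize> {
    let mut count = 0;
    for line in BufReader::new(input).lines().skip(1) {
        if !line?.is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

/// Reuses the reader without rewinding: the second pass starts at end of
/// input, so it finds a schema but zero rows.
fn convert_without_rewind(csv: &str) -> io::Result<(Vec<String>, usize)> {
    let mut input = Cursor::new(csv.as_bytes());
    let columns = infer_columns(&mut input)?;
    let rows = count_rows(&mut input)?;
    Ok((columns, rows))
}

/// Rewinds the reader between the two passes, so the second pass starts
/// again from the first byte.
fn convert_with_rewind(csv: &str) -> io::Result<(Vec<String>, usize)> {
    let mut input = Cursor::new(csv.as_bytes());
    let columns = infer_columns(&mut input)?;
    input.rewind()?; // same effect as seeking to SeekFrom::Start(0)
    let rows = count_rows(&mut input)?;
    Ok((columns, rows))
}
```

Calling convert_without_rewind on the simple.csv contents from the report yields the two column names but a row count of zero, which matches the "schema but no data" symptom. convert_with_rewind on the same input counts all four data rows.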

domoritz commented 5 months ago

Works now.

$ csv2parquet data/simple.csv simple.parquet && duckdb -c "from simple.parquet"
┌───────┬─────────┐
│   a   │    b    │
│ int64 │ boolean │
├───────┼─────────┤
│    42 │ true    │
│    12 │ false   │
│     7 │ true    │
└───────┴─────────┘

wynnw commented 5 months ago

Thanks for the fix. I can confirm I can generate parquet files using csv2parquet now.