apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

tpch conversion is failing with ArrowError(CsvError) #732

Open zhzy0077 opened 1 year ago

zhzy0077 commented 1 year ago

Which issue does this PR close?

This is, to me, a utility change, so I didn't file a separate bug. Closes #.

Rationale for this change

benchmarks/src/bin/tpch.rs fails to convert TPC-H's tbl files to Parquet files because every line in a tbl file ends with a trailing '|', like this:

1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|ly. slyly ironi|

Arrow therefore sees one more field per line than the schema declares, concludes that the schema doesn't match the CSV, and fails with Error: ArrowError(CsvError("incorrect number of fields for line 1, expected 9 got more than 9"))

What changes are included in this PR?

Adds a nullable trailing field to the schema so that CSV parsing no longer fails.
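
To illustrate the idea, here is a minimal std-only sketch rather than the actual change to tpch.rs. It shows that splitting the sample line on '|' yields ten fields, the last one empty, and that appending one placeholder column to the nine `part` columns makes the counts agree. The names `split_fields`, `with_trailing_column`, and `__placeholder` are hypothetical and chosen only for this example; the real fix operates on the Arrow schema instead of plain strings.

```rust
/// Column names of the TPC-H `part` table (nine columns).
const PART_COLUMNS: [&str; 9] = [
    "p_partkey",
    "p_name",
    "p_mfgr",
    "p_brand",
    "p_type",
    "p_size",
    "p_container",
    "p_retailprice",
    "p_comment",
];

/// Split one `.tbl` line on the `|` delimiter.
/// A trailing `|` produces one extra, empty field at the end.
fn split_fields(line: &str) -> Vec<&str> {
    line.split('|').collect()
}

/// Hypothetical helper: append a placeholder column that absorbs the
/// empty trailing field, so the column count matches the field count.
/// In an Arrow schema this column would be declared nullable.
fn with_trailing_column(columns: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = columns.iter().map(|c| c.to_string()).collect();
    out.push("__placeholder".to_string());
    out
}
```

Calling `split_fields` on the sample line above and `with_trailing_column(&PART_COLUMNS)` gives two vectors of the same length. The placeholder column holds only empty values, so it can presumably be dropped again before the Parquet file is written.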

Are there any user-facing changes?

No

XiangpengHao commented 1 year ago

Strange that the same benchmark in DataFusion works fine. https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/tpch.rs#L69