Closed marklit closed 2 years ago
Hey @marklit! Thanks for the report, but this doesn't look like a file size issue.
This looks like a bug in the Parquet datasource, which is experimental right now. Based on my reading of the underlying library, it uses hand-written assembly for parts of the implementation in order to use AVX2 instructions, and it looks like those crash on your machine.
Please try building OctoSQL in the following way:
```bash
CGO_ENABLED=0 go build --tags purego
OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT * FROM trips.parquet LIMIT 10"
```
This should use a pure Go implementation of the aforementioned instructions and shouldn't crash.
Please let me know if that works.
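As a side note, you can check whether the machine (or VM) actually exposes AVX2 before deciding which build to use. This is a minimal Linux-only sketch that greps `/proc/cpuinfo` using only the standard library; it is not part of OctoSQL, just a quick diagnostic:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasAVX2 reports whether the kernel advertises the avx2 CPU flag.
// ok is false when /proc/cpuinfo could not be read (e.g. non-Linux).
func hasAVX2() (supported, ok bool) {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		return false, false
	}
	return strings.Contains(string(data), "avx2"), true
}

func main() {
	if supported, ok := hasAVX2(); ok {
		fmt.Printf("AVX2 supported: %v\n", supported)
	} else {
		fmt.Println("AVX2 supported: unknown (could not read /proc/cpuinfo)")
	}
}
```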
@marklit You should also be able to use the official release I've just created as I've added this build tag there as well.
You're right. The VM I had set up had AVX2 disabled for some reason. With your official build, I got 10 records back in 4 seconds.

I re-configured my VM to support AVX2, but I still got the above error when running `go run main.go`.

Running your official build again on Q1 from my benchmark suite brought back an "index out of range" error.
```
$ ./octosql "SELECT cab_type, count(*) FROM trips.parquet GROUP BY cab_type"
panic: runtime error: index out of range [825241904] with length 19994

goroutine 1 [running]:
github.com/segmentio/parquet-go.(*byteArrayDictionary).Index(0x15be5c0?, 0x276110?)
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/dictionary.go:86 +0xed
github.com/segmentio/parquet-go.(*indexedPageReader).ReadValues(0xc018185950, {0xc016f37e80, 0xaa, 0xc000075f20?})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/dictionary.go:338 +0x89
github.com/segmentio/parquet-go.(*optionalPageReader).ReadValues(0xc0181b33c0, {0xc016f37e80, 0xaa, 0xaa})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/page.go:382 +0x14a
github.com/segmentio/parquet-go.(*columnChunkReader).readValuesFromCurrentPage(0xc0001e1300)
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/column_chunk.go:135 +0x90
github.com/segmentio/parquet-go.(*columnChunkReader).readValues(0x32?)
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/column_chunk.go:115 +0x29
github.com/segmentio/parquet-go.columnReadRowFuncOfLeaf.func1({0x0?, 0x0, 0x0}, 0x0?, {0xc0001e0a00, 0x0?, 0x0?})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/column_chunk.go:326 +0xc5
github.com/segmentio/parquet-go.makeColumnReadRowFunc.func1({0x0?, 0x3?, 0x0?}, 0x0?, {0xc0001e0a00, 0x35, 0x35})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/schema.go:163 +0xa3
github.com/segmentio/parquet-go.(*rowGroupRowReader).ReadRow(0x0?, {0x0?, 0x0, 0x0?})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/row_group.go:306 +0xb7
github.com/segmentio/parquet-go.(*reader).ReadRow(0xc00012e0a0, {0x0?, 0x0, 0x0?})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/reader.go:276 +0xb1
github.com/segmentio/parquet-go.(*Reader).ReadRow(0xc00012e090, {0x0, 0x0, 0x0})
	/home/runner/go/pkg/mod/github.com/cube2222/parquet-go@v0.0.0-20220512155810-0e06eee50261/reader.go:221 +0x65
github.com/cube2222/octosql/datasources/parquet.(*DatasourceExecuting).Run(0xc0000f2fc0, {{0xf5f168?, 0xc0001e2dc0?}, 0x0?}, 0xc0000f30e0, 0x1?)
	/home/runner/work/octosql/octosql/datasources/parquet/execution.go:47 +0x512
github.com/cube2222/octosql/execution/nodes.(*SimpleGroupBy).Run(0xc000075f20, {{0xf5f168?, 0xc0001e2dc0?}, 0x0?}, 0xc0000f30b0, 0xc0000f3050)
	/home/runner/work/octosql/octosql/execution/nodes/simple_group_by.go:38 +0x228
github.com/cube2222/octosql/execution/nodes.(*Map).Run(0xc0000f2ff0, {{0xf5f168?, 0xc0001e2dc0?}, 0x0?}, 0xc0000f3080, 0x0?)
	/home/runner/work/octosql/octosql/execution/nodes/map.go:23 +0xfc
github.com/cube2222/octosql/execution/nodes.(*Limit).Run(0xc000031720, {{0xf5f168?, 0xc0001e2dc0?}, 0x0?}, 0xc0000f3020, 0xc00011c700?)
	/home/runner/work/octosql/octosql/execution/nodes/limit.go:34 +0x3a6
github.com/cube2222/octosql/outputs/batch.(*OutputPrinter).Run(0xc00011c700, {{0xf5f168?, 0xc0001e2dc0?}, 0x0?})
	/home/runner/work/octosql/octosql/outputs/batch/live_output.go:81 +0x396
github.com/cube2222/octosql/cmd.glob..func4(0x156ba00, {0xc00009e0b0, 0x1, 0x1?})
	/home/runner/work/octosql/octosql/cmd/root.go:463 +0x3653
github.com/spf13/cobra.(*Command).execute(0x156ba00, {0xc000030070, 0x1, 0x1})
	/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0x156ba00)
	/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:895
github.com/cube2222/octosql/cmd.Execute({0xf5f168?, 0xc0001e2dc0?})
	/home/runner/work/octosql/octosql/cmd/root.go:476 +0x53
main.main()
	/home/runner/work/octosql/octosql/main.go:24 +0xe8
```
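For what it's worth, the panic (`index out of range [825241904] with length 19994`) means the decoder produced a dictionary index far larger than the dictionary itself, which usually points at corrupted or mis-decoded index data. This is a minimal sketch (hypothetical types and names, not the actual parquet-go code) of the kind of bounds check that turns such a panic into a recoverable error:

```go
package main

import (
	"errors"
	"fmt"
)

// byteArrayDict is a hypothetical stand-in for a Parquet byte-array
// dictionary: values[i] holds the i-th dictionary entry.
type byteArrayDict struct {
	values [][]byte
}

var errIndexOutOfRange = errors.New("dictionary index out of range")

// lookup validates the index before dereferencing, so a corrupted
// index yields an error instead of a runtime panic.
func (d *byteArrayDict) lookup(i int32) ([]byte, error) {
	if i < 0 || int(i) >= len(d.values) {
		return nil, fmt.Errorf("%w: %d with length %d",
			errIndexOutOfRange, i, len(d.values))
	}
	return d.values[i], nil
}

func main() {
	dict := &byteArrayDict{values: [][]byte{[]byte("yellow"), []byte("green")}}
	if v, err := dict.lookup(1); err == nil {
		fmt.Printf("ok: %s\n", v)
	}
	if _, err := dict.lookup(825241904); err != nil {
		fmt.Println("err:", err)
	}
}
```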
@marklit Could you please describe exactly how you created the trips table in ClickHouse? That would help immensely with debugging this issue (and creating a sensible issue for the parquet library authors).
I imported the CSV files from my 1.1B taxi rides dataset into a Log engine table in ClickHouse.
In `clickhouse-client`:

```sql
CREATE TABLE trips (
    trip_id UInt32,
    vendor_id String,
    pickup_datetime DateTime,
    dropoff_datetime Nullable(DateTime),
    store_and_fwd_flag Nullable(FixedString(1)),
    rate_code_id Nullable(UInt8),
    pickup_longitude Nullable(Float64),
    pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64),
    dropoff_latitude Nullable(Float64),
    passenger_count Nullable(UInt8),
    trip_distance Nullable(Float64),
    fare_amount Nullable(Float32),
    extra Nullable(Float32),
    mta_tax Nullable(Float32),
    tip_amount Nullable(Float32),
    tolls_amount Nullable(Float32),
    ehail_fee Nullable(Float32),
    improvement_surcharge Nullable(Float32),
    total_amount Nullable(Float32),
    payment_type Nullable(String),
    trip_type Nullable(UInt8),
    pickup Nullable(String),
    dropoff Nullable(String),
    cab_type Nullable(String),
    precipitation Nullable(Int8),
    snow_depth Nullable(Int8),
    snowfall Nullable(Int8),
    max_temperature Nullable(Int8),
    min_temperature Nullable(Int8),
    average_wind_speed Nullable(Int8),
    pickup_nyct2010_gid Nullable(Int8),
    pickup_ctlabel Nullable(String),
    pickup_borocode Nullable(Int8),
    pickup_boroname Nullable(String),
    pickup_ct2010 Nullable(String),
    pickup_boroct2010 Nullable(String),
    pickup_cdeligibil Nullable(FixedString(1)),
    pickup_ntacode Nullable(String),
    pickup_ntaname Nullable(String),
    pickup_puma Nullable(String),
    dropoff_nyct2010_gid Nullable(UInt8),
    dropoff_ctlabel Nullable(String),
    dropoff_borocode Nullable(UInt8),
    dropoff_boroname Nullable(String),
    dropoff_ct2010 Nullable(String),
    dropoff_boroct2010 Nullable(String),
    dropoff_cdeligibil Nullable(String),
    dropoff_ntacode Nullable(String),
    dropoff_ntaname Nullable(String),
    dropoff_puma Nullable(String)
) ENGINE = Log;
```

```bash
for FILENAME in trips_x*.csv.gz; do
    gunzip -c "$FILENAME" | clickhouse-client --query="INSERT INTO trips FORMAT CSV"
done
```
I've tried running parquet-tools on the .pq file, but it exhausts my system's RAM. I'm not sure whether trying to make 100 GB+ Parquet files work with this system is a better path than trying to make multiple files in a single folder work. 10 GB+ Parquet files have a surprising number of problems with most tooling that supports the format.
Hey! Thanks for the details! I'll investigate them once I have some time.
On my Ubuntu 20 system with 16 GB of RAM and 2 TB of disk capacity, I installed Go.
I then cloned the master branch of OctoSQL from today.
I then dumped a 1.1B-row, 116 GB Snappy-compressed Parquet file from the latest version of ClickHouse.
When I ran the SQL below, I got the following error.