BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0

Implement more scale testing #1461

Open felipeblazing opened 3 years ago

felipeblazing commented 3 years ago

We want to test at various data sizes, but for the time being we are still assuming only one- and two-node execution. I think we can test on 1GB and 10GB datasets without running into too many scale issues, even on very complex queries.

For each of the backends and file formats we use for storing data, we need to upload 1GB and 10GB versions of the datasets (we already have these in S3 for Parquet, for example) and run a subset of queries against those files to make sure that we can still run queries at scale.
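As a rough reference, here is a minimal sketch of what one of these scale checks could look like from Python. The bucket name, dataset layout, and query are hypothetical placeholders, not the actual test assets; only `BlazingContext`, `bc.s3`, `bc.create_table`, and `bc.sql` are existing BlazingSQL API:

```python
from blazingsql import BlazingContext

bc = BlazingContext()

# Register the S3 bucket holding the scale datasets. The prefix, bucket
# name, and credentials below are placeholders.
bc.s3('scale_data', bucket_name='blazingsql-scale-datasets',
      access_key_id='<key>', secret_key='<secret>')

# Create a table from the 10GB Parquet copy of a dataset.
bc.create_table('lineitem', 's3://scale_data/10gb/lineitem/*.parquet')

# Run one of the subset queries; the result comes back as a cuDF DataFrame.
result = bc.sql('SELECT l_returnflag, COUNT(*) FROM lineitem GROUP BY l_returnflag')
print(result)
```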

See https://github.com/BlazingDB/blazingsql/issues/1460 for the various places where these datasets will need to be uploaded.

As a start, I would pick CSV and Parquet as the file formats that we make available.
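One practical difference between the two: CSV tables need header information supplied at registration time, whereas Parquet carries it in file metadata. A hedged example, with hypothetical paths (the `header` argument to `create_table` for CSV files is real BlazingSQL API, forwarded to the CSV reader):

```python
# Hypothetical dataset paths, assuming the `bc` context from above.
bc.create_table('orders_parquet', 's3://scale_data/1gb/orders/*.parquet')
bc.create_table('orders_csv', 's3://scale_data/1gb/orders/*.csv', header=0)
```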

In addition to this, someone needs to modify the e2e testing framework so that these scale tests can be run with the scale and file format specified as parameters.
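One possible shape for that change, sketched below: drive table registration from scale and format parameters. Everything here, including the environment variable names, dataset layout, and helper functions, is an assumption about how the framework could be extended, not existing e2e code:

```python
import os

from blazingsql import BlazingContext

# Hypothetical knobs; the e2e framework would read these however it
# reads its other configuration.
SCALE = os.environ.get('BLAZINGSQL_E2E_SCALE', '1gb')             # '1gb' or '10gb'
FILE_FORMAT = os.environ.get('BLAZINGSQL_E2E_FORMAT', 'parquet')  # 'parquet' or 'csv'

def dataset_path(table_name):
    # Assumed layout: one directory per table, per scale, per format.
    return f's3://scale_data/{SCALE}/{table_name}/*.{FILE_FORMAT}'

def register_tables(bc, table_names):
    # CSV needs the header handling that Parquet gets from its metadata.
    extra = {'header': 0} if FILE_FORMAT == 'csv' else {}
    for name in table_names:
        bc.create_table(name, dataset_path(name), **extra)

bc = BlazingContext()
register_tables(bc, ['lineitem', 'orders'])
```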