hendrikmakait opened this issue 9 months ago
Installing polars-u64-idx removed the error, but now it's running OOM even on query 1: https://cloud.coiled.io/clusters/383584/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&filterPattern=
I'll abort further tests.
@hendrikmakait the u64 error is expected. If you work on more than 4.2 billion rows, you need that version of Polars. The new string type introduced a bug in our out-of-core aggregation. Can you retry with Polars 0.20.8? I expect it to release this afternoon.
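For context, a rough sketch of the limit behind that error: the 2**32 bound is the 32-bit row index of the default wheel, and the scale-1000 lineitem row count below is an approximate assumption for illustration, not a measured figure.

```python
# The default `polars` wheel indexes rows with 32-bit integers, so a single
# frame (or aggregation) tops out at 2**32 rows; `polars-u64-idx` is the same
# library built with 64-bit indices.
U32_MAX_ROWS = 2**32  # 4,294,967,296 -- the "~4.2 billion rows" mentioned above

# Rough TPC-H lineitem size at scale factor 1000 (assumption for illustration).
APPROX_LINEITEM_ROWS_SF1000 = 6_000_000_000

print(f"u32 row-index limit: {U32_MAX_ROWS:,}")
print(f"needs polars-u64-idx: {APPROX_LINEITEM_ROWS_SF1000 > U32_MAX_ROWS}")
```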
@ritchie46: Thanks for the additional info, I'll rerun the suite on 0.20.8. What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx, or would that significantly skew the results?
Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run on the streaming engine with the pyarrow dataset. The query should start with scan_parquet.
So this

```python
# assumes: import polars as pl; from pyarrow.dataset import dataset
def read_data(filename):
    pyarrow_dataset = dataset(filename, format="parquet")
    return pl.scan_pyarrow_dataset(pyarrow_dataset)
```

should be

```python
def read_data(filename):
    return pl.scan_parquet(filename + "/*")
```

> What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx, or would that significantly skew the results?

The default binary is optimized for smaller datasets. It is slower if you start from disk. I believe you are benchmarking from S3, so I think the difference will be less. But you'll have to try it.

> Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run on the streaming engine with the pyarrow dataset. The query should start with scan_parquet.

https://github.com/coiled/benchmarks/pull/1394 switches read_data to scan_parquet; for S3, credentials are passed explicitly via storage_options:

```python
def read_data(filename):
    if filename.startswith("s3://"):
        import boto3

        session = boto3.session.Session()
        credentials = session.get_credentials()
        return pl.scan_parquet(
            filename,
            storage_options={
                "aws_access_key_id": credentials.access_key,
                "aws_secret_access_key": credentials.secret_key,
                "region": "us-east-2",
            },
        )
    else:
        return pl.scan_parquet(filename + "/*")
```

Can Polars figure out storage options automatically now?
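For reference, a minimal sketch of how a scan_parquet-based query is meant to hit the streaming engine; the path and the query below are illustrative stand-ins, not the benchmark code.

```python
import polars as pl

# scan_parquet builds a lazy scan that the streaming engine can execute out of
# core; scan_pyarrow_dataset cannot be streamed.
lf = pl.scan_parquet("lineitem/*.parquet")  # illustrative local path

result = (
    lf.group_by("l_returnflag", "l_linestatus")
    .agg(pl.sum("l_quantity"))
    .collect(streaming=True)  # execute on the streaming engine
)
print(result)
```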
It looks like we fixed the OOM problem, but now Polars appears to be "stuck": https://cloud.coiled.io/clusters/385166/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&cluster+network_variation=Cluster+Total+Rate
To summarize a few findings: it's not stuck per se, but it didn't show much hardware utilization and wasn't done after 30 minutes, so I aborted the test. Looking at the hardware metrics, CPU utilization sits at ~400% for most of the time, suggesting that it's still doing something, but not a lot. Looking at a run at scale 100, CPU is at 100%-200% for most of the time, so maybe our configuration is off?
Scale 100 cluster: https://cloud.coiled.io/clusters/385189/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics
Turns out, if you use the deprecated pl.count it will block streaming mode. It seems to give us a proper output from explain if we use pl.len; that was very surprising.
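A small sketch of that check, with an illustrative lazy frame; pl.len() replaces the deprecated pl.count(), and explain(streaming=True) shows whether the plan actually runs on the streaming engine.

```python
import polars as pl

lf = pl.scan_parquet("lineitem/*.parquet")  # illustrative path

# Using the deprecated pl.count() blocked streaming in our runs; pl.len()
# yields a plan that explain() reports as streaming.
query = lf.group_by("l_returnflag").agg(pl.len())

# Inspect the plan before collecting; streaming sections are marked as such.
print(query.explain(streaming=True))
result = query.collect(streaming=True)
```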
Switching to scan_parquet brings its own set of problems: https://github.com/coiled/benchmarks/issues/1396
At scale 1000, Polars fails with

```
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
```

For now, I'll try manually installing polars-u64-idx and re-running the tests. I'll update this issue with my findings.

Cluster: https://cloud.coiled.io/clusters/383561/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Traceback: