coiled / benchmarks


[TPC-H] Polars does not run at scale 1000 #1389

Open hendrikmakait opened 9 months ago

hendrikmakait commented 9 months ago

At scale 1000, Polars fails with pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

For now, I'll try manually installing polars-u64-idx and re-running the tests. I'll update this issue with my findings.

Cluster: https://cloud.coiled.io/clusters/383561/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Traceback:

2024-02-13 19:04:05.9950
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-3500c860159128597d97e62786e99d95
Function:  _run
args:      (<function test_query_1.<locals>._ at 0x7fe99433e520>)
kwargs:    {}
Exception: 'PanicException("polars\' maximum length reached. Consider installing \'polars-u64-idx\'.: TryFromIntError(())")'
2024-02-13 19:04:05.9360
scheduler

distributed.core - INFO - Event loop was unresponsive in Worker for 33.25s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2024-02-13 19:04:05.9350
scheduler

distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 374.80 GiB -- Worker memory limit: 393.79 GiB
2024-02-13 19:04:02.0430
scheduler

pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
2024-02-13 19:04:02.0250
scheduler

Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_scan.py", line 28, in _execute_from_rust
    return function(with_columns, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/io/pyarrow_dataset/anonymous_scan.py", line 105, in _scan_pyarrow_dataset_impl
    return from_arrow(ds.to_table(**common_params))  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/convert.py", line 594, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/dataframe/frame.py", line 599, in _from_arrow
    arrow_to_pydf(
  File "/opt/coiled/env/lib/python3.11/site-packages/polars/utils/_construction.py", line 1546, in arrow_to_pydf
    pydf = pydf.rechunk()
           ^^^^^^^^^^^^^^
2024-02-13 19:04:02.0240
scheduler

--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
2024-02-13 19:04:00.4060
scheduler

thread '<unnamed>' panicked at crates/polars-core/src/chunked_array/ops/chunkops.rs:83:62:
polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
hendrikmakait commented 9 months ago

Installing polars-u64-idx removed the error, but now it runs out of memory (OOM) even on query 1: https://cloud.coiled.io/clusters/383584/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&filterPattern=

I'll abort further tests.

ritchie46 commented 9 months ago

@hendrikmakait the u64 error is expected: if you work on more than 4.2 billion rows, you need that version of Polars. The new string type introduced a bug in our out-of-core aggregation. Can you retry with Polars 0.20.8? I expect it to be released this afternoon.
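
For context, a rough back-of-the-envelope sketch of why scale 1000 trips this limit. The lineitem row count below is an approximation, not a figure from this thread, and the pip commands in the comments are just the usual way to swap in the u64 build:

# The default Polars wheel indexes rows with u32, capping a frame at 2**32 - 1
# rows. Swapping in the 64-bit-index build means replacing the package:
#   pip uninstall -y polars
#   pip install polars-u64-idx
U32_ROW_LIMIT = 2**32 - 1                # ~4.29 billion rows
LINEITEM_ROWS_SF1000 = 6_000_000_000     # approximate: ~6M rows per scale factor

print(f"u32 row limit:          {U32_ROW_LIMIT:,}")
print(f"lineitem rows @ SF1000: {LINEITEM_ROWS_SF1000:,}  (exceeds the limit)")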

hendrikmakait commented 9 months ago

@ritchie46: Thanks for the additional info, I'll rerun the suite on 0.20.8. What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx, or would that significantly skew the results?

hendrikmakait commented 9 months ago

Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

ritchie46 commented 9 months ago

Polars still runs OOM on query 1, even with polars-u64-idx=0.20.8: https://cloud.coiled.io/clusters/384122/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run on the streaming engine with a pyarrow dataset; the query should start with scan_parquet.

So this

def read_data(filename):
    pyarrow_dataset = dataset(filename, format="parquet")
    return pl.scan_pyarrow_dataset(pyarrow_dataset)
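    # NOTE: the early return above makes the s3 / scan_parquet handling below unreachable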

    if filename.startswith("s3://"):
        import boto3

        session = boto3.session.Session()
        credentials = session.get_credentials()
        return pl.scan_parquet(
            filename,
            storage_options={
                "aws_access_key_id": credentials.access_key,
                "aws_secret_access_key": credentials.secret_key,
                "region": "us-east-2",
            },
        )
    else:
        return pl.scan_parquet(filename + "/*")

should be

def read_data(filename):
    return pl.scan_parquet(filename + "/*")

What's the performance difference between polars and polars-u64-idx, i.e., would it be fine to run all scales using polars-u64-idx

The default binary is optimized for smaller datasets; the u64 build is slower if you start from disk. I believe you are benchmarking from S3, though, so I think the difference will be less pronounced. But you'll have to try it.

phofl commented 9 months ago

Yeah, I see the query starts from a pyarrow_dataset. Polars cannot run on the streaming engine with a pyarrow dataset; the query should start with scan_parquet.

https://github.com/coiled/benchmarks/pull/1394

Can Polars figure out storage options automatically now?
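
For reference, a minimal sketch of what the two variants of the scan could look like. The bucket path is hypothetical, and whether credentials are picked up from the default AWS chain is an assumption that depends on the Polars version:

import polars as pl

# Explicit storage options, as in the snippet earlier in the thread:
lf = pl.scan_parquet(
    "s3://hypothetical-bucket/tpch/scale-1000/lineitem",  # hypothetical path
    storage_options={"region": "us-east-2"},              # region as used above
)

# If Polars' object-store backend can resolve credentials from the standard AWS
# environment/credential chain (an assumption, version-dependent), the
# storage_options argument may be dropped entirely:
lf_auto = pl.scan_parquet("s3://hypothetical-bucket/tpch/scale-1000/lineitem")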

hendrikmakait commented 9 months ago

It looks like we fixed the OOM problem, but now Polars appears to be "stuck": https://cloud.coiled.io/clusters/385166/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics&cluster+network_variation=Cluster+Total+Rate

hendrikmakait commented 9 months ago

To summarize a few findings: it's not stuck per se, but it didn't show much hardware utilization and wasn't done after 30 minutes, so I aborted the test. Looking at the hardware metrics, CPU utilization sits at ~400% for most of the time, suggesting that it's still doing something, just not much. Looking at a run at scale 100, CPU sits at 100%-200% for most of the time, so maybe our configuration is off?

Scale 100 cluster: https://cloud.coiled.io/clusters/385189/information?viewedAccount=%22dask-benchmarks%22&tab=Metrics
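
One configuration knob worth ruling out when CPU sits at a few hundred percent on large workers is the size of Polars' thread pool. A minimal check, assuming the POLARS_MAX_THREADS environment variable and 32-vCPU workers; the introspection function name differs between releases:

import os

# Must be set before polars is imported; "32" assumes 32-vCPU workers.
os.environ.setdefault("POLARS_MAX_THREADS", "32")

import polars as pl

# thread_pool_size() in recent Polars, threadpool_size() in older releases.
pool_size_fn = getattr(pl, "thread_pool_size", None) or getattr(pl, "threadpool_size")
print(pool_size_fn())  # should roughly match the machine's core count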

phofl commented 9 months ago

It turns out that using the deprecated pl.count blocks the streaming mode. If we use pl.len instead, explain seems to give us a proper (streaming) plan, which was very surprising.

https://github.com/coiled/benchmarks/pull/1395
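
A minimal sketch of that check, using a stand-in LazyFrame rather than the TPC-H query:

# Compare the query plans for the deprecated pl.count() vs. pl.len().
# With explain(streaming=True), a plan that can run on the streaming engine
# shows a STREAMING section; the pl.count() variant reportedly does not.
import polars as pl

lf = pl.LazyFrame({"l_quantity": [1.0, 2.0, 3.0], "l_returnflag": ["A", "B", "A"]})

agg_old = lf.group_by("l_returnflag").agg(pl.count())  # deprecated expression
agg_new = lf.group_by("l_returnflag").agg(pl.len())

print(agg_old.explain(streaming=True))
print(agg_new.explain(streaming=True))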

hendrikmakait commented 9 months ago

Switching to scan_parquet brings its own set of problems: https://github.com/coiled/benchmarks/issues/1396