**Open** · andygrove opened 1 week ago
I tried running locally rather than in k8s, using `ray.init()` to create the cluster. The issue is that we are using too much object store memory. For TPC-H q2 @ 100 GB, it consumed all the memory on my workstation (128 GB) and then crashed. I tried limiting object store memory with `ray.init(num_cpus=concurrency, object_store_memory=512 * 1024 * 1024)`, and it ran longer, but it spilled huge amounts of data to disk and took an unreasonable amount of time.
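For reference, a minimal sketch of capping the object store at startup; the `concurrency` value here is an assumption, not the benchmark's actual setting:

```python
# Sketch: cap Ray's object store so it cannot consume the whole
# machine; 512 MiB matches the limit tried above.

def mib(n: int) -> int:
    """Convert mebibytes to the byte count ray.init expects."""
    return n * 1024 * 1024


if __name__ == "__main__":
    import ray

    concurrency = 8  # assumed value; the benchmark sets its own

    ray.init(
        num_cpus=concurrency,
        object_store_memory=mib(512),  # 512 MiB object store cap
    )
```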
Here is an example where it is spilling a huge amount of data:

```
(raylet) Spilled 35419 MiB, 1062 objects, write throughput 1534 MiB/s.
```
Root cause is https://github.com/apache/datafusion-ray/issues/46
I cannot get benchmarks running in k8s. I suspect that too many tasks are being scheduled in parallel.
I added resource constraints in the code:
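(The actual constraints are not reproduced above. As a hypothetical sketch of the general approach, one way to bound parallelism in Ray is to have each task reserve a full CPU, so at most `num_cpus` tasks run at once; the names and values below are assumptions, not the real code.)

```python
# Hypothetical sketch: reserving one CPU per stage task bounds how
# many tasks Ray schedules concurrently.

CONCURRENCY = 4  # assumed value; the real benchmark passes its own


def expected_parallelism(total_cpus: int, cpus_per_task: int) -> int:
    """Upper bound on tasks Ray runs concurrently when each task
    reserves `cpus_per_task` CPUs."""
    return total_cpus // cpus_per_task


if __name__ == "__main__":
    import ray

    ray.init(num_cpus=CONCURRENCY)

    # Each task reserves one full CPU, so at most CONCURRENCY tasks
    # execute at the same time; the rest queue in the scheduler.
    @ray.remote(num_cpus=1)
    def run_stage(stage_id: int) -> int:
        return stage_id

    print(ray.get([run_stage.remote(i) for i in range(8)]))
```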
I am running the benchmark with
My cluster definition is:
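(The actual cluster definition is not reproduced above. A hypothetical KubeRay `RayCluster` sketch with explicit resource limits is shown below; the image tag, replica count, and sizes are all assumptions.)

```yaml
# Hypothetical sketch, not the actual cluster definition.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: datafusion-ray-bench
spec:
  headGroupSpec:
    rayStartParams:
      object-store-memory: "536870912"  # 512 MiB
    template:
      spec:
        containers:
          - name: ray-head
            image: datafusion-ray:latest  # assumed image tag
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
  workerGroupSpecs:
    - groupName: workers
      replicas: 2
      rayStartParams:
        object-store-memory: "536870912"
      template:
        spec:
          containers:
            - name: ray-worker
              image: datafusion-ray:latest  # assumed image tag
              resources:
                limits:
                  cpu: "4"
                  memory: 16Gi
```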
I build my image with this Dockerfile, which extends the datafusion-ray image built from the repo.
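(The actual Dockerfile is not reproduced above. A hypothetical two-step sketch of extending the base image follows; the base tag and copied paths are assumptions.)

```dockerfile
# Hypothetical sketch, not the actual Dockerfile.
FROM datafusion-ray:latest

# Layer the benchmark driver scripts on top of the base image.
COPY benchmarks/ /opt/benchmarks/
WORKDIR /opt/benchmarks
```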