coiled / benchmarks


Difficulty generating local data #1496

Closed: mrocklin closed this 6 months ago

mrocklin commented 6 months ago
(benchmarks) benchmarks:~$ python tests/tpch/generate_data.py --scale 10
Scale: 10, Path: tpch-data/scale-10-strict, Partition Size: 128 MiB
Generating TPC-H data
Generating TPC-H data
2024-03-22 12:43:41,696 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
2024-03-22 12:43:42,050 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
2024-03-22 12:43:42,051 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
2024-03-22 12:43:42,067 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
Generating TPC-H data
2024-03-22 12:43:42,391 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
Generating TPC-H data
2024-03-22 12:43:42,786 - distributed.nanny - WARNING - Restarting worker
2024-03-22 12:43:42,798 - distributed.nanny - WARNING - Restarting worker
Generating TPC-H data
Generating TPC-H data
2024-03-22 12:43:43,129 - distributed.scheduler - ERROR - Task _tpch_data_gen-38c7b57d4862ed3c06b5d86b41e48aab marked as failed because 4 workers died while trying to run it
2024-03-22 12:43:43,131 - distributed.nanny - WARNING - Restarting worker
2024-03-22 12:43:43,558 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('_tpch_data_gen-ca2a7818e09d1620d908fe93c8cc8d27')" coro=<Worker.execute() done, defined at /Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-03-22 12:43:43,558 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('_tpch_data_gen-563c42036a6cdb159368164534049814')" coro=<Worker.execute() done, defined at /Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
2024-03-22 12:43:46,758 - distributed.nanny - WARNING - Worker process still alive after 3.1999995422363288 seconds, killing

Traceback (most recent call last):
  File "/Users/mrocklin/workspace/benchmarks/tests/tpch/generate_data.py", line 80, in generate
    tpch_from_client(client, **kwargs)
  File "/Users/mrocklin/workspace/benchmarks/tests/tpch/generate_data.py", line 93, in tpch_from_client
    small_tables = client.gather(jobs)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/client.py", line 2372, in gather
    return self.sync(
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task '_tpch_data_gen-38c7b57d4862ed3c06b5d86b41e48aab' on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:52238. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mrocklin/workspace/benchmarks/tests/tpch/generate_data.py", line 342, in <module>
    main()
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mrocklin/workspace/benchmarks/tests/tpch/generate_data.py", line 338, in main
    generate(scale, partition_size, path, relaxed_schema, compression)
  File "/Users/mrocklin/workspace/benchmarks/tests/tpch/generate_data.py", line 85, in generate
    time.sleep(5)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/client.py", line 1517, in __exit__
    self.close()
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/client.py", line 1775, in close
    sync(self.loop, self._close, fast=True, callback_timeout=timeout)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/utils.py", line 428, in sync
    raise TimeoutError(f"timed out after {timeout} s.")
asyncio.exceptions.TimeoutError: timed out after 60 s.
2024-03-22 12:44:43,159 - distributed.client - ERROR -
Traceback (most recent call last):
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/asyncio/locks.py", line 226, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/deploy/spec.py", line 359, in _correct_state_internal
    await asyncio.gather(*tasks)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/client.py", line 1722, in _close
    await self.cluster.close()
  File "/Users/mrocklin/mambaforge/envs/benchmarks/lib/python3.9/site-packages/distributed/deploy/spec.py", line 448, in _close
    await self._correct_state()
asyncio.exceptions.CancelledError

Not an urgent issue, but somewhat concerning.

This is after running the installation as in #1493 on a MacBook Pro.
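As the KilledWorker message suggests, pulling the worker logs is the natural next step. With a client still attached to the cluster, something like this works (a minimal sketch; the scheduler address is illustrative, use whatever your local cluster actually reports):

```python
from distributed import Client

# Attach to the still-running local cluster; the address is illustrative
client = Client("tcp://127.0.0.1:8786")

# Dump recent log lines from every worker
for worker, logs in client.get_worker_logs().items():
    print(worker)
    print(logs)
```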

fjetter commented 6 months ago

I did some digging into this. A couple of observations:

  1. The files (by default) are actually a little larger than what the "Partition Size" suggests; I have files that are 200-250 MiB each.
  2. This translates to about 2-2.5 GiB of in-memory data per file.
  3. I'm running on an M1 with 8 cores and 16 GiB of RAM, leaving just 2 GiB of RAM per thread.
  4. We are employing a rather aggressive partition compaction here that aims to fuse partitions based on how much data we cut off via projection / filtering. Beyond the actual memory usage, I saw this pattern cause us to fuse so dramatically that we don't necessarily even utilize all cores.

So, we're already running on relatively large files and are merging those partitions way too aggressively; a rough back-of-the-envelope calculation follows below.
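As a sanity check, here is the arithmetic implied by the observations above (a rough sketch; the ~10x on-disk to in-memory expansion factor is inferred from points 1 and 2, and 225 MiB is just a representative file size):

```python
# Back-of-the-envelope memory math for the setup described above.
file_size_mib = 225   # representative on-disk partition size (~200-250 MiB)
expansion = 10        # inferred on-disk -> in-memory expansion factor
total_ram_gib = 16    # M1 laptop RAM
threads = 8           # one thread per core

in_memory_gib = file_size_mib * expansion / 1024
ram_per_thread_gib = total_ram_gib / threads

print(f"~{in_memory_gib:.1f} GiB per partition in memory vs "
      f"~{ram_per_thread_gib:.1f} GiB of RAM per thread")
# ~2.2 GiB per partition vs ~2.0 GiB per thread: workers sit right at the
# limit, which matches the nanny restarts in the log above
```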

milesgranger commented 6 months ago

On top of this, I also had to set n_workers to 6 or so to generate scale 100. I didn't look further into it, as I had other goals in mind at the time.

Edit: that's also with 1 thread per worker.
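For reference, that workaround looks something like this (a minimal sketch; the numbers are just the ones mentioned above):

```python
from distributed import Client, LocalCluster

# Fewer workers, each single-threaded, so every worker process gets a
# larger share of RAM and only ever runs one task at a time
cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster)
```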

fjetter commented 6 months ago

Ah sorry, forget my earlier comment; I was referring to the actual run of the TPC-H benchmark. The failures during generation originate from running DuckDB on multiple threads, which triggers a segfault. Just running with one thread does the trick. This is fixed in https://github.com/coiled/benchmarks/pull/1490.
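For context, capping DuckDB at a single thread is a one-line setting (a generic sketch of the idea, not the exact change in the PR):

```python
import duckdb

con = duckdb.connect()
# Restrict DuckDB to one thread to avoid the multi-threading segfault
con.execute("SET threads TO 1;")
```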