coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License
29 stars 7 forks source link

Remove `SerializebleLock` in `to_snowflake` #39

Closed phobson closed 1 year ago

phobson commented 1 year ago

Fixes #29

Just testing things out for now

phobson commented 1 year ago

With the larger dataframe, the partition sizes when we read back from snowflake are coming back a bit erratically sized. This is without any changes to core.py

Locally, I get:


(Pdb) from dask.utils import format_bytes
(Pdb) partition_sizes.map(format_bytes).to_frame("result").assign(expected="2 MiB")
        result expected
0     1.60 MiB    2 MiB
1     1.71 MiB    2 MiB
2     2.18 MiB    2 MiB
3     3.51 MiB    2 MiB
4     1.60 MiB    2 MiB
5     1.71 MiB    2 MiB
6     2.18 MiB    2 MiB
7     4.36 MiB    2 MiB
8     1.39 MiB    2 MiB
9   875.77 kiB    2 MiB
10    1.71 MiB    2 MiB
11    2.18 MiB    2 MiB
12    3.72 MiB    2 MiB
13    1.60 MiB    2 MiB
14    1.70 MiB    2 MiB
15    2.18 MiB    2 MiB
16    1.69 MiB    2 MiB
17    1.28 MiB    2 MiB
18    1.70 MiB    2 MiB
19    2.18 MiB    2 MiB
20    4.37 MiB    2 MiB
21    1.82 MiB    2 MiB
22    1.70 MiB    2 MiB
23    2.18 MiB    2 MiB
24    3.30 MiB    2 MiB
25    1.60 MiB    2 MiB
26    1.71 MiB    2 MiB
27    2.18 MiB    2 MiB
28    4.37 MiB    2 MiB
29    2.79 MiB    2 MiB
30    1.60 MiB    2 MiB
31    1.71 MiB    2 MiB
32    2.18 MiB    2 MiB
33    4.37 MiB    2 MiB
34    1.29 MiB    2 MiB
phobson commented 1 year ago

Push a commit to remove the SerializableLock. The difference in test durations is substantial:

Without 399da1465a208d5ec4459059461ecd4ed0aa104e (status quo):

============================================= slowest 10 durations =============================================
708.74s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[twelve months]
664.69s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months]
76.58s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[one month]
63.95s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[one month]
9.74s call     dask_snowflake/tests/test_core.py::test_application_id_config
8.94s call     dask_snowflake/tests/test_core.py::test_application_id_default
8.27s call     dask_snowflake/tests/test_core.py::test_application_id_explicit
7.54s call     dask_snowflake/tests/test_core.py::test_write_read_roundtrip
7.01s call     dask_snowflake/tests/test_core.py::test_execute_params
6.94s call     dask_snowflake/tests/test_core.py::test_read_empty_result
=========================================== short test summary info ============================================
XFAIL dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months] - inconsistent partition sizing
=========================== 11 passed, 1 xfailed, 46 warnings in 1601.74s (0:26:41) ============================

With 399da1465a208d5ec4459059461ecd4ed0aa104e (this PR):

============================================= slowest 10 durations =============================================
175.20s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[twelve months]
115.40s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months]
27.99s call     dask_snowflake/tests/test_core.py::test_result_batching_nparitions_and_equality[one month]
26.80s call     dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[one month]
7.11s call     dask_snowflake/tests/test_core.py::test_application_id_explicit
7.11s call     dask_snowflake/tests/test_core.py::test_write_read_roundtrip
7.11s call     dask_snowflake/tests/test_core.py::test_application_id_config_on_cluster
6.97s call     dask_snowflake/tests/test_core.py::test_execute_params
6.93s call     dask_snowflake/tests/test_core.py::test_arrow_options
6.71s call     dask_snowflake/tests/test_core.py::test_application_id_default
=========================================== short test summary info ============================================
XFAIL dask_snowflake/tests/test_core.py::test_result_batching_partition_sizes[twelve months] - inconsistent partition sizing
============================ 11 passed, 1 xfailed, 46 warnings in 425.89s (0:07:05) ============================
phobson commented 1 year ago

Similar performance boost in CI:

With Lock: https://github.com/coiled/dask-snowflake/actions/runs/3535816356/jobs/5934242005#step:5:70 Without Lock: https://github.com/coiled/dask-snowflake/actions/runs/3536012799/jobs/5934652801#step:5:70

dchudz commented 1 year ago

@jrbourbeau is this something we should revisit?

fjetter commented 1 year ago

If the threading problem is resolved, I'd recommend removing the parallel=1 kwarg as well. Since everything is IO bound we should be able to insert with many more threads than we have CPUs.

fjetter commented 1 year ago

@jrbourbeau are you still on this?

phobson commented 1 year ago

Heyyy look that it. Finally merged! 🎉

mrocklin commented 1 year ago

You just gotta plant a lotta seeds 🙂