coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License

specifying a partition size breaks down with larger datasets #40

Open phobson opened 1 year ago

phobson commented 1 year ago

In #39, one of the tests I added uses a 12-month dataset instead of a 1-month dataset. When we try to fetch the data in 2 MiB partitions, we're generally successful with the smaller dataset. But with the larger dataset, the partition sizes are consistently off in both directions (some too small, some too large).

Copy/pasted from here:

N.B. -- the check we perform compares actual partition sizes against 2x the requested partition size.

(Pdb) from dask.utils import format_bytes
(Pdb) partition_sizes.map(format_bytes).to_frame("result").assign(expected="2 MiB")
        result expected
0     1.60 MiB    2 MiB
1     1.71 MiB    2 MiB
2     2.18 MiB    2 MiB
3     3.51 MiB    2 MiB
4     1.60 MiB    2 MiB
5     1.71 MiB    2 MiB
6     2.18 MiB    2 MiB
7     4.36 MiB    2 MiB
8     1.39 MiB    2 MiB
9   875.77 kiB    2 MiB
10    1.71 MiB    2 MiB
11    2.18 MiB    2 MiB
12    3.72 MiB    2 MiB
13    1.60 MiB    2 MiB
14    1.70 MiB    2 MiB
15    2.18 MiB    2 MiB
16    1.69 MiB    2 MiB
17    1.28 MiB    2 MiB
18    1.70 MiB    2 MiB
19    2.18 MiB    2 MiB
20    4.37 MiB    2 MiB
21    1.82 MiB    2 MiB
22    1.70 MiB    2 MiB
23    2.18 MiB    2 MiB
24    3.30 MiB    2 MiB
25    1.60 MiB    2 MiB
26    1.71 MiB    2 MiB
27    2.18 MiB    2 MiB
28    4.37 MiB    2 MiB
29    2.79 MiB    2 MiB
30    1.60 MiB    2 MiB
31    1.71 MiB    2 MiB
32    2.18 MiB    2 MiB
33    4.37 MiB    2 MiB
34    1.29 MiB    2 MiB
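For reference, the check above can be sketched roughly as follows. This is a stand-in reproduction, not the actual test: the data and the chunking are placeholders (the real issue fetches a 12-month dataset through dask-snowflake), and only `format_bytes`/`parse_bytes` and the `partition_sizes` report mirror the session above.

```python
import numpy as np
import pandas as pd
from dask.utils import format_bytes, parse_bytes

# Compare each partition's actual in-memory size against 2x the
# requested partition size, as described above.
requested = parse_bytes("2 MiB")

# Stand-in data split into chunks; these mimic dask partitions.
df = pd.DataFrame({"x": np.arange(1_000_000), "y": np.arange(1_000_000.0)})
chunks = np.array_split(df, 8)

# Deep memory usage of each chunk, one entry per "partition".
partition_sizes = pd.Series(
    [chunk.memory_usage(deep=True).sum() for chunk in chunks]
)

# Same human-readable report as the debugger session: actual vs expected.
report = (
    partition_sizes.map(format_bytes)
    .to_frame("result")
    .assign(expected="2 MiB")
)

# Partitions larger than 2x the request fail the check.
oversized = partition_sizes > 2 * requested
```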

I'll start a PR to investigate this further.