coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License

Fix/investigate inconsistent partition size in fetching larger datasets #41

Open phobson opened 1 year ago

phobson commented 1 year ago

Looks into #40

phobson commented 1 year ago

@jrbourbeau I left the xfail mark on the test parameter, so it "successfully failed" here. Sorry for that confusion.

I'm trying to (roughly) bisect where things fall apart. I'll push that as a parameter to the test and remove the xfail (for now).

phobson commented 1 year ago

@jrbourbeau (cc @hayesgb)

I figured out what's going on here. Since we have so little control over the sizes of the batches that Snowflake returns, some of the individual batches are larger than the requested partition size. So the answer here would be to split up those oversized batches.

Inside PDB during a failing test:

(Pdb) type(batch)
<class 'snowflake.connector.result_batch.ArrowResultBatch'>
(Pdb) print(f"{batch.rowcount=} but {target=}")
batch.rowcount=87977 but target=80565
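
For reference, here's a rough sketch (not the library's code) of how one could reproduce that inspection outside of the test: iterate over the result batches the Snowflake connector hands back and compare each batch's row count against a target. The connection details, table name, and target value below are placeholders; the target of 80_565 just mirrors the value from the PDB session above.

import snowflake.connector

# Placeholder connection; real credentials/parameters omitted.
conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()
cur.execute("SELECT * FROM some_large_table")  # hypothetical table

# Hypothetical per-partition row target derived from the requested partition size.
target = 80_565

for batch in cur.get_result_batches():
    if batch.rowcount > target:
        print(f"{batch.rowcount=} exceeds {target=}")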

So I see a few possible options here:

1) Try to split up the large batches (see the sketch after this list). That'll be complex because I think we'll have to materialize (fetch) the results to do so. That means, with the way the code is currently written, we'd have a mix of materialized and non-materialized results. Not ideal, but I'm happy to dive further down the rabbit hole if desired.

2) Accept the fact that we don't control the batch sizes, and that partition sizes will occasionally exceed the request when the requested size is smaller than around 5 MiB (based on what I've seen so far). We could emit a warning that some partitions are larger than what the user requested and nudge them towards the dask.dataframe.repartition docs.

3) Something else I haven't thought of, but would be happy to hear about.
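
To make option 1 concrete, here's a minimal sketch of what splitting an oversized batch could look like, assuming we're willing to pay the materialization cost: call the batch's to_arrow() and slice the resulting pyarrow Table. The function name and max_rows parameter are made up for illustration; this isn't how the code is structured today.

import pyarrow as pa

def split_oversized_batch(batch, max_rows: int) -> list[pa.Table]:
    # Materializing (fetching) the batch is the part that makes this
    # approach awkward: it breaks the otherwise-lazy handling of batches.
    table = batch.to_arrow()
    # Zero-copy slices of at most max_rows rows each.
    return [
        table.slice(offset, max_rows)
        for offset in range(0, table.num_rows, max_rows)
    ]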

phobson commented 1 year ago

Potential third option from me: relax the test a bit. Currently the check is:

assert (partition_sizes < 2 * parse_bytes("2 MiB")).all()

So we're already using a fudge factor of 2. Based on what I've noticed digging into this, I think we could get away with something between 2.2 and 2.5.
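
If we go that route, the change is basically just bumping the multiplier. A sketch of what the relaxed assertion could look like, with partition_sizes computed roughly the way the existing test does it (I'm paraphrasing here, and ddf stands in for the frame returned by read_snowflake):

from dask.utils import parse_bytes

# Hypothetical relaxed fudge factor, based on the overshoot observed so far.
FUDGE_FACTOR = 2.5

# Approximate per-partition sizes in bytes.
partition_sizes = ddf.map_partitions(
    lambda df: df.memory_usage(deep=True).sum()
).compute()

assert (partition_sizes < FUDGE_FACTOR * parse_bytes("2 MiB")).all()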