coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License

Result batching #34

Closed · ian-r-rose closed 1 year ago

ian-r-rose commented 1 year ago

Fixes #7. This is ready for some eyes.

There is an API-design question in particular that I'd like to call out. I've chosen to model the new keyword arguments around those of DataFrame.repartition, so the user has to provide exactly one of npartitions or partition_size. This requires some thought from the user of read_snowflake, as I have not provided defaults here. Other options here could be

  1. Provide a default like partition_size="100 MiB" or similar, so the user doesn't have to specify anything. This might be the wrong choice in some circumstances, and it is also different behavior from DataFrame.repartition.
  2. Fall back on the Snowflake-provided result batches, so one batch -> one partition. I don't really like that: we don't control their batching, and the batches are typically much smaller than we want, so it would wind up being a poor default in most circumstances.

Update: went with option 1
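
For anyone following along, here is a minimal usage sketch of the API as described above. The keyword names (npartitions, partition_size) come from this discussion, and the query and connection details are placeholders, so treat this as an illustration rather than the final signature:

```python
import dask_snowflake

# Placeholder connection details -- substitute your own account settings.
connection_kwargs = {
    "user": "...",
    "password": "...",
    "account": "...",
    "warehouse": "...",
    "database": "...",
    "schema": "...",
}

# Ask for a fixed number of partitions...
ddf = dask_snowflake.read_snowflake(
    "SELECT * FROM my_table",
    connection_kwargs=connection_kwargs,
    npartitions=10,
)

# ...or target an approximate in-memory size per partition
# (with option 1, omitting both falls back to a default like "100 MiB").
ddf = dask_snowflake.read_snowflake(
    "SELECT * FROM my_table",
    connection_kwargs=connection_kwargs,
    partition_size="100 MiB",
)
```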

mrocklin commented 1 year ago

Let's ask @mdwgrogan for his thoughts on API spelling

For defaults: if the current Snowflake default is megabyte-scale batching, then I think we should definitely provide a saner default. I'm in favor of 100 MB.

I'm curious: will we (should we?) get pyarrow strings out by default? If the answer is "we shouldn't do this yet", then I'd suggest 20-50 MB batches to be a bit more conservative, due to the Python object dtype expansion. Or we could look at one partition, expand it a bit, and apply a sizing factor.
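
To make the "object dtype expansion" concrete, here is a small stand-alone sketch (not part of this PR) comparing the footprint of a string column stored as Arrow data versus as pandas object dtype; the data and sizes are made up for illustration:

```python
import pyarrow as pa

# The same string column stored as Arrow data vs. pandas object dtype.
table = pa.table({"s": ["some moderately long string"] * 100_000})

arrow_bytes = table.nbytes
pandas_bytes = int(table.to_pandas()["s"].memory_usage(deep=True))

# The ratio is the rough expansion factor from Arrow strings to
# Python-object strings; it is typically well above 1, which is the
# argument for a more conservative default partition size.
print(arrow_bytes, pandas_bytes, pandas_bytes / arrow_bytes)
```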

ian-r-rose commented 1 year ago

> Let's ask @mdwgrogan for his thoughts on API spelling
>
> For defaults: if the current Snowflake default is megabyte-scale batching, then I think we should definitely provide a saner default. I'm in favor of 100 MB.

Makes sense, I'll make this change for now.

> I'm curious: will we (should we?) get pyarrow strings out by default? If the answer is "we shouldn't do this yet", then I'd suggest 20-50 MB batches to be a bit more conservative, due to the Python object dtype expansion. Or we could look at one partition, expand it a bit, and apply a sizing factor.

Currently we don't get pyarrow strings by default, but I believe this is possible to do here without too much effort. From my perspective, we should probably get a dask release or two under our belts with better pyarrow string support before pushing it to downstream projects.
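
For reference, opting a result column into Arrow-backed strings is already straightforward in pandas itself; a minimal sketch, assuming pyarrow is installed and using a made-up frame:

```python
import pandas as pd

# A stand-in frame with an object-dtype string column.
df = pd.DataFrame({"name": ["alice", "bob", "carol"]})

# Convert to Arrow-backed strings; this typically uses far less memory
# than Python-object strings for large string columns.
df["name"] = df["name"].astype("string[pyarrow]")

print(df.dtypes)
```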

mrocklin commented 1 year ago

If we're using Python strings (which makes sense), then maybe 20-50 MB would be a bit more conservative?

ian-r-rose commented 1 year ago

> If we're using Python strings (which makes sense), then maybe 20-50 MB would be a bit more conservative?

In the current implementation, I'm basing the size on an actual pandas dataframe with object dtypes, so I'm not too worried about size inflation. Unless I misunderstand your point?
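
In other words, the sizing is measured on the materialized object-dtype frame, not on the raw Arrow bytes. A rough sketch of that idea, using a made-up batch and plain pandas rather than the actual code in this PR:

```python
import pandas as pd

# Stand-in for one result batch converted to pandas (strings as object dtype).
batch = pd.DataFrame({"id": range(1_000), "name": ["example"] * 1_000})

# Measure the real in-memory footprint, including the Python string objects.
batch_bytes = int(batch.memory_usage(deep=True).sum())

# Given a target partition size, estimate how many such batches to group
# into a single output partition.
target_bytes = 100 * 2**20  # ~100 MiB
batches_per_partition = max(1, target_bytes // max(batch_bytes, 1))
print(batch_bytes, batches_per_partition)
```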

mrocklin commented 1 year ago

Oh great, never mind then.
