coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License

Improve performance of concatenation #35

Closed mrocklin closed 1 year ago

mrocklin commented 1 year ago

Users observed a 10% performance difference. I've pinged you on the issue. Not a big deal, and this shouldn't block a release, but it might not be hard to address either.

phobson commented 1 year ago

Interesting error to get here:

```
=========================== short test summary info ===========================
FAILED dask_snowflake/tests/test_core.py::test_result_batching - pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
NAME: string
ID: int64
X: double
Y: double
vs
NAME: string
ID: int16
X: double
Y: double
```

mrocklin commented 1 year ago

Yup. I saw. I suspect that concat rules are a little different regarding dtypes. I haven't investigated nor do I plan to in the near future.

If anyone else wants to pick this up and carry it over the line, that would be grand. My guess is that this means looking more deeply at the dtypes everything is returning and how the different functions handle them, determining what sensible behavior is, and then seeing whether what we're doing is sensible.


jrbourbeau commented 1 year ago

Thanks for digging in @phobson. It's not clear to me why we would be getting different schemas for table batches. Possibly a Snowflake bug (naively, at least, I would expect a consistent schema when breaking a single table up into batches). @phobson, are you aware of anything we can do on the Snowflake side to get (or cast to) a uniform schema? For example, could we get the overall table schema and then convert each result batch to that schema? Note that this would probably reduce the motivation for adding this optimization in the first place.

Also, if you're up for it, would you mind opening up an issue in the snowflake connector issue tracker for this? (if there's not already one open)
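
A minimal sketch of that cast-to-a-uniform-schema idea, assuming the result batches arrive as `pyarrow.Table`s; `normalize_batches` is a hypothetical helper, not part of dask-snowflake:

```python
import pyarrow as pa

# Hypothetical helper: cast every batch to the schema of the first batch
# before concatenating, so a stray int16 batch is widened back to int64.
def normalize_batches(batches):
    reference = batches[0].schema
    return [b if b.schema.equals(reference) else b.cast(reference) for b in batches]

# Toy batches mimicking the mismatch from the test failure.
batches = [
    pa.table({"ID": pa.array([1, 2], type=pa.int64())}),
    pa.table({"ID": pa.array([3, 4], type=pa.int16())}),
]
combined = pa.concat_tables(normalize_batches(batches))
print(combined.schema)  # ID: int64
```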

mrocklin commented 1 year ago

My guess is that Snowflake is handing us int16s, pyarrow is concatenating those to an int16, but pandas is more conservative and concatenates to int64.
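
That pandas behavior is easy to verify in isolation (a standalone check, not code from this PR):

```python
import pandas as pd

# pandas concat up-casts mixed integer widths to the wider type.
s16 = pd.Series([1, 2], dtype="int16")
s64 = pd.Series([3, 4], dtype="int64")
print(pd.concat([s16, s64]).dtype)  # int64
```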


jrbourbeau commented 1 year ago

I believe the error is coming from the `pa.concat_tables(...)` call, where we attempt to concat `pyarrow.Table`s that have different schemas -- we error before pandas comes into play. I'm not sure if the schema mismatch already exists in the result batches Snowflake gives us, or if it comes up when we convert to `pyarrow.Table`s. @phobson, could you insert a `breakpoint()` and see what the earliest point is that we have a mismatch?
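
For reference, a minimal standalone reproduction of that failure mode; the column name and types just mimic the test output, and none of this is code from the repo:

```python
import pyarrow as pa

# Tables whose schemas disagree only in integer width, as in the test failure.
t1 = pa.table({"ID": pa.array([1, 2], type=pa.int64())})
t2 = pa.table({"ID": pa.array([3, 4], type=pa.int16())})

try:
    pa.concat_tables([t1, t2])
except pa.lib.ArrowInvalid as exc:
    print(exc)  # Schema at index 1 was different: ...

# With type promotion enabled, pyarrow widens to a common type instead
# (`promote=True` on older pyarrow versions; `promote_options="permissive"` on newer).
print(pa.concat_tables([t1, t2], promote=True).schema)  # ID: int64
```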

phobson commented 1 year ago

@jrbourbeau it looks like the second through _N_th partitions are the offending ones.

Using the test ddf from the test suite, I see:

all(p["ID"].dtype == "int64" for p in ddf.partitions)  # True

# but but but....
from dask_snowflake.core import _fetch_query_batches
batches = _fetch_query_batches(f"SELECT * FROM {table}", connection_kwargs, {}).compute()
for n, b in enumerate(batches, 1):
    print(n, b.to_pandas()["ID"].dtype)

# 1 int64
# 2 int16
# 3 int16
# [snip]
# 18 int16

We set meta with the first (non-empty) partition. But it's unclear to me why the rest are different: https://github.com/coiled/dask-snowflake/blob/1a3cc8a191860bdea0a96f029366f7693f2e237c/dask_snowflake/core.py#L279-L281

What's confounding me further is that the raw schema for the ID column in each batch (type: ArrowChunk) is the same:

```python
all(batches[0].schema[1] == b.schema[1] for b in batches)  # True
```

phobson commented 1 year ago

This is bafflingly working for me locally now. Kicked it to CI to see what happens.

phobson commented 1 year ago

Figured it out. We end up (somehow) with an empty chunk in the test that's failing. Filtering out the empty chunks gets the test to pass. Hope you don't mind that I pushed to your PR, @mrocklin
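
For illustration, a minimal sketch of that fix, assuming the chunks are `pyarrow.Table`s and that the empty chunk is the one carrying the divergent schema (the variable names here are illustrative, not the PR's actual code):

```python
import pyarrow as pa

# Toy batches: an empty chunk whose schema diverges (int16), plus real data.
batches = [
    pa.table({"ID": pa.array([], type=pa.int16())}),
    pa.table({"ID": pa.array([1, 2], type=pa.int64())}),
    pa.table({"ID": pa.array([3, 4], type=pa.int64())}),
]

# Drop zero-row chunks before concatenating so an empty chunk's divergent
# schema cannot trigger ArrowInvalid.
non_empty = [b for b in batches if b.num_rows > 0]
print(pa.concat_tables(non_empty)["ID"])  # [1, 2, 3, 4] as int64
```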