HGInsights / avalanche

Avalanche is an Elixir Snowflake Connector built on top of the Snowflake SQL API v2.
Apache License 2.0
17 stars 4 forks source link

Streaming support #51

Open bglusman opened 3 months ago

bglusman commented 3 months ago

Describe the solution you'd like I'm open to helping to contribute to this, but wanted to discuss first and make sure I'm not missing a problem or built in support I didn't find in docs... since Snowflake sends responses in chunks over HTTP API that Avalanche uses anyway, it would be nice if we could build a stream and start sending the response data to the stream as soon as we get first bytes back, to reduce memory usage and latency. I imagine this might, interface wise, be as simple as Avalanche.stream in addition to current Avalanche.run with ~ the same API?

Describe alternatives you've considered Considered trying to build this myself out of public or private APIs in Avaalanche, but haven't started playing yet.

Additional context Some snowflake queries are extremely slow and/or return extremely large responses, the large responses have created memory issues in the past for other teams we're trying to avoid.

bglusman commented 3 months ago

@forest happy to discuss here or in Elixir Slack sometime this week if you have a few minutes! I looked briefly into what it would take to add, and for a minute it looked like it might be trivial, either by just not converting the stream to a list in Avalanche.Steps.GetPartitions.get_partitions/2 or perhaps by using the built in support of Req library to accumulate data into a stream via into option... but I haven't actually validated whether either approach is viable, might just try playing around to see if I can make it work this week, but would love your input when you have a moment.

forest commented 3 months ago

@bglusman we would love to have streaming and happy to take PRs. I looked at this in the initial development but Req didn't support streaming yet. So, first step I would suggest it to update Req to the latest version. Then add the streaming capability. I think adding Avalanche.stream might be the cleanest. See what feels right to you.

One thing to consider is that Snowflake returns data in partitions of sizes and number you can't control. See these docos for the response flow and how the polling works. You will need to make sure you get streams from each partition and add them together in the right order. https://docs.snowflake.com/en/developer-guide/sql-api/handling-responses

Happy to support your efforts. Ping me on Elixir slack to discuss more.

forest commented 3 months ago

@bglusman I was looking at the code and I think it might be easiest to create a new Req step for streaming. Copy Avalanche.Steps.GetPartitions and create Avalanche.Steps.StreamPartitions. Then in @run_options_schema add an option to turn on streaming. Use that option when you build the Req pipeline here https://github.com/HGInsights/avalanche/blob/main/lib/avalanche/requests/statement_request.ex#L86 to conditionally attach Steps.GetPartitions or Steps.StreamPartitions.

Just an idea.

bglusman commented 2 months ago

Apologies I haven't responded here since May, still hoping to have time to work on this "real soon now", but unclear priority at the moment for $DAYJOB and not much freetime at the moment... but appreciate the guidance and hope we can get this in eventually! If anyone else is interested of course speak up here, @jeremyowensboggs was also interested I know, and/or maybe @notactuallytreyanastasio