mjclarke94 opened 1 year ago
hey @mjclarke94 thanks for the feature request! We're currently prioritizing making it easier to use Dagster without one of the provided I/O managers (which in turn would make it easier to use Snowpark without us needing to write a full integration), but we'll keep this in mind for future work
In the meantime, if you want to use Snowpark to work with your data directly in Snowflake, you could look into writing a Resource that provides a Snowpark session to your assets. You could then do your computations in the asset without returning any values, and use `deps`
to create asset dependencies: https://docs.dagster.io/concepts/assets/software-defined-assets#defining-basic-dependencies
I've started dabbling in Snowpark too. My first thought was to create an asset that returned a Snowpark DataFrame. However, when I attempted that, I got an error that `_thread.lock`
was not pickleable. Not sure if that's the right approach here, but it was my first thought.
You might be getting the pickle error because you're still using the default IO manager, which tries to pickle all outputs before storing them on the filesystem. You'd need to write an IO manager that can handle storing Snowpark DataFrames
I'm not really sure how we'd write an IO manager for a Snowpark dataframe because it's kind of an ephemeral object. The dataframe doesn't actually store the data (like a Pandas dataframe does). Instead, it's just an object that can be used to compile SQL that ends up being executed in Snowflake. The problem seems to be that these dataframe objects contain the SQL session object as part of them, which is not serializable.
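The failure mode can be reproduced without Snowflake at all: pickle rejects any object that carries a live lock (connections hold locks internally), which is exactly what the default IO manager runs into. A sketch with a stand-in class (`FakeLazyFrame` is invented for illustration):

```python
import pickle
import threading


class FakeLazyFrame:
    """Stand-in for a Snowpark DataFrame: a SQL plan plus a live session handle."""

    def __init__(self, sql: str):
        self.sql = sql                 # the deferred query text
        self._lock = threading.Lock()  # connections hold locks like this one

    def collect(self):
        # A real DataFrame would execute self.sql in Snowflake here.
        raise NotImplementedError


frame = FakeLazyFrame("select * from ORDERS where AMOUNT > 100")

try:
    pickle.dumps(frame)
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.lock' object
```

Pickling the SQL text alone would work fine; it's the attached session state that makes the object as a whole unserializable.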
Hey there ! Any news on this ?
What's the use case?
Snowpark is now stable for public usage and seems to be the preferred approach for interacting with Snowflake from Python. Not only does it allow complex queries to be composed without manually building up SQL strings, but it also offers some performance benefits (based entirely on my own non-quantitative experience!)
I think there are a few potential wins here:
Snowpark dataframes can either be converted to pandas dataframes (providing parity with current behaviour, albeit a fair bit faster), or operated on directly. When used directly, you effectively get a lazy frame which can be operated on in various ways and fed back to the IO manager to be written to a new table. None of this requires the data to leave Snowflake.
We've had calculations where the operation itself is trivial but requires huge amounts of data to perform. For these sorts of operations, you can knock an order of magnitude off the total materialisation time, as you aren't waiting on data transfer and Dagster can get by on extremely minimal compute resources.
The pythonic API for composing operations on tables is also a lot more readable than manually composing SQL commands. For something like "Fetch the data in this table relating to a given range of time partitions" the query would look something like:
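The original snippet wasn't preserved here, but pseudo-code along these lines illustrates the idea (table and column names are hypothetical, and `session` is assumed to come from a Snowpark resource):

```python
from snowflake.snowpark.functions import col

df = (
    session.table("MY_TABLE")
    .filter(col("PARTITION_DATE").between(start_partition, end_partition))
    .select("PARTITION_DATE", "VALUE")
)
result = df.to_pandas()  # or keep it lazy and hand it back to the IO manager
```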
Pseudo-code, but hopefully shows that it is a bit more readable than needing to slowly build up a SQL string as is needed in the IO manager at the moment.
Ideas of implementation
A new IO manager which uses Snowpark rather than the snowflake-connector directly would be ideal. The `session` object used at the root of most workflows is analogous to the `SnowflakeConnection` used to directly execute custom queries.

Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.