databricks / databricks-sql-python

Databricks SQL Connector for Python
Apache License 2.0
168 stars, 94 forks

Allow ingesting in-memory file-like objects #435

Open dhirschfeld opened 2 months ago

dhirschfeld commented 2 months ago

Writing large amounts of data to disk, only for databricks-sql-connector to then read it back in from disk, is incredibly inefficient.

It would be much more efficient to be able to provide a file-like object instead of a file path. That way, a user could write the data to an in-memory io.BytesIO object instead of writing it to disk.
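For illustration, here is a minimal sketch of what "writing to an in-memory io.BytesIO object" could look like on the user side. It only uses the standard library; the helper name rows_to_csv_buffer and the sample rows are made up for the example and are not part of the connector.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize rows to CSV inside an in-memory binary buffer (no disk I/O)."""
    buf = io.BytesIO()
    # csv.writer needs a text stream, so wrap the binary buffer temporarily.
    text = io.TextIOWrapper(buf, encoding="utf-8", newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    text.detach()  # release the wrapper without closing the underlying buffer
    buf.seek(0)    # rewind so a consumer reads from the start
    return buf


# Hypothetical sample data, purely for demonstration.
buffer = rows_to_csv_buffer([("id", "name"), (1, "alice"), (2, "bob")])
```

The resulting buffer behaves like a file opened in "rb" mode, which is what makes it a candidate for being handed to an upload routine in place of a path.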

dhirschfeld commented 2 months ago

That is, allow the file handle (fh) to be passed in, rather than creating it internally by opening a file from the filesystem: https://github.com/databricks/databricks-sql-python/blob/d31063ca918167412153a368c13a99055bf89c02/src/databricks/sql/client.py#L656-L668
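A rough sketch of the idea, not the connector's actual code: a small context manager that opens a file when given a path and otherwise passes the caller's file-like object straight through. The names open_binary_source and upload are hypothetical, and send is a stand-in for whatever call the connector uses to transmit the handle.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def open_binary_source(source):
    """Yield a binary file handle for a path, or pass a file-like object through."""
    if isinstance(source, (str, os.PathLike)):
        # Path given: open it here and close it when the block exits.
        with open(source, "rb") as fh:
            yield fh
    else:
        # File-like object given: the caller owns it, so do not close it.
        yield source


def upload(source, send):
    """Hypothetical upload step; `send` stands in for the real transfer call."""
    with open_binary_source(source) as fh:
        return send(fh)


# Usage with an in-memory object; the lambda just reads the bytes back.
payload = io.BytesIO(b"col_a,col_b\n1,2\n")
result = upload(payload, lambda fh: fh.read())
```

Leaving the caller's object open is a deliberate choice in this sketch: whoever created the stream decides when to close or reuse it.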

kravets-levko commented 2 months ago

Hi @dhirschfeld! This does sound like an interesting feature, thank you for sharing it! I have to talk with the rest of the team first. The Databricks SQL GET and PUT commands expect a local file path to be specified, and I don't know if we ever considered using streams instead of real files. If we agree that there are no risks with this approach, we would eventually have to implement it across all drivers.

susodapop commented 1 month ago

Some added context: @dhirschfeld's idea is exactly how the e2e tests for this feature behave (since we ran them in GitHub Actions, where we don't have a real file system to write to). It should be a straightforward modification.