Open avriiil opened 1 week ago
copying over this comment from @jaychia from Slack
I’m interested in figuring out the API. Was thinking we could leverage .url.download()
like how we do for S3 URLs, but the tricky thing to try and make work is the credentials story as well as the URL scheme.
Maybe we can recognize URLs like unity://<volume_name>/...
and perform a 2-step process:
.url.download()
on the S3 URLs using those S3 credentialsWhat do you think instead of:
volume = unity.load_volume("unity.default.images")
df = daft.from_unity_catalog_volume(volume)
# Shows references, file sizes etc
df.show()
df_img = df.with_column("image_data_bytes", df["references"].unity_catalog.download(volume))
df_img = df.with_column("image", df["image_data_bytes"].image.decode())
I like the simplicity of this.
My brain hesitates for a minute at the idea of putting unstructured data into a DataFrame...but then I guess that's the basic premise of Daft :) We should document clearly that daft.from_uc_volume
creates a DataFrame containing only the metadata.
JSON is the other big use-case here. So that would look something like:
#create df with multiple json refs
df_json = df.with_column("json", df["references"].unity_catalog.download(volume))
#perform json expressions on content
df_json = df.with_column("json_query", df["json"].json.query(...))
What if we want to take a single JSON stored in UC and turn it into a dataframe? We'd need something like:
volume = unity.load_volume("unity.default.json")
uc_json = volume.get(path=<path/to/json>)
df = daft.read_json(uc_json)
Or do we have users create the full df
containing all the files first + then selectively filter the row(s) they need to operate on further?
It would be great to start building out Volume support from Daft for Unity Catalog. Images and JSON feel like the highest prio to start with.
Right now Table supports looks like this:
It would be cool to be able to do something like this (pseudo-code):
Of course, whatever API we design needs to be able to handle many different data types. Perhaps it makes more sense to introduce a sublevel per dtype, e.g.
.uc_volume.image.download()
.uc_volume.json.quey()
In this case we could leverage existing methods/expressions in the
url
,image
andjson
modules.Curious to hear what other folks think.