NellyWhads opened this issue 1 day ago
Hi @NellyWhads, thanks for raising this issue! I'll probably take a closer look later this week to see what it would take to make this happen. Our initial impression is that this should be very doable. If you'd like to work on this we can also give you detailed pointers once we've taken that closer look.
Thanks @desmondcheongzx - I'm not particularly tied to having my name on the implementation. Though I have a working familiarity with `ray`, I imagine the active maintainers here have more experience and insight. If you're willing to implement a fully working solution, I'd be equally appreciative!
### Is your feature request related to a problem?
The current APIs to convert a `daft.DataFrame` to a `ray.data.Dataset` include:

- `daft.DataFrame.to_ray_dataset`
- `daft.DataFrame.write_parquet`, then loading the files directly with `ray.data.read_parquet`
- `daft.DataFrame.to_torch_map_dataset`
- `daft.DataFrame.to_torch_iter_dataset`
All of these implementations require materializing the entire dataframe. This is of particular concern for large datasets with images and other dense objects, which cannot fit into system memory or the Ray object store.
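For reference, a minimal sketch of today's fully-materializing path (assuming the Ray runner is active; `to_ray_dataset` is the public Daft API listed above):

```python
import daft

# to_ray_dataset() requires Daft's Ray runner.
daft.context.set_runner_ray()

df = daft.from_pydict({"id": [1, 2, 3]})

# Every block is computed up front -- per this issue, the result is a
# ray.data.MaterializedDataset rather than a lazy ray.data.Dataset.
ds = df.to_ray_dataset()
```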
### Describe the solution you'd like
I would like to see a `ray.data.Datasource` which creates a `ray.data.Dataset` instead of a `ray.data.MaterializedDataset`. The implementation should lazily reference the appropriate block/partition of the dataframe and materialize it on demand; a sketch of the intended shape follows below.

I'm more than willing to continue trying things out, but I need some guidance from the devs on what a reasonable next step might be.
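For concreteness, a minimal sketch of the shape such a source could take, assuming Daft could expose a cheap, serializable per-partition handle (the `partition_readers` callables below are hypothetical, and the `Datasource`/`ReadTask`/`BlockMetadata` signatures are as of Ray 2.x):

```python
from typing import Callable, List, Optional

import pyarrow as pa
import ray
from ray.data.block import BlockMetadata
from ray.data.datasource import Datasource, ReadTask


class LazyDaftDatasource(Datasource):
    """Hypothetical sketch: one deferred ReadTask per Daft partition."""

    def __init__(self, partition_readers: List[Callable[[], pa.Table]]):
        # One zero-argument callable per partition, each returning a
        # pyarrow.Table when invoked. Passing small callables instead of
        # the whole DataFrame sidesteps pickling df._builder (see the
        # limitations under "Additional Context" below).
        self._partition_readers = partition_readers

    def estimate_inmemory_data_size(self) -> Optional[int]:
        # Unknown without materializing; None is an allowed answer.
        return None

    def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
        tasks = []
        for reader in self._partition_readers:
            # Row counts and sizes are unknown up front for a lazy source.
            # (BlockMetadata fields as of Ray 2.x; newer releases differ.)
            metadata = BlockMetadata(
                num_rows=None,
                size_bytes=None,
                schema=None,
                input_files=None,
                exec_stats=None,
            )
            # The lambda runs on a Ray worker only when the block is
            # actually needed, which is the lazy behaviour requested here.
            tasks.append(ReadTask(lambda r=reader: [r()], metadata))
        return tasks


# Hypothetical usage: read_datasource returns a lazy ray.data.Dataset.
# ds = ray.data.read_datasource(LazyDaftDatasource(readers))
```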
### Describe alternatives you've considered
See links in the description.
The only "working" alternative solution I have so far is the following, rather crude, implementation:
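A minimal sketch of what such a crude workaround might look like (a hypothetical stand-in, using only the public `daft.DataFrame.to_arrow()` and `ray.data.from_arrow()` APIs):

```python
import daft
import ray


def crude_daft_to_ray(df: daft.DataFrame) -> ray.data.Dataset:
    # Fully materializes the dataframe into a single pyarrow.Table on the
    # driver, then hands it to Ray. "Working", but it has exactly the
    # memory blow-up this issue is trying to avoid.
    return ray.data.from_arrow(df.to_arrow())
```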
### Additional Context
I have found the following limitations after a few implementation attempts:

- There doesn't seem to be a way to use the `parallelism` configuration to set the partitions on the dataframe in a meaningful way. If the dataframe was not partitioned before applying a set of logical operations, an after-the-fact repartitioning call has no practical effect. This makes a lot of sense as implemented, since global operations like `groupby` need access to the entire dataframe at a time. I don't know how this could be circumvented.
- `df._builder` (and perhaps other properties as well) cannot be serialized to a `ray` object ref. This prevents us from passing a dataframe into the constructor of a `ray.data.Datasource`.
- A `RayMaterializedResult` can only be produced from a `generator`.
- `ray.data.Datasource.get_read_tasks()` needs to create a `list` of tasks, which requires a call to `list(daft.context.get_context().runner().run_iter(df._builder))` and thereby effectively materializes the entire dataframe (see the snippet after this list).
- The `Scheduler` doesn't seem to provide a way of listing/accessing individual threads. This leads to the same limitation as above.
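To make the fourth limitation concrete, this is the pattern that forces materialization (the call chain is quoted from above; the enclosing `get_read_tasks()` is hypothetical):

```python
# Inside a hypothetical Datasource.get_read_tasks():
# run_iter() yields partitions lazily, but get_read_tasks() must return a
# concrete list, so list() drives the runner to completion -- every
# partition is computed before Ray schedules a single ReadTask.
results = list(daft.context.get_context().runner().run_iter(df._builder))
```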
### Would you like to implement a fix?

Yes