cosmicBboy opened this issue 1 year ago
@cosmicBboy so the discussion was to implement this entirely in flytekit? In that case, processing a single `StructuredDataset` partition does not necessarily have to align with a `map_task` but could be performed by any task; mapping over the partitions is just an ergonomic API?
@hamersaw / @cosmicBboy I feel this is a larger story than just `StructuredDataset`. I would love to be able to map over any mappable entity: a flat directory of files, a list of values, a map of key-values, a `StructuredDataset` of partitions, etc. To me, it feels like a materialized trait on the Literal that makes it possible to map over that object.
yep, the code example in the description is kinda gross because it special-cases `StructuredDataset`. However, if we had a notion of mappable types in Flyte, with natural (but overridable) ways of reducing the results of applying a function via map tasks, that would be 🔥
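The "mappable types with overridable reduction" idea could be sketched in plain Python. This is only an illustration of the trait's shape, not Flyte's actual Literal machinery; all names here (`MappableList`, `partitions`, `map_reduce`, `reducer`) are hypothetical:

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


class MappableList:
    """Toy 'mappable' literal: a list of values with a default reduction
    (collect into a list) that callers can override."""

    def __init__(self, items: List[T]):
        self.items = items

    def partitions(self) -> Iterable[T]:
        # Yield the independently processable chunks of this literal.
        return iter(self.items)

    def map_reduce(self, fn: Callable, reducer: Optional[Callable] = None):
        # "map" step: apply fn to each partition independently.
        results = [fn(x) for x in self.partitions()]
        # "reduce" step: default is to collect results as-is; a
        # StructuredDataset implementation might instead write all
        # outputs under a shared blob store prefix.
        return reducer(results) if reducer else results


print(MappableList([1, 2, 3]).map_reduce(lambda x: x * 2))
print(MappableList([1, 2, 3]).map_reduce(lambda x: x * 2, reducer=sum))
```

A `StructuredDataset` would implement `partitions()` by enumerating its on-disk partitions and override the reduction to coalesce outputs, which is the implicit behavior the issue asks for.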
Motivation: Why do you think this is important?

As a data practitioner, I should be able to apply a `map_task` to a partitioned `StructuredDataset` automatically, so that I can process the partitions in an embarrassingly parallel fashion without too much extra code.

Goal: What should the final outcome look like, ideally?
Suppose we have a task that produces a `StructuredDataset`. Ideally, I should be able to do something like this:
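The original snippet was lost in this copy of the issue; the sketch below reconstructs it from the "magical things" the description enumerates. Note that `structured_dataset.partitions` and the implicit reduction performed by `map_task` are proposed behavior, not APIs that exist in flytekit today:

```python
import pandas as pd
from flytekit import map_task, task, workflow
from flytekit.types.structured import StructuredDataset


@task
def make_df() -> StructuredDataset:
    # Imagine this produces a dataset partitioned on some column.
    df = pd.DataFrame({"region": ["us", "eu"], "value": [1, 2]})
    return StructuredDataset(dataframe=df)


@task
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df["value"] = df["value"] * 2
    return df


@workflow
def wf() -> StructuredDataset:
    structured_dataset = make_df()
    # Proposed: map over the dataset's partitions; the per-partition
    # outputs are implicitly reduced by writing them to the same
    # blob store prefix.
    return map_task(process_df)(df=structured_dataset.partitions)
```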
Note that in this example code a few magical things are happening:

- `structured_dataset.partitions` is passed into the map task, which indicates that we want to apply `process_df` to each of the partitions defined in `make_df`.
- `map_task(process_df)` returns a `StructuredDataset`, which implies that using map tasks with structured datasets does an implicit reduction, i.e. the outputs of `map_task(process_df)` are written to the same blob store prefix.

Ideally the solution enables processing of `StructuredDataset` partitions without having to manually handle reading partitions in within the map task, and automatically reduces the results into a `StructuredDataset` without having to explicitly write a coalesce/reduction task.

Describe alternatives you've considered
Users would have to roll their own way of processing partitions of a structured dataset using dynamic tasks.
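For comparison, the semantics users must hand-roll today can be illustrated without Flyte at all. This is a minimal pure-Python sketch (partitions modeled as lists of records; `process_partition` is a stand-in for the per-partition task body):

```python
# Map a function over partitions, then reduce by coalescing the
# per-partition outputs back into one flat dataset.

def process_partition(rows):
    # Stand-in for the per-partition task body.
    return [{**r, "value": r["value"] * 2} for r in rows]


partitions = [
    [{"region": "us", "value": 1}],
    [{"region": "eu", "value": 2}, {"region": "eu", "value": 3}],
]

# "map" step: apply the task to each partition independently.
mapped = [process_partition(p) for p in partitions]

# "reduce" step: coalesce outputs into a single flat dataset, analogous
# to writing all map-task outputs under one blob store prefix.
reduced = [row for part in mapped for row in part]
print(reduced)
```

With dynamic workflows, the user owns both the fan-out and this coalescing step; the proposal above would make both implicit.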
Propose: Link/Inline OR Additional context
Slack context: https://flyte-org.slack.com/archives/CP2HDHKE1/p1673380243923279
Related to https://github.com/flyteorg/flyte/issues/3219
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?