IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Capability to distribute a large binary object (e.g. a table) to all the transform instances during initialization #608

Open cmadam opened 1 week ago

cmadam commented 1 week ago

Component

Library/core

Feature

In the filtering phase of Spark fuzzy dedup, we need to pass a very large table (millions to billions of rows) of document ids that have been identified as duplicates, and must therefore be removed from the dataset, to the transform objects running on each worker. An ideal solution would be to pass these tables as part of the transform configuration that the framework distributes to the transform objects at initialization (a sketch of what we have in mind is below). However, I am not aware of any capability to pass a really large binary object in these configurations.

A similar requirement exists for the block listing transforms, where a large table of restricted domains needs to be sent to each transform during initialization.
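To make the ask concrete, here is a rough, hypothetical sketch of the transform-side view we are after, assuming the framework could place a large pyarrow Table into the configuration dict it hands each transform at initialization. `DedupFilterTransform`, the `duplicates_table` config key, and the `doc_id` column are all made-up names, not existing API:

```python
import pyarrow as pa
from data_processing.transform import AbstractTableTransform


class DedupFilterTransform(AbstractTableTransform):
    def __init__(self, config: dict):
        super().__init__(config)
        # The large binary object would arrive with the rest of the configuration.
        duplicates: pa.Table = config["duplicates_table"]
        self.duplicate_ids = set(duplicates.column("doc_id").to_pylist())

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict]:
        # Drop every row whose doc_id was flagged as a duplicate.
        mask = pa.array(
            [doc_id not in self.duplicate_ids for doc_id in table.column("doc_id").to_pylist()]
        )
        filtered = table.filter(mask)
        return [filtered], {"rows_removed": table.num_rows - filtered.num_rows}
```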

cmadam commented 1 week ago

@blublinsky: these are the hacks that provide us with the desired functionality now:

https://github.com/IBM/data-prep-kit/blob/2fb92f78b395ac53563b1885b6b78b2362de1075/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_orchestrator.py#L56-L63

https://github.com/IBM/data-prep-kit/blob/2fb92f78b395ac53563b1885b6b78b2362de1075/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_orchestrator.py#L100-L106

blublinsky commented 6 days ago

This is already supported. See https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_runtime.py for Python and https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_runtime.py for Spark.

For Python it works exactly the way you hacked it. For Spark, the runtime runs for every partition - not ideal, but it will suffice for now. The right thing for Spark would be to read the table on the driver and then broadcast it, roughly as in the sketch below. We can extend https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_runtime.py to do just that. It's a simple enough change.
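Just a sketch of the broadcast variant, assuming the duplicates table sits in a parquet file; the path and column names are placeholders:

```python
import pyarrow.parquet as pq
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fuzzy-dedup-filter").getOrCreate()
sc = spark.sparkContext

# Read the large table once, on the driver ...
duplicate_ids = set(pq.read_table("duplicates.parquet").column("doc_id").to_pylist())

# ... then broadcast it, so each executor holds a single read-only copy
# instead of re-reading the table in every partition.
bcast = sc.broadcast(duplicate_ids)

def filter_partition(rows):
    # Runs on the executors; bcast.value resolves to the local copy.
    ids = bcast.value
    return (row for row in rows if row["doc_id"] not in ids)

# e.g. filtered_rdd = rows_rdd.mapPartitions(filter_partition)
```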

The runtime class was introduced specifically for this purpose and is already used by a few transforms, for example here https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/python/src/ededup_transform_python.py#L70 and here https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/spark/src/doc_id_transform_spark.py#L148. A sketch along those lines for your case is below.
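For the fuzzy dedup case that would look something like this. This is only a sketch modeled on the ededup example above: `DedupFilterRuntime` and the `duplicates_path`/`duplicates_table` keys are made-up names, and I am assuming the `get_transform_config` hook and parameter handling from the linked runtime files:

```python
from typing import Any

import pyarrow.parquet as pq
from data_processing.runtime.pure_python import DefaultPythonTransformRuntime


class DedupFilterRuntime(DefaultPythonTransformRuntime):
    def get_transform_config(self, data_access_factory, statistics, files) -> dict[str, Any]:
        # Load the large table once, on the orchestrator, and return it as part
        # of the configuration handed to every transform instance.
        # self.params is assumed to be the parameter dict set by the base class.
        duplicates = pq.read_table(self.params["duplicates_path"])
        return self.params | {"duplicates_table": duplicates}
```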

cmadam commented 6 days ago

Thank you for the pointers. I will use transform_runtime.py to address this issue.