Open cmadam opened 1 week ago
@blublinsky : these are the hacks that provide us with the desired functionality now: https://github.com/IBM/data-prep-kit/blob/2fb92f78b395ac53563b1885b6b78b2362de1075/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_orchestrator.py#L56-L63 https://github.com/IBM/data-prep-kit/blob/2fb92f78b395ac53563b1885b6b78b2362de1075/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_orchestrator.py#L100-L106
This is already supported. See https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_runtime.py for Python and https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_runtime.py for Spark
For Python it works exactly the way you hacked it. For Spark, the runtime currently runs for every partition, which is not ideal but will suffice for now. The right thing for Spark would be to read the table once on the driver and then broadcast it. We can extend https://github.com/IBM/data-prep-kit/blob/dev/data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_runtime.py to do just this. It's a simple enough change.
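The "read on the driver, then broadcast" idea could be sketched roughly as below. This is a hedged illustration, not the data-prep-kit API: `build_transform_config`, `read_ids`, and the `duplicates_path` parameter are hypothetical names, and the fallback branch only exists so the sketch runs without a Spark session.

```python
# Hedged sketch: read the large duplicates table once on the driver and
# broadcast it, instead of re-reading it for every partition.
# Names here (build_transform_config, duplicates_path) are illustrative
# assumptions, not the actual data-prep-kit interfaces.
try:
    from pyspark.sql import SparkSession  # real Spark, if available
except ImportError:
    SparkSession = None


def build_transform_config(params: dict, read_ids, spark=None) -> dict:
    """Load the duplicate-id table once on the driver and make it
    available to executors via a broadcast variable."""
    duplicate_ids = set(read_ids(params["duplicates_path"]))
    if spark is not None:
        # One serialized copy is shipped to each executor, rather than
        # one copy per task/partition.
        bcast = spark.sparkContext.broadcast(duplicate_ids)
        return params | {"duplicate_ids": bcast}
    # Fallback for local testing without a Spark session.
    return params | {"duplicate_ids": duplicate_ids}


cfg = build_transform_config(
    {"duplicates_path": "ids.parquet"},
    read_ids=lambda path: ["d2", "d4"],  # stand-in for the real reader
)
print(sorted(cfg["duplicate_ids"]))  # → ['d2', 'd4']
```

Inside the transform, each worker would then look ids up in `bcast.value` instead of carrying its own copy of the table.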
The runtime class was introduced specifically for this purpose and is already used by a few transforms, for example here https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/python/src/ededup_transform_python.py#L70 and here https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/spark/src/doc_id_transform_spark.py#L148
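To make the pattern concrete, here is a minimal plain-Python sketch of how a runtime class can load a large table once and inject it into the transform configuration. The class and method names (`get_transform_config`, `FakeDataAccess`, `duplicate_ids`) are assumptions modeled loosely on the linked examples, not the exact data-prep-kit signatures.

```python
# Hypothetical sketch of the transform-runtime pattern: the runtime loads
# a big table once and hands it to transforms via their config dict.
# All names below are illustrative, not the real data-prep-kit API.

class FilterTransform:
    """Transform that drops rows whose doc id is in a duplicates set."""

    def __init__(self, config: dict):
        # The framework passes the runtime-enriched config to each transform.
        self.duplicates = config.get("duplicate_ids", set())

    def transform(self, rows: list) -> list:
        return [r for r in rows if r["doc_id"] not in self.duplicates]


class FilterTransformRuntime:
    """Runtime hook: runs once, reads the large ids table, and injects it
    into the configuration handed to the transform instances."""

    def __init__(self, params: dict):
        self.params = params

    def get_transform_config(self, data_access) -> dict:
        # A real runtime would read via the framework's data access object;
        # here we fake the read.
        ids = data_access.read_ids(self.params["duplicates_path"])
        return self.params | {"duplicate_ids": set(ids)}


class FakeDataAccess:
    def read_ids(self, path: str) -> list:
        return ["d2", "d4"]  # stand-in for millions of ids


runtime = FilterTransformRuntime({"duplicates_path": "ids.parquet"})
config = runtime.get_transform_config(FakeDataAccess())
t = FilterTransform(config)
out = t.transform([{"doc_id": f"d{i}"} for i in range(1, 6)])
print([r["doc_id"] for r in out])  # → ['d1', 'd3', 'd5']
```

The point of the split is that the expensive read happens once in the runtime, not in every transform instance.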
Thank you for the pointer. I will use the transform_runtime.py to address this issue.
Search before asking
Component
Library/core
Feature
In the context of filtering, in Spark fuzzy dedup, we need to pass a very large table (millions/billions of rows) of document ids that have been identified as duplicates, and must be removed from the dataset, to the transform objects running on each worker. An ideal solution would be to pass these tables as part of the transform configuration that the framework distributes to the transform objects at initialization. However, I am not aware of any capability to pass a really large binary object in these configurations.
A similar requirement exists for the block-listing transforms, where a large table of restricted domains needs to be sent to each transform during initialization.
Are you willing to submit a PR?