Open prakhar-dhakar opened 8 months ago
A quick fix for this can be the following change in the Python SDK, in `apache_beam.dataframe.transforms.DataframeTransform.expand`:
```python
def expand(self, input_pcolls):
  # Avoid circular import.
  from apache_beam.dataframe import convert

  # Convert inputs to a flat dict.
  input_dict = _flatten(input_pcolls)  # type: Dict[Any, PCollection]
  proxies = _flatten(self._proxy) if self._proxy is not None else {
      tag: None
      for tag in input_dict
  }
  input_frames = {
      k: convert.to_dataframe(pc, proxies[k])
      for k, pc in input_dict.items()
  }  # type: Dict[Any, DeferredFrame]  # noqa: F821
```
The issue occurs because no label is passed to `convert.to_dataframe()` (the label defaults to `None`). In that case the label is derived from the variable name of the PCollection argument, and inside the dict comprehension that name is `pc` for every input, so the pipeline fails because of transforms with duplicate names.
If we pass the label `str(k)` when calling the function, the issue is resolved, as shown below.

```python
      k: convert.to_dataframe(pc, proxies[k], str(k))
```

This is a very small fix that lets the pipeline be used as intended.
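Purely as an illustration (not part of the SDK change itself), here is a minimal standalone sketch of the same mechanism; the dict keys, element values, and transform labels are made up for the example. The commented-out comprehension reproduces the duplicate-label error, while passing an explicit per-key label avoids it:

```python
import apache_beam as beam
from apache_beam.dataframe import convert

# Hypothetical inputs: two schema-aware PCollections, flattened into a dict
# the same way DataframeTransform.expand does internally.
with beam.Pipeline() as p:
  pcolls = {
      'first': p | 'CreateFirst' >> beam.Create([beam.Row(a=1), beam.Row(a=2)]),
      'second': p | 'CreateSecond' >> beam.Create([beam.Row(a=3), beam.Row(a=4)]),
  }

  # Without an explicit label, to_dataframe() derives one from the variable
  # name of its argument. Inside the comprehension that name is always "pc",
  # so the second call reuses the label "BatchElements(pc)" and raises
  # RuntimeError: A transform with label ... already exists in the pipeline.
  #
  # frames = {k: convert.to_dataframe(pc) for k, pc in pcolls.items()}

  # Passing a unique label per input (the proposed str(k)) avoids the clash.
  frames = {
      k: convert.to_dataframe(pc, label=str(k)) for k, pc in pcolls.items()
  }
```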
Hey @prakhar-dhakar, thanks for the report and the suggested fix. I agree that is a reasonable approach. Would you be open to creating a pull request with your suggested fix?
Sure, I will create a pull request with the changes.
What happened?
To reproduce the issue, pass multiple PCollections into a single DataframeTransform, as sketched below. The original reproduction comes from this Stack Overflow question: https://stackoverflow.com/questions/70937308/apache-beam-multiple-pcollection-dataframetransform-issue
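A minimal sketch of such a reproduction (the PCollection names, element values, and the merge callable are illustrative, not the exact Stack Overflow code; only the shape of the pipeline matters):

```python
import apache_beam as beam
from apache_beam.dataframe.transforms import DataframeTransform

# Two schema-aware PCollections fed into one DataframeTransform.
with beam.Pipeline() as p:
  customers = p | 'CreateCustomers' >> beam.Create([
      beam.Row(id=1, name='alice'),
      beam.Row(id=2, name='bob'),
  ])
  orders = p | 'CreateOrders' >> beam.Create([
      beam.Row(id=1, total=10.0),
      beam.Row(id=2, total=20.0),
  ])

  # Pipeline construction fails here with:
  #   RuntimeError: A transform with label
  #   "TransformedDF/BatchElements(pc)" already exists in the pipeline
  joined = (
      {'customers': customers, 'orders': orders}
      | 'TransformedDF' >> DataframeTransform(
          lambda customers, orders: customers.merge(orders, on='id')))
```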
To explain the issue in detail: when we pass multiple PCollections to `DataframeTransform`, we get the following error back: `RuntimeError: A transform with label "TransformedDF/BatchElements(pc)" already exists in the pipeline`. This happens even though the PCollections are schema-aware.
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components