Open baeminbo opened 1 month ago
Thanks @baeminbo for a detailed repro.
Eventually, we plan to switch to the cloudpickle pickler, which doesn't require saving the main session.
Structuring a pipeline as a package is the best way to avoid having to pass --save_main_session
and can also help provide better structure for complex pipelines. A few examples:
- https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies
- https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/main.py
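As a rough illustration of why packaging helps, here is a minimal sketch that uses the standard-library `pickle` as a stand-in for Beam's pickler (an assumption; Beam's own pickler differs in detail). `json.dumps` stands in for a function shipped in an installed package, and the name `packaged_fn` is hypothetical.

```python
import pickle
from json import dumps as packaged_fn  # stand-in for a function from an installed package

# Functions that live in an importable module are pickled by reference:
# the payload records the module and function names, not the function's code.
payload = pickle.dumps(packaged_fn)
stored_by_reference = b"json" in payload and b"dumps" in payload

# Unpickling re-imports the module and looks the name up again, which works
# in any process where the package is installed.
restored = pickle.loads(payload)
same_object = restored is packaged_fn

print(stored_by_reference, same_object)
```

With stdlib `pickle`, a function defined in the launcher script would instead be recorded against `__main__`, which is a different module on a worker. Code that lives in a package avoids that dependency on the main session.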
What happened?
`WriteToBigQuery` with the `STORAGE_WRITE_API` method can cause a pickling error [1] if
- the `WriteToBigQuery` step is applied after a multi-output step
- `--save_main_session` is `True`.

See this example of code and run script to reproduce this error.
There are 3 possible mitigations:
- Apply `WriteToBigQuery` after a single-output step (example).
- Use `--pickle_library=cloudpickle`.

[1] See the full output at https://gist.github.com/baeminbo/bd23df65e5604cf24213c2e1d6a46a25
Issue Priority
Priority: 3 (minor)
Issue Components