GoogleCloudPlatform / dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Apache License 2.0
89 stars 53 forks

Replace File Polling With Pub Sub for GCS #122

Closed Goutam1511 closed 1 year ago

Goutam1511 commented 1 year ago

Currently, the Dataflow de-identification pipeline uses the FileIO Watch transform to list existing files and poll for new files in the input GCS folder. This approach is affected by a known Apache Beam issue that causes out-of-memory (OOM) errors when the pipeline runs for a long time. The recommended solution is to detect new files through Pub/Sub notifications from GCS instead of polling. This will eliminate the OOM errors and improve the overall reliability of the pipeline.
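To illustrate the notification-driven approach described above, here is a minimal sketch (not the code from this change) of turning a GCS Pub/Sub notification into a file path that a pipeline could read. It is written in Python for brevity and relies only on the `eventType`, `bucketId`, and `objectId` message attributes that GCS attaches to its notifications. The function name and the sample bucket and object values are hypothetical.

```python
from typing import Mapping, Optional

# Event type GCS sets when an object has been successfully written.
OBJECT_FINALIZE = "OBJECT_FINALIZE"


def gcs_path_from_notification(attributes: Mapping[str, str]) -> Optional[str]:
    """Return gs://<bucket>/<object> for a newly written object, else None.

    `attributes` is the attribute map of one Pub/Sub message published by a
    GCS notification configuration (hypothetical helper, for illustration).
    """
    # Ignore deletes, archives, and metadata updates: only new files matter.
    if attributes.get("eventType") != OBJECT_FINALIZE:
        return None
    bucket = attributes.get("bucketId")
    name = attributes.get("objectId")
    # Skip malformed messages rather than failing the whole pipeline.
    if not bucket or not name:
        return None
    return f"gs://{bucket}/{name}"


# Hypothetical sample attributes, shaped like a real GCS notification.
sample = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-input-bucket",
    "objectId": "incoming/data.csv",
}
print(gcs_path_from_notification(sample))
```

Because each new file arrives as one message, the pipeline does not need to keep a growing record of every file it has already seen in order to tell new files from old ones, which is the kind of state a long-running polling loop accumulates.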

Testing Conducted -

Goutam1511 commented 1 year ago

Added a few comments @Goutam1511

@dup05 Addressed them. Kindly review and approve if all looks good.