Currently, the Dataflow Deid Pipeline uses the FileIO Watch transform to list existing files and poll for new files in the input GCS folder. This approach hits a known Apache Beam issue that causes out-of-memory (OOM) errors when the pipeline runs for a long time. The recommended solution is to detect new files through Pub/Sub notifications from GCS instead of polling, which should eliminate the OOM errors and improve the overall reliability of the pipeline.
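For reviewers unfamiliar with the notification format, here is a minimal sketch of how a GCS Pub/Sub notification can be mapped to a file path. It is not the code in this PR. The function name `gcs_path_from_notification` is hypothetical; the attribute keys `eventType`, `bucketId`, and `objectId` and the event type `OBJECT_FINALIZE` are the ones GCS attaches to its Pub/Sub notifications.

```python
from typing import Mapping, Optional


def gcs_path_from_notification(attributes: Mapping[str, str]) -> Optional[str]:
    """Return the gs:// path of a newly written object, or None to skip the message.

    Hypothetical helper for illustration; the pipeline's actual handling may differ.
    """
    # Only OBJECT_FINALIZE signals that a new object (or new generation) was written.
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None

    bucket = attributes.get("bucketId")
    name = attributes.get("objectId")
    if not bucket or not name:
        # Malformed notification: nothing to process.
        return None

    return f"gs://{bucket}/{name}"
```

With this shape, each notification yields at most one path, so the pipeline does not need to keep state about every file it has already seen, which is the state that grows under polling.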
Testing Conducted -
Tested without --GCSNotificationTopic and --processExistingFiles=false flags
Tested with only --GCSNotificationTopic flag
Tested with default --processExistingFiles set to true
Tested with both flags true
Long-running experiment started to capture throughput and memory improvements. Results will be captured internally.