**Describe the bug**
When running ingestion for the `gcs` source with `stateful_ingestion.enabled: true`, I get the error:
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (gcs): Checkpointing provider DatahubIngestionCheckpointingProvider already registered.
Even when stateful ingestion is disabled in the recipe, the log still shows these lines:
INFO {datahub.ingestion.source.state.stateful_ingestion_base:241} - Stateful ingestion will be automatically enabled, as datahub-rest sink is used or `datahub_api` is specified
[...]
INFO {datahub.ingestion.run.pipeline:571} - Processing commit request for DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ALWAYS, has_errors=False, has_warnings=False
WARNING {datahub.ingestion.source.state_provider.datahub_ingestion_checkpointing_provider:95} - No state available to commit for DatahubIngestionCheckpointingProvider
INFO {datahub.ingestion.run.pipeline:591} - Successfully committed changes for DatahubIngestionCheckpointingProvider.
**To Reproduce**
Prerequisites:
- a GCS bucket
- a service account with the Storage Object Viewer role on the bucket

Steps to reproduce the behavior:
1. Create a minimal recipe like the one shown below.
2. Run `datahub ingest run -c <my minimal recipe>.yml`
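A recipe along these lines reproduces it; the `path_specs` and `credential` values are placeholders from my setup, and the sink server/token are redacted:

```yaml
pipeline_name: gcs-pipeline
source:
  type: gcs
  config:
    path_specs:
      # placeholder bucket/pattern; point this at your own bucket
      - include: "gs://<my-bucket>/**/*.csv"
    credential:
      # HMAC key of the service account (placeholders)
      hmac_access_id: "<hmac access id>"
      hmac_access_secret: "<hmac access secret>"
    stateful_ingestion:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: "<gms server url>"
    token: "<access token>"
```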
**Expected behavior**
The pipeline should run without errors and write the state correctly, removing any stale metadata if configured.
**Additional context**
I think I have already found a fix that I will commit in the near future. The root cause is that the gcs source creates an equivalent s3 source, which re-registers the checkpointing provider. Also, the platform attribute is missing, which leads to the state being written to the platform "default".
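To illustrate the mechanism, here is a simplified sketch (hypothetical class and method names, not the actual DataHub code): the gcs source and the s3 source it creates internally share the same pipeline context, so the second registration of the checkpointing provider hits the "already registered" guard.

```python
# Simplified illustration of the failure mode; the class and method names
# below are hypothetical stand-ins, not DataHub's actual implementation.


class PipelineContext:
    def __init__(self) -> None:
        # checkpointing providers registered by sources, keyed by provider name
        self._checkpointers: dict = {}

    def register_checkpointer(self, name: str) -> None:
        if name in self._checkpointers:
            # guard that produces "Checkpointing provider ... already registered."
            raise ValueError(f"Checkpointing provider {name} already registered.")
        self._checkpointers[name] = object()


class StatefulSource:
    def __init__(self, ctx: PipelineContext) -> None:
        # every stateful source registers the checkpointing provider on init
        ctx.register_checkpointer("DatahubIngestionCheckpointingProvider")


class S3Source(StatefulSource):
    pass


class GCSSource(StatefulSource):
    def __init__(self, ctx: PipelineContext) -> None:
        super().__init__(ctx)  # first registration
        # the gcs source builds an equivalent s3 source from the same context,
        # which triggers the second registration and the error above
        self.s3_source = S3Source(ctx)


if __name__ == "__main__":
    GCSSource(PipelineContext())  # raises the "already registered" error
```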