Replication key field `finished_at` of `runs` stream can sometimes be null

edgarrmondragon commented 1 year ago

From a Slack conversation:

2. Incremental replication where the replication key is sometimes null: Again in tap-dbt, the runs stream is set to replicate incrementally using finished_at as the replication key. However this field is sometimes NULL for our runs. Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?

I need to dig into the API docs to see what's going on and maybe come up with a workaround (other than overriding the replication method in the Singer catalog).

Help from other users of this tap is more than welcome!
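
For reference, here is a minimal sketch of the catalog override mentioned above. It assumes the catalog follows the Singer convention of a stream-level metadata entry with an empty breadcrumb and a `replication-method` key. The helper name `force_full_table` and the sample catalog are hypothetical, and whether tap-dbt honors this particular key has not been verified here.

```python
import copy


def force_full_table(catalog: dict, stream_name: str) -> dict:
    """Return a copy of a Singer catalog with one stream set to FULL_TABLE."""
    updated = copy.deepcopy(catalog)
    for stream in updated.get("streams", []):
        if stream.get("tap_stream_id") != stream_name:
            continue
        # Stream-level metadata is the entry whose breadcrumb is empty.
        for entry in stream.setdefault("metadata", []):
            if entry.get("breadcrumb") == []:
                entry.setdefault("metadata", {})["replication-method"] = "FULL_TABLE"
                break
        else:
            # No stream-level entry yet, so add one.
            stream["metadata"].append(
                {"breadcrumb": [], "metadata": {"replication-method": "FULL_TABLE"}}
            )
    return updated


# Hypothetical, heavily trimmed catalog for illustration only.
catalog = {
    "streams": [
        {
            "tap_stream_id": "runs",
            "stream": "runs",
            "schema": {},
            "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}],
        }
    ]
}

patched = force_full_table(catalog, "runs")
```

The function works on a copy so the discovered catalog stays untouched and the override can be dropped again easily.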

mjsqu commented 1 year ago

The finished_at property was chosen for the runs stream because it is one of the keys that API requests can be ordered by. Unfortunately, the ordering keys do not appear to be documented, but you can discover them by hitting the following endpoints:

api/v2/accounts/1/runs/?order_by=-updated_at
api/v2/accounts/1/runs/?order_by=-finished_at

At our site, the first request returns a 400 error whose message lists the accepted order_by keys:

{
    "status": {
        "code": 400,
        "is_success": false,
        "user_message": "The request was invalid. Please double check the provided data and try again.",
        "developer_message": ""
    },
    "data": {
        "reason": "Invalid order_by value. Use one of [id, created_at, finished_at, -id, -created_at, -finished_at] instead."
    }
}

Ascending or descending:

id
created_at
finished_at
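
Since the keys are only exposed through that error message, one way to read them programmatically is to parse the `reason` string. This is a rough sketch: the helper name `allowed_order_by` is hypothetical, and it assumes the message keeps the bracketed, comma-separated format shown above.

```python
import re


def allowed_order_by(error_body: dict) -> list:
    """Extract the order_by values listed in the 400 response body."""
    reason = error_body.get("data", {}).get("reason", "")
    # The accepted values appear between square brackets in the message.
    match = re.search(r"\[([^\]]*)\]", reason)
    if match is None:
        return []
    return [key.strip() for key in match.group(1).split(",") if key.strip()]


# Response body copied from the comment above.
response = {
    "status": {
        "code": 400,
        "is_success": False,
        "user_message": "The request was invalid. Please double check the provided data and try again.",
        "developer_message": "",
    },
    "data": {
        "reason": "Invalid order_by value. Use one of [id, created_at, finished_at, -id, -created_at, -finished_at] instead."
    },
}

keys = allowed_order_by(response)
```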

mjsqu commented 1 year ago

The problem with using created_at is that the following scenario may occur:

A new run id=1234 is created at 10am
Another run id=1235 is created at 10:05am
At 11am, both runs are still active. The tap runs without a state bookmark and extracts all runs. It stores the highest created_at value as 10:05am
At 11:15am, run id=1234 finishes, the dbt Cloud record is updated with finishing status, finished_at etc.
At 11:30am the tap runs in incremental mode. It checks off the records in reverse created_at order and stops when it reaches 10:05am - creating and outputting a final RECORD message containing id=1235.
The updated status of id=1234 is not extracted because the created_at value for that run is 10am, before the bookmark value.

I think that makes sense, but please feel free to check my logic.
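
The scenario above can be checked with a small simulation. This is only a sketch of the logic as described, not the tap's actual code: `incremental_sync` is a hypothetical stand-in that walks runs in descending created_at order and stops once it passes the bookmark, and the calendar date is arbitrary.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Build a timestamp on an arbitrary day from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 1, 1, hour, minute)


def incremental_sync(runs, bookmark):
    """Emit runs in descending created_at order, stopping below the bookmark."""
    emitted = []
    for run in sorted(runs, key=lambda r: r["created_at"], reverse=True):
        if bookmark is not None and run["created_at"] < bookmark:
            break
        emitted.append(dict(run))
    new_bookmark = max((r["created_at"] for r in emitted), default=bookmark)
    return emitted, new_bookmark


runs = [
    {"id": 1234, "created_at": at("10:00"), "finished_at": None},
    {"id": 1235, "created_at": at("10:05"), "finished_at": None},
]

# 11am: first sync without a bookmark extracts both runs.
first_sync, bookmark = incremental_sync(runs, None)

# 11:15am: run 1234 finishes and its record is updated.
runs[0]["finished_at"] = at("11:15")

# 11:30am: incremental sync stops at the 10:05am bookmark.
second_sync, bookmark = incremental_sync(runs, bookmark)
missed = [r["id"] for r in runs if r["id"] not in {s["id"] for s in second_sync}]
```

After the second sync, `missed` contains run 1234, whose updated finished_at was never re-extracted.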

I was motivated to create an incremental replication method for the runs endpoint because we have a lot of job runs at our site. However, if you have lower volumes of runs, full-table replication may be preferable. Is it possible to select that style of replication and override the incremental method?

mjsqu commented 1 year ago

Just noted the Slack comment said:

Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?

That invalidates the final paragraph of my previous comment.

MeltanoLabs / tap-dbt

Replication key field `finished_at` of `runs` stream can sometimes be null #213