MeltanoLabs / tap-gitlab

Singer.io Tap for extracting data from Gitlab's API
GNU Affero General Public License v3.0
11 stars 27 forks source link

Discussion: Allow start_date to override the state timestamp when it is more recent #42

Closed pnadolny13 closed 2 years ago

pnadolny13 commented 2 years ago

In GitLab by @tomelliff on Oct 8, 2020, 21:21

I'd like to be able to use the state between runs to minimise how much data we pull back from Gitlab.

Unfortunately the heaviest part for us is tracking back through to the first created status job as this is non terminal. Ultimately this will nearly always track back to one of the first few pipelines for some of our projects because we have manual jobs that are blocking other jobs that are then left in the created status.

To get around this I'm dynamically setting the start_date by running the following command first:

if [ $INCREMENTAL = true ]; then
  export GITLAB_API_START_DATE=$(date --date "1 month ago" --iso-8601=seconds)
else
  export GITLAB_API_START_DATE=2020-01-01T00:00:00Z
fi

This then means we only track back a month on incremental runs when there is no state. However this doesn't work when there is state because it will use the state instead and only falls back to start_date when the state is empty for the entity.

I'd like to propose the following change:

diff --git a/tap_gitlab/__init__.py b/tap_gitlab/__init__.py
index b38a292..a3ccc14 100644
--- a/tap_gitlab/__init__.py
+++ b/tap_gitlab/__init__.py
@@ -169,7 +169,7 @@ def get_url(entity, id, secondary_id=None, start_date=None):

 def get_start(entity):
-    if entity not in STATE:
+    if entity not in STATE or parse_datetime(STATE[entity]) < parse_datetime(CONFIG['start_date']):
         STATE[entity] = CONFIG['start_date']
     return STATE[entity]

This feels potentially a little bit controversial to me for some reason but if someone states a start_date that is more recent than the state is tracking it is likely because they are deliberately trying to override the state being too old for an entity. I might be missing a use case here though.

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Oct 13, 2020, 13:57

@tomelliff Per https://github.com/singer-io/getting-started/blob/master/docs/CONFIG_AND_STATE.md#special-fields, "start_date should be used on first sync to indicate how far back to grab records", which doesn't say anything about subsequent syncs, but I think it makes sense to consistently treat it as a lower bound on "how far back to grab records", even if the state suggests that we have in the past gone back earlier.

I'd gladly accept a contribution with this change!

pnadolny13 commented 2 years ago

In GitLab by @tomelliff on Oct 14, 2020, 07:36

mentioned in commit tomelliff/tap-gitlab@1a45e52283593208f35a90426265b947070a603b

pnadolny13 commented 2 years ago

In GitLab by @tomelliff on Oct 14, 2020, 07:37

mentioned in merge request !28

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Oct 14, 2020, 11:15

assigned to @tomelliff

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Oct 27, 2020, 14:38

mentioned in commit c0f729e8deeed0aade015139870cc800411d6280