MeltanoLabs / tap-gitlab

Singer.io Tap for extracting data from Gitlab's API
GNU Affero General Public License v3.0
8 stars 25 forks source link

Implement discovery mode and stream selection - [merged] #29

Closed pnadolny13 closed 2 years ago

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Jan 27, 2021, 20:59

Merges discovery -> master

Closes #25

I ended up running into a situation where I need discovery on this tap, so I took a stab at #25. I think there's still some room for improvement, but I wanted to lead with the minimum number of changes to start the conversation.

A few quirks:

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Jan 28, 2021, 17:07

I just noticed !33 which once implemented might make this easier, but I don't see it as a blocker

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 4, 2021, 13:29

Result of meltano invoke tap-gitlab --discover > tap-gitlab.catalog.json: tap-gitlab.catalog.json

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 4, 2021, 13:42

Commented on tap_gitlab/init.py line 585

Please move CATALOG.get_stream('users') to a variable so that we don't need to repeat it here and on line 575!

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 4, 2021, 14:00

@rabidaudio Thanks for working on this!

projects/groups get queried regardless of they are selected, because they are necessary for the other streams

Makes sense.

the global CATALOG is gross, but avoided adding an argument to every method

Makes sense, works for me.

What would be the best way to handle fetch_merge_request_commits and fetch_pipelines_extended? Ideally we'd switch these with selection. How do we deprecate them and how do they interact with selection?

Good question.

Since the merge_request_commits and pipelines_extended streams are significantly more expensive than the others (they require one request per MR and pipeline, instead of a single index endpoint returning multiple records), I think it makes sense to continue to require users to explicitly opt-in to them, to prevent every new users from finding the tap unreasonably slow.

Ideally, the tap would flag in its catalog that a certain stream should not be selected by default, and the spec actually defines a selected-by-default metadata property that indicates "if a node in the schema should be replicated if a user has not expressed any opinion on whether or not to replicate it," but unfortunately Meltano doesn't currently respect that option because it can't currently distinguish between the select extra's default value of ['*.*'] (meaning all available streams, all available fields), and a user-provided value of *.* coming from meltano.yml, so in both cases it'll just select all streams that show up in the catalog as inclusion: available. To support select-by-default properly, Meltano would need to have an alternative to * that doesn't mean "all available", but "all selected-by-default": https://gitlab.com/meltano/meltano/-/issues/2556

This means that as things stand, a setting is the most appropriate way to opt-in to a specific stream, and we don't need to deprecate the settings.

The way it's implemented right now, if the settings are disabled, the streams don't show up at all, but we could also let them show up with inclusion: unsupported, so that Meltano won't select them by default but it will be clear in meltano select tap-gitlab --list --all that they exist. The docs can then explain that the streams are only supported when the appropriate setting is enabled.

When the setting is enabled, I think including the streams with inclusion: available as we have now makes sense, even if we could also use inclusion: automatic to make sure they're always included and can't be deselected. Letting the streams be deselected even when the setting is enabled is useful though, since it makes the --select and --exclude flags on meltano elt work.

So I propose leaving the settings and implementing behavior like inclusion: fetch_<stream> ? "available" : "unsupported"

Perhaps pipelines_extended wouldn't need to be an extra table anymore, but that would be a much bigger breaking change

Definitely, and it may be unexpected that marking some of the pipelines fields as selected would result in an extra request per pipeline, slowing down things significantly.

Especially since Meltano would also select these fields automatically, and the slow behavior would be the default behavior.

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 4, 2021, 14:00

assigned to @rabidaudio and @DouweM

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 16:01

Commented on tap_gitlab/init.py line 585

changed this line in version 2 of the diff

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 16:01

added 1 commit

Compare with previous version

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 16:02

Commented on tap_gitlab/init.py line 585

good catch, fixed

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 16:06

I thought about this during development. What about selection by groups and projects, something like:

select:
 - 'meltano/meltano/*.*'
 - '!meltano/tap-gitlab/pipelines.*'

The problem is to run selection you'd need to query every group and project, which isn't viable on GitLab.com or any other large GitLab installation. But maybe there's a similar way to accomplish the same thing?

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 4, 2021, 16:17

I like the idea of having some standardized way to define scopes/filters across taps, but since they wouldn't actually be individual streams that we'd expect to get their own tables in the destination database, I don't think the catalog and stream selection rules are the most appropriate place for this.

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 18:03

added 1 commit

Compare with previous version

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 4, 2021, 18:04

This makes a lot of sense. Thanks for the help!

Updated in 35b33a45650c52ca64f5773b9e2ff9c40d039640 @DouweM

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 5, 2021, 13:28

Commented on tap_gitlab/init.py line 669

Can we only print the schema if the stream is selected? I think we can use stream.is_selected() here, instead of directly checking the settings.

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 5, 2021, 13:36

@rabidaudio I just tested this locally, and it works like a charm, except for some unnecessary SCHEMA messages (see thread above)

I've also created an issue to update the Meltano discovery.yml and docs once that's resolved and this MR goes in: https://gitlab.com/meltano/meltano/-/issues/2557

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 8, 2021, 12:06

Commented on tap_gitlab/init.py line 669

changed this line in version 4 of the diff

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 8, 2021, 12:06

added 1 commit

Compare with previous version

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 8, 2021, 12:07

Commented on tap_gitlab/init.py line 669

Good catch @DouweM, fixed

pnadolny13 commented 2 years ago

In GitLab by @rabidaudio on Feb 8, 2021, 12:09

Commented on tap_gitlab/init.py line 669

get_selected_streams logs which streams are skipped, allowing me to remove all the manual logging

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 9, 2021, 12:12

resolved all threads

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 9, 2021, 12:19

changed title from {-Discovery-} to {+Implement discovery mode and stream selection+}

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 9, 2021, 12:20

mentioned in commit 3d9b793bf6cf5e0b11c1bfeab9e6bae36f1e7fa9

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 9, 2021, 12:21

@rabidaudio Merged and released as v0.9.12, thanks for your help!

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 10, 2021, 11:08

mentioned in commit fcf63394ba400fbf13f946fded546ae4337f1c2b

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 10, 2021, 11:09

mentioned in merge request !35

pnadolny13 commented 2 years ago

In GitLab by @DouweM on Feb 10, 2021, 11:20

Temporarily reverted in https://gitlab.com/meltano/tap-gitlab/-/merge_requests/35