Closed pnadolny13 closed 2 years ago
In GitLab by @rabidaudio on Jan 28, 2021, 17:07
I just noticed !33 which once implemented might make this easier, but I don't see it as a blocker
In GitLab by @DouweM on Feb 4, 2021, 13:29
Result of meltano invoke tap-gitlab --discover > tap-gitlab.catalog.json
: tap-gitlab.catalog.json
In GitLab by @DouweM on Feb 4, 2021, 13:42
Commented on tap_gitlab/init.py line 585
Please move CATALOG.get_stream('users')
to a variable so that we don't need to repeat it here and on line 575!
In GitLab by @DouweM on Feb 4, 2021, 14:00
@rabidaudio Thanks for working on this!
projects/groups get queried regardless of they are selected, because they are necessary for the other streams
Makes sense.
the global
CATALOG
is gross, but avoided adding an argument to every method
Makes sense, works for me.
What would be the best way to handle
fetch_merge_request_commits
andfetch_pipelines_extended
? Ideally we'd switch these with selection. How do we deprecate them and how do they interact with selection?
Good question.
Since the merge_request_commits
and pipelines_extended
streams are significantly more expensive than the others (they require one request per MR and pipeline, instead of a single index endpoint returning multiple records), I think it makes sense to continue to require users to explicitly opt-in to them, to prevent every new users from finding the tap unreasonably slow.
Ideally, the tap would flag in its catalog that a certain stream should not be selected by default, and the spec actually defines a selected-by-default
metadata property that indicates "if a node in the schema should be replicated if a user has not expressed any opinion on whether or not to replicate it," but unfortunately Meltano doesn't currently respect that option because it can't currently distinguish between the select
extra's default value of ['*.*']
(meaning all available streams, all available fields), and a user-provided value of *.*
coming from meltano.yml
, so in both cases it'll just select all streams that show up in the catalog as inclusion: available
. To support select-by-default
properly, Meltano would need to have an alternative to *
that doesn't mean "all available", but "all selected-by-default": https://gitlab.com/meltano/meltano/-/issues/2556
This means that as things stand, a setting is the most appropriate way to opt-in to a specific stream, and we don't need to deprecate the settings.
The way it's implemented right now, if the settings are disabled, the streams don't show up at all, but we could also let them show up with inclusion: unsupported
, so that Meltano won't select them by default but it will be clear in meltano select tap-gitlab --list --all
that they exist. The docs can then explain that the streams are only supported when the appropriate setting is enabled.
When the setting is enabled, I think including the streams with inclusion: available
as we have now makes sense, even if we could also use inclusion: automatic
to make sure they're always included and can't be deselected. Letting the streams be deselected even when the setting is enabled is useful though, since it makes the --select
and --exclude
flags on meltano elt
work.
So I propose leaving the settings and implementing behavior like inclusion: fetch_<stream> ? "available" : "unsupported"
Perhaps
pipelines_extended
wouldn't need to be an extra table anymore, but that would be a much bigger breaking change
Definitely, and it may be unexpected that marking some of the pipelines
fields as selected would result in an extra request per pipeline, slowing down things significantly.
Especially since Meltano would also select these fields automatically, and the slow behavior would be the default behavior.
In GitLab by @DouweM on Feb 4, 2021, 14:00
assigned to @rabidaudio and @DouweM
In GitLab by @rabidaudio on Feb 4, 2021, 16:01
Commented on tap_gitlab/init.py line 585
changed this line in version 2 of the diff
In GitLab by @rabidaudio on Feb 4, 2021, 16:01
added 1 commit
In GitLab by @rabidaudio on Feb 4, 2021, 16:02
Commented on tap_gitlab/init.py line 585
good catch, fixed
In GitLab by @rabidaudio on Feb 4, 2021, 16:06
I thought about this during development. What about selection by groups and projects, something like:
select:
- 'meltano/meltano/*.*'
- '!meltano/tap-gitlab/pipelines.*'
The problem is to run selection you'd need to query every group and project, which isn't viable on GitLab.com or any other large GitLab installation. But maybe there's a similar way to accomplish the same thing?
In GitLab by @DouweM on Feb 4, 2021, 16:17
I like the idea of having some standardized way to define scopes/filters across taps, but since they wouldn't actually be individual streams that we'd expect to get their own tables in the destination database, I don't think the catalog and stream selection rules are the most appropriate place for this.
In GitLab by @rabidaudio on Feb 4, 2021, 18:03
added 1 commit
In GitLab by @rabidaudio on Feb 4, 2021, 18:04
This makes a lot of sense. Thanks for the help!
Updated in 35b33a45650c52ca64f5773b9e2ff9c40d039640 @DouweM
In GitLab by @DouweM on Feb 5, 2021, 13:28
Commented on tap_gitlab/init.py line 669
Can we only print the schema if the stream is selected? I think we can use stream.is_selected()
here, instead of directly checking the settings.
In GitLab by @DouweM on Feb 5, 2021, 13:36
@rabidaudio I just tested this locally, and it works like a charm, except for some unnecessary SCHEMA
messages (see thread above)
I've also created an issue to update the Meltano discovery.yml and docs once that's resolved and this MR goes in: https://gitlab.com/meltano/meltano/-/issues/2557
In GitLab by @rabidaudio on Feb 8, 2021, 12:06
Commented on tap_gitlab/init.py line 669
changed this line in version 4 of the diff
In GitLab by @rabidaudio on Feb 8, 2021, 12:06
added 1 commit
In GitLab by @rabidaudio on Feb 8, 2021, 12:07
Commented on tap_gitlab/init.py line 669
Good catch @DouweM, fixed
In GitLab by @rabidaudio on Feb 8, 2021, 12:09
Commented on tap_gitlab/init.py line 669
get_selected_streams
logs which streams are skipped, allowing me to remove all the manual logging
In GitLab by @DouweM on Feb 9, 2021, 12:12
resolved all threads
In GitLab by @DouweM on Feb 9, 2021, 12:19
changed title from {-Discovery-} to {+Implement discovery mode and stream selection+}
In GitLab by @DouweM on Feb 9, 2021, 12:20
mentioned in commit 3d9b793bf6cf5e0b11c1bfeab9e6bae36f1e7fa9
In GitLab by @DouweM on Feb 9, 2021, 12:21
@rabidaudio Merged and released as v0.9.12, thanks for your help!
In GitLab by @DouweM on Feb 10, 2021, 11:08
mentioned in commit fcf63394ba400fbf13f946fded546ae4337f1c2b
In GitLab by @DouweM on Feb 10, 2021, 11:09
mentioned in merge request !35
In GitLab by @DouweM on Feb 10, 2021, 11:20
Temporarily reverted in https://gitlab.com/meltano/tap-gitlab/-/merge_requests/35
In GitLab by @rabidaudio on Jan 27, 2021, 20:59
Merges discovery -> master
Closes #25
I ended up running into a situation where I need discovery on this tap, so I took a stab at #25. I think there's still some room for improvement, but I wanted to lead with the minimum number of changes to start the conversation.
A few quirks:
CATALOG
is gross, but avoided adding an argument to every methodfetch_merge_request_commits
andfetch_pipelines_extended
? Ideally we'd switch these with selection. How do we deprecate them and how do they interact with selection? Perhapspipelines_extended
wouldn't need to be an extra table anymore, but that would be a much bigger breaking change