MeltanoLabs / tap-gitlab

Singer.io Tap for extracting data from Gitlab's API
GNU Affero General Public License v3.0

Load variables #64

Closed by wersly 2 years ago

wersly commented 2 years ago

How was this code tested?

This code was tested locally with a meltano.yml file along the following lines (some fields are redacted or replaced with placeholder values for privacy):

version: 1
send_anonymous_usage_stats: true
project_id: ***
plugins:
  extractors:
  - name: tap-gitlab
    pip_url: git+https://github.com/wersly/tap-gitlab.git@load-variables
    config:
      api_url: ***
      private_token: ***
      groups: some/group
      projects: some/group/project
      start_date: '1970-01-01T00:00:00Z'
      ultimate_license: true
      fetch_merge_request_commits: false
      fetch_pipelines_extended: false
    capabilities:
    - state
    - catalog
    - discover
  loaders:
  - name: target-bigquery
    variant: adswerve
    config:
      project_id: foo
      dataset_id: bar
      location: ***
      validate_records: true
      add_metadata_columns: true
      replication_method: truncate
      table_prefix: some_prefix_
  - name: target-jsonl
    variant: andyh1203
    config:
      destination_path: output
      do_timestamp_file: true
  - name: target-sqlite
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/target-sqlite.git
    config:
      database: foo.db
  transformers:
  - name: dbt
    pip_url: 'dbt-core~=1.0.0 dbt-postgres~=1.0.0 dbt-redshift~=1.0.0 dbt-snowflake~=1.0.0
      dbt-bigquery~=1.0.0

      '
    config:
      target: big-query
  files:
  - name: dbt
    pip_url: git+https://gitlab.com/meltano/files-dbt.git@config-version-2
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-gitlab
        select:
        - project_variables.*
        - group_variables.*

And the following meltano operations were performed:

1. meltano elt tap-gitlab target-jsonl
2. meltano elt tap-gitlab target-sqlite
3. meltano elt tap-gitlab target-bigquery
4. meltano elt tap-gitlab target-bigquery --transform run

In all cases, the project_variables and group_variables streams were loaded to their targets with the provided schemas. For the database targets (sqlite, bigquery), tables were truncated as expected per the default replication methods for project_variables and group_variables when the meltano operations were run multiple times. Likewise, dbt was able to run on the data without regression.

Please let me know if there are any additional tests or modifications you'd like me to run on this.

Risks, Tradeoffs, Backwards Compatibility Issues

None that I can really see. The sync_variables function is essentially a copy-paste from the sync_labels function (very similar pattern in the GitLab API between Group/Project labels and Group/Project variables), so any risks assumed there are also assumed here.

This is not a tradeoff, but I would like to point out the key_properties I've selected for the group and project variables. The GitLab API does not assign any sort of id field to these data, so the project/group id (assigned by the sync_variables function) and the key (from GitLab) are taken together to form a compound key. Variable keys must be unique within the CI/CD variables of a single Project or Group, but they can be duplicated across Projects/Groups, so the combination of Group/Project ID and variable key seems like the correct natural key for this data to me.
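
To illustrate the compound key described above, here is a minimal sketch. The record values and the upsert helper are hypothetical and not taken from the tap; the helper only mimics how a key-aware target might deduplicate rows.

```python
# Hypothetical records shaped like project CI/CD variables. Only `key` and
# `value` come from GitLab; `project_id` is attached during sync.
records = [
    {"project_id": 1, "key": "DEPLOY_ENV", "value": "staging"},
    {"project_id": 2, "key": "DEPLOY_ENV", "value": "production"},
    {"project_id": 1, "key": "DEPLOY_ENV", "value": "staging-v2"},  # re-sync of row 1
]

KEY_PROPERTIES = ["project_id", "key"]  # compound natural key


def upsert(rows, key_properties):
    """Keep the latest row per key, as a key-aware target might do."""
    table = {}
    for row in rows:
        table[tuple(row[k] for k in key_properties)] = row
    return table


by_compound_key = upsert(records, KEY_PROPERTIES)
by_key_only = upsert(records, ["key"])

print(len(by_compound_key))  # one row per (project, key) pair
print(len(by_key_only))      # rows from different projects collide
```

Keying on `key` alone would let a variable from one project overwrite a same-named variable from another, which is why the owning id is part of the key.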

See:

edgarrmondragon commented 2 years ago

Hi @wersly and thanks for submitting this PR!

I feel like users wouldn't want these streams enabled by default as they might inadvertently land secrets in their data warehouse.

So what do you think about making them opt-in in the tap configuration with:

CONFIG = {
    'api_url': "https://gitlab.com/api/v4",
    'private_token': None,
    'start_date': None,
    'groups': '',
    'ultimate_license': False,
    'fetch_merge_request_commits': False,
    'fetch_pipelines_extended': False,
    'fetch_group_variables': False,
    'fetch_project_variables': False,
}

...

STREAM_CONFIG_SWITCHES = (
    'merge_request_commits',
    'pipelines_extended',
    'group_variables',
    'project_variables',
)

...

CONFIG['ultimate_license'] = truthy(CONFIG['ultimate_license'])
CONFIG['fetch_merge_request_commits'] = truthy(CONFIG['fetch_merge_request_commits'])
CONFIG['fetch_pipelines_extended'] = truthy(CONFIG['fetch_pipelines_extended'])
CONFIG['fetch_group_variables'] = truthy(CONFIG['fetch_group_variables'])
CONFIG['fetch_project_variables'] = truthy(CONFIG['fetch_project_variables'])
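
For context, a rough sketch of why the flags go through truthy() and how the switches could gate the streams. The truthy() body and the filtering step below are assumptions for illustration; the helper in the tap may be implemented differently.

```python
# Assumption: truthy() accepts common string spellings of "true".
TRUTHY = ("true", "1", "yes", "on")


def truthy(val) -> bool:
    """Normalise a config value (bool, str, None) to a real boolean."""
    return str(val).lower() in TRUTHY


STREAM_CONFIG_SWITCHES = ("group_variables", "project_variables")

# Values as they might arrive from a JSON config or environment variables.
config = {"fetch_group_variables": "false", "fetch_project_variables": "True"}

# A plain bool() cast is unsafe here: any non-empty string is truthy in Python.
print(bool("false"))    # True
print(truthy("false"))  # False

# Hypothetical gating step: only streams whose fetch_<name> flag is set stay enabled.
enabled = [
    name
    for name in STREAM_CONFIG_SWITCHES
    if truthy(config.get(f"fetch_{name}", False))
]
print(enabled)
```

With both flags defaulting to False, neither variables stream runs unless the user opts in explicitly.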

wersly commented 2 years ago

Hi @edgarrmondragon - wonderful idea, thanks for catching that!

Your suggested config/code looks good to me. I'll get around to implementing and testing this all for you soon.

wersly commented 2 years ago

Alright @edgarrmondragon, I've implemented and tested your suggestions. It all looks good to me!

I did the following tests:

  1. Ran meltano elt tap-gitlab target-jsonl with the following meltano.yml config snippet:

    extractors:
    - name: tap-gitlab
      pip_url: git+https://github.com/wersly/tap-gitlab.git@load-variables
      config:
        api_url: ***
        private_token: ***
        groups: some/group
        projects: some/group/project
        start_date: '1970-01-01T00:00:00Z'
        ultimate_license: true
        fetch_merge_request_commits: false
        fetch_pipelines_extended: false

    Result: the default false values for fetch_group_variables and fetch_project_variables were assumed; no group or project variables were extracted (these streams were skipped).

  2. Ran meltano elt tap-gitlab target-jsonl with the following meltano.yml config snippet:

    extractors:
    - name: tap-gitlab
      pip_url: git+https://github.com/wersly/tap-gitlab.git@load-variables
      config:
        api_url: ***
        private_token: ***
        groups: some/group
        projects: some/group/project
        start_date: '1970-01-01T00:00:00Z'
        ultimate_license: true
        fetch_merge_request_commits: false
        fetch_pipelines_extended: false
        fetch_group_variables: false
        fetch_project_variables: false

    Result: the specified configuration was applied; no group or project variables were extracted (these streams were skipped).

  3. Ran meltano elt tap-gitlab target-jsonl with the following meltano.yml config snippet:

    extractors:
    - name: tap-gitlab
      pip_url: git+https://github.com/wersly/tap-gitlab.git@load-variables
      config:
        api_url: ***
        private_token: ***
        groups: some/group
        projects: some/group/project
        start_date: '1970-01-01T00:00:00Z'
        ultimate_license: true
        fetch_merge_request_commits: false
        fetch_pipelines_extended: false
        fetch_group_variables: true
        fetch_project_variables: true

    Result: specified configuration is applied; group and project variables were successfully extracted.
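
The three cases above can be summarised as a small check. This is only a sketch of the expected behaviour: the variable_streams function and the DEFAULTS dict are hypothetical stand-ins, not the tap's actual code.

```python
# Hypothetical defaults mirroring the opt-in flags discussed in this thread.
DEFAULTS = {"fetch_group_variables": False, "fetch_project_variables": False}


def variable_streams(user_config: dict) -> list:
    """Return the variable streams that would be synced for a given config."""
    merged = {**DEFAULTS, **user_config}  # user values override defaults
    return [
        stream
        for stream in ("group_variables", "project_variables")
        if merged[f"fetch_{stream}"]
    ]


cases = {
    "flags omitted": {},
    "flags false": {"fetch_group_variables": False, "fetch_project_variables": False},
    "flags true": {"fetch_group_variables": True, "fetch_project_variables": True},
}

results = {label: variable_streams(cfg) for label, cfg in cases.items()}
for label, streams in results.items():
    print(label, "->", streams or "skipped")
```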

I also updated the README describing these new pieces of configuration and their motivations. Let me know if the language I've used there is sufficient, or if you would prefer something else be written.

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

wersly commented 2 years ago

Thanks @edgarrmondragon!