Widen / tap-rest-api-msdk

`tap-rest-api-msdk` is a Singer tap for generic rest-apis, built with the Meltano SDK for Singer Taps.
Apache License 2.0
22 stars 25 forks source link

tap-rest-api-msdk

singer_rest_api_tap

tap-rest-api-msdk is a Singer tap for generic rest-apis. The novelty of this particular tap is that it will auto-discover the stream's schema.

This is particularly useful if you have a stream with a very large and complex schema or a stream that outputs records with varying schemas for each record. Can also be used for simpler more reliable streams.

There are many forms of Authentication supported by this tap. By default for legacy support, you can pass Authentication via headers. If you want to use the built in support for Authentication, this tap supports

Please note that OAuthJWTAuthentication has not been developed. If you are interested in contributing this, please fork and make a pull request.

Built with the Meltano SDK for Singer Taps.

Gratitude goes to anelendata for inspiring this "SDK-ized" version of their tap.

Installation

If using via Meltano, add the following lines to your meltano.yml file and run the following command:

plugins:
  extractors:
    - name: tap-rest-api-msdk
      namespace: tap_rest_api_msdk
      pip_url: tap-rest-api-msdk
      executable: tap-rest-api-msdk
      capabilities:
        - state
        - catalog
        - discover
      settings:
        - name: api_url
          kind: string
        - name: next_page_token_path
          kind: string
        - name: pagination_request_style
          kind: string
        - name: pagination_response_style
          kind: string
        - name: use_request_body_not_params
          kind: boolean
        - name: backoff_type
          kind: string
        - name: backoff_param
          kind: string
        - name: backoff_time_extension
          kind: integer
        - name: store_raw_json_message
          kind: boolean
        - name: pagination_page_size
          kind: integer
        - name: pagination_results_limit
          kind: integer
        - name: pagination_next_page_param
          kind: string
        - name: pagination_limit_per_page_param
          kind: string
        - name: pagination_total_limit_param
          kind: string
        - name: pagination_initial_offset
          kind: integer
        - name: streams
          kind: array
        - name: name
          kind: string
        - name: path
          kind: string
        - name: params
          kind: object
        - name: headers
          kind: object
        - name: records_path
          kind: string
        - name: primary_keys
          kind: array
        - name: replication_key
          kind: string
        - name: except_keys
          kind: array
        - name: num_inference_records
          kind: integer
        - name: start_date
          kind: date_iso8601
        - name: source_search_field
          kind: string
        - name: source_search_query
          kind: string
        - name: auth_method
          kind: string
        - name: api_key
          kind: object
        - name: client_id
          kind: password
        - name: client_secret
          kind: password
        - name: username
          kind: string
        - name: password
          kind: password
        - name: bearer_token
          kind: password
        - name: refresh_token
          kind: oauth
        - name: grant_type
          kind: string
        - name: scope
          kind: string
        - name: access_token_url
          kind: string
        - name: redirect_uri
          kind: string
        - name: oauth_extras
          kind: object
        - name: oauth_expiration_secs
          kind: integer
        - name: aws_credentials
          kind: object
meltano install extractor tap-rest-api-msdk

Configuration

Accepted Config Options

A full list of supported settings and capabilities for this tap is available by running:

tap-rest-api-msdk --about

Top-level config options.

Parameters that appear at the stream-level will overwrite their top-level counterparts except where noted in the stream-level params. Otherwise, the values provided at the top-level will be the default values for each stream.:

Stream level config options.

Parameters that appear at the stream-level will overwrite their top-level counterparts except where noted below:

Top-Level Authentication config options.

Complex Authentication

The previous section showed out of the box methods for a single factor of authentication e.g. x-api-key, basic or oauth. If the API requires multiple forms of authentication, you may need to pass some of the authentication methods via the headers to be be combined with the main auth_method.

Example: An API may use OAuth2 for authentication but also requires an X-API-KEY to be supplied as well. In this situation pass the X-API-KEY as part of the headers config, and the rest of the config should be set for OAuth e.g.

Some servers may require additional information like a Request-Context which is usually Base64 encoded. If this is the case it should be included in the headers config as well.

Example:

Pagination

API Pagination is a complex topic as there is no real single standard, and many different implementations. Unless options are provided, both the request and results style type default to the default, which is the pagination style originally implemented. Where possible, this tap utilises the Meltano SDK paginators https://sdk.meltano.com/en/latest/reference.html#pagination .

Default Request Style

The default request style for pagination is using a JSONPath Paginator to locate the next page token.

Default Response Style

The default response style for pagination is described below:

Additional Request / Paginator Styles

There are additional request styles supported as follows for pagination.

Additional Response Styles

There are additional response styles supported as follows.

JSON Path for extracting tokens

The next_page_token_path and records_path use JSONPath to locate sections within the request reponse.

The following example extracts the URL for the next pagination page.

    "next_page_token_path": "$.link[?(@.relation=='next')].url."

The following example demonstrates the power of JSONPath extensions by further splitting the URL and extracting just the parameters. Note: This is not required for FHIR API's but is provided for illustration of added functionality for complex use cases.

    "next_page_token_path": "$.link[?(@.relation=='next')].url.`split(?, 1, 1)`"

The JSONPath Evaluator website is useful to test the correct json path expression to use.

Example json response from a FHIR API.

```json
{
  "resourceType": "Bundle",
  "id": "44f2zf06-g53c-4218-a3ef-08bb6c2fde4a",
  "meta": {
    "lastUpdated": "2022-06-28T18:25:01.165+12:00"
  },
  "type": "searchset",
  "total": 63,
  "link": [
    {
      "relation": "self",
      "url": "https://myexample_fhir_api_url/base_folder/ExampleService?_count=10&_getpageoffset=10&services-provided-type=MY_INITIAL_EXAMPLE_SERVICE"
    },
    {
      "relation": "next",
      "url": "https://myexample_fhir_api_url/base_folder?_getpages=44f2zf06-g53c-4218-a3ef-08bb6c2fde4a&_getpagesoffset=10&_count=10&_pretty=true&_bundletype=searchset"
    }
  ],
  "entry": [
    {
      "fullUrl": "https://myexample_fhir_api_url/base_folder/ExampleService/example-service-123456",
      "resource": {
        "resourceType": "ExampleService",
        "id": "example-service-123456"
      }
    }
  ]

}


  Note: If you wish to extract the body from example GET request response above the following configuration parameter `records_path` will return the actual json content.
  ```json
  "records_path": "$.entry[*].resource"

Example settings for different API's

This section provides examples of settings for accessing different API's. The tap configuration examples are provide in the form of environment variables. You could easily provide a configuration file config.json instead of environment variables.

Where config values have with <removed .. > replace the text with your Authentication and API config.

Microsoft Graph API v1.0

This example uses the jsonpath paginator. In this example, it requires a Microsoft Azure AD admin to register an APP to obtain an OAuth Token to perform an OAuth flow with the Microsoft Graph API. The details below may be different based on your setup, adjust accordingly.

Result: Two streamed datasets, one whoami a simple json response about yourself, two a sharepoint list my_sharepoint_list.

# Access MSOFFICE objects via the GraphAPI
export TAP_REST_API_MSDK_API_URL=https://graph.microsoft.com
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="jsonpath_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="hateoas_body"
export TAP_REST_API_MSDK_NEXT_PAGE_TOKEN_PATH="$['@odata.nextLink']"
export TAP_REST_API_MSDK_START_DATE="2001-01-01T00:00:00.00+12:00"
export TAP_REST_API_MSDK_AUTH_METHOD="oauth"
export TAP_REST_API_MSDK_USERNAME="<removed place in UPN/email address>"
export TAP_REST_API_MSDK_PASSWORD="<removed place in password>"
export TAP_REST_API_MSDK_GRANT_TYPE="password"
export TAP_REST_API_MSDK_ACCESS_TOKEN_URL="https://login.microsoftonline.com/<removed place in Azure AAD APP ID>/oauth2/v2.0/token"
export TAP_REST_API_MSDK_CLIENT_ID="<removed place in OAuth Client ID>"
export TAP_REST_API_MSDK_CLIENT_SECRET="<removed place in OAuth Client Secret>"
export TAP_REST_API_MSDK_SCOPE="<removed place in client scope url e.g. https://graph.microsoft.com/user.read>"
export TAP_REST_API_MSDK_STREAMS='[{"name": "whoami", "path": "/v1.0/me", "primary_keys": ["id"]},{"name": "my_sharepoint_list", "path": "/v1.0/sites/<removed place in SharePoint Site ID>/Lists/<removed place in SharePoint list id>/items/?expand=columns,items(expand=fields)", "primary_keys": ["id"], "records_path": "$.value[*].fields"}]'

Gitlab API

This example uses the simple header paginator and returns 50 records from the Gitlab API for Projects. Note: There is an exception raised due to the 50 record limit - this is an example hence the limit.

# Access Gitlab projects via the GitLab API
export TAP_REST_API_MSDK_API_URL=https://gitlab.com/api/v4/projects
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="simple_header_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESULTS_LIMIT=50
export TAP_REST_API_MSDK_STREAMS='[{"name": "gitlab_projects", "primary_keys": ["id"]}]'

You could authenticate to Gitlab using a Personal Access Token (PAT) by adding this config.

export TAP_REST_API_MSDK_HEADERS='{"Authorization": "Bearer <removed PAT bearer token>"}'

GitHub API

This example uses the headerlink paginator and returns approximately 250 records from the GitHub API for Projects.

# Access GitHub users via the GitHub API
export TAP_REST_API_MSDK_API_URL=https://api.github.com/users
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="restapi_header_link_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="header_link"
export TAP_REST_API_MSDK_PAGINATION_PAGE_SIZE=50
export TAP_REST_API_MSDK_PAGINATION_RESULTS_LIMIT=250
export TAP_REST_API_MSDK_STREAMS='[{"name": "github_users", "primary_keys": ["id"]}]'

You could authenticate to GitHub using a Personal Access Token (PAT) by adding this config.

export TAP_REST_API_MSDK_HEADERS='{"Authorization": "Bearer <removed PAT bearer token>"}'

FHIR API

This example uses the jsonpath paginator to access a FHIR API. It uses the hateoas response style to process the next tokens.

This particular configuration will do an intial load of all data for a given resource defined in the streams config from the 01-Jan-2001. It will in subsequent runs incrementally pull changed data based on the lastUpdated timestamp by searching for records greater than the highest last updated timestamp. In this example the PlanDefinition FHIR resource is being extracted.

You will need appropriate OAuth Token details provided by the Administrator of the API.

export TAP_REST_API_MSDK_API_URL=<remove put in the FHIR API url>
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="jsonpath_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="hateoas_body"
export TAP_REST_API_MSDK_NEXT_PAGE_TOKEN_PATH="$.link[?(@.relation=='next')].url"
export TAP_REST_API_MSDK_START_DATE="2001-01-01T00:00:00.00+12:00"
export TAP_REST_API_MSDK_AUTH_METHOD="oauth"
export TAP_REST_API_MSDK_GRANT_TYPE="client_credentials"
export TAP_REST_API_MSDK_ACCESS_TOKEN_URL="https://login.microsoftonline.com/<removed place in Azure AAD APP ID>/oauth2/v2.0/token"
export TAP_REST_API_MSDK_CLIENT_ID="<removed place in OAuth Client ID>"
export TAP_REST_API_MSDK_CLIENT_SECRET="<removed place in OAuth Client Secret>"
export TAP_REST_API_MSDK_SCOPE="<removed place in client scope url>"
export TAP_REST_API_MSDK_STREAMS='[{"name":"plan_definition","path":"/PlanDefinition","primary_keys":["id"],"records_path":"$.entry[*].resource","replication_key":"meta_lastUpdated","search_parameter":"_lastUpdated","source_search_query": "gt$last_run_date"}]'

NOAA API Example

This example uses the offset paginator to access the NOAA API to return location categories. In this example the offset tokens are not in the default location of pagination so the next_page_token_path is set to the NOAA API offset location in the json response i.e. '$.metadata.resultset'. This example also sets a limit parameter in the streams to only return 5 records at a time to prove the pagination is working.

# Access Locations Categories objects via the NOAA API
export TAP_REST_API_MSDK_API_URL=https://www.ncei.noaa.gov/cdo-web/api/v2
export TAP_REST_API_MSDK_HEADERS='{"token": "<enter NOAA token>"}'
export TAP_REST_API_MSDK_NEXT_PAGE_TOKEN_PATH='$.metadata.resultset'
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="offset_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="style1"
export TAP_REST_API_MSDK_PAGINATION_TOTAL_LIMIT_PARAM="count"
export TAP_REST_API_MSDK_STREAMS='[{"name": "locationcategories", "params": {"limit": "5"}, "path": "/locationcategories", "primary_keys": ["id"], "records_path": "$.results[*]"}]'

dbt Cloud API Example

This example uses the offset paginator to access the dbt Cloud API to return location categories. In this example the offset tokens are not in the default location of pagination so the next_page_token_path is set to the dbt API offset location in the json response i.e. '$.extra'. This example also sets the streams record_path to "$.data[*]" which is the location of the data.

# Access Locations Categories objects via the dbt Cloud API
# Access Gitlab objects via the dbt Cloud API
export TAP_REST_API_MSDK_API_URL=https://<removed your url>.getdbt.com/api/v2/accounts/<removed account id>
export TAP_REST_API_MSDK_HEADERS='{"Authorization": "Bearer <removed place in bearer token>"}'
export TAP_REST_API_MSDK_NEXT_PAGE_TOKEN_PATH='$.extra'
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="offset_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="style1"
export TAP_REST_API_MSDK_PAGINATION_TOTAL_LIMIT_PARAM="total_count"
export TAP_REST_API_MSDK_STREAMS='[{"name": "jobs", "path": "/jobs", "primary_keys": ["id"], "records_path": "$.data[*]"}]'

AWS OpenSearch API Example

This complex example uses the AWS4Auth authenticator to provide signed AWS credentials in the requests to the AWS OpenSearch API endpoint. The auth_method is set to 'aws', and the required aws_credentials are provided.

Note: The AWS authentication does support boto3 environment variables and aws_profiles.

For pagination, the next page token is located in the last returned record. Using JSON Path the appropriate token response can be extracted via '$.hits.hits[-1:].sort' (-1 selects the last record in the array) and this is set in the next_page_token_path config setting. The API parameter used to select the next page is 'search_after' and this is set in the pagination_next_page_param config setting. To enable pagination in OpenSearch an API parameter named 'sort' must be set to a unique key e.g. '_id'. The number of records to returned per page is controlled via an API parameter called 'size'.

The OpenSearch API has a complex incremental replication query which must be sent in the request body. This is enabled by setting the use_request_body_not_params to True.

Finally set the replication suite of config settings ( start_date, replication_key, source_search_field, and source_search_query ) to enable incremental replication of data since the last run. For OpenSearch there is a complex query template which must be set in the streams source_search_query config setting.

Unlike most API requests, the API query is against an API parameter named query rather than the name of the API field. For this reason the source_search_field is set to 'query' in the streams array. Additionally, the streams record_path to "$.hits.hits[*]" which is the location of the records in the requests response.

# Access AWS objects via the AWS Open/Elastic Search API
export TAP_REST_API_MSDK_API_URL="https://<endpoint>.<aws region>.<aws service>.amazonaws.com"
export TAP_REST_API_MSDK_AWS_CREDENTIALS='{"aws_access_key_id": "<removed aws access key id>", "aws_secret_access_key": "removed aws secret access key>", "aws_region": "<aws region e.g. us‑east‑1>", "aws_service": "<aws service e.g. es for opensearch>", "create_signed_credentials": true}'
export TAP_REST_API_MSDK_START_DATE="2001-01-01T00:00:00.00+12:00"
export TAP_REST_API_MSDK_PAGINATION_REQUEST_STYLE="jsonpath_paginator"
export TAP_REST_API_MSDK_PAGINATION_RESPONSE_STYLE="offset"
export TAP_REST_API_MSDK_USE_REQUEST_BODY_NOT_PARAMS=true
export TAP_REST_API_MSDK_NEXT_PAGE_TOKEN_PATH='$.hits.hits[-1:].sort'
export TAP_REST_API_MSDK_PAGINATION_NEXT_PAGE_PARAM="search_after"
export TAP_REST_API_MSDK_AUTH_METHOD='aws'
export TAP_REST_API_MSDK_STREAMS='[{"name": "careplan", "params": {"size": 100, "sort": "_id"}, "path": "/careplan/_search", "primary_keys": [], "records_path": "$.hits.hits[*]", "replication_key": "_source_meta_lastUpdated", "source_search_field": "query", "source_search_query": "{\"bool\": {\"filter\": [{\"range\": { \"meta.lastUpdated\": { \"gt\": \"$last_run_date\" }}}] }}"}]'

Usage

You can easily run tap-rest-api-msdk by itself or in a pipeline using Meltano.

Executing the Tap Directly

tap-rest-api-msdk --version
tap-rest-api-msdk --help
tap-rest-api-msdk --config CONFIG --discover > ./catalog.json

or

bash tap-rest-api-msdk --config=config.sample.json

Developer Resources

Initialize your Development Environment

pipx install poetry
poetry install

Create and Run Tests

Create tests within the tests/ directory and then run:

poetry run pytest

You can also test the tap-rest-api-msdk CLI interface directly using poetry run:

poetry run tap-rest-api-msdk --help

Continuous Integration

Run through the full suite of tests and linters by running

poetry run tox -e py

These must pass in order for PR's to be merged.

Testing with Meltano

Note: This tap will work in any Singer environment and does not require Meltano. Examples here are for convenience and to streamline end-to-end orchestration scenarios.

This project comes with an example meltano.yml project file already created.

Next, install Meltano (if you haven't already) and any needed plugins:

# Install meltano
pipx install meltano
# Initialize meltano within this directory
cd tap-rest-api-msdk
meltano install

Now you can test and orchestrate using Meltano:

# Test invocation:
meltano invoke tap-rest-api-msdk --version
# OR run a test `elt` pipeline:
meltano elt tap-rest-api-msdk target-jsonl

SDK Dev Guide

See the dev guide for more instructions on how to use the SDK to develop your own taps and targets.