ets / tap-spreadsheets-anywhere

GNU Affero General Public License v3.0
31 stars 62 forks source link

tap-spreadsheets-anywhere

This is a Singer tap that reads data from spreadsheet files (CSVs, Excel, JSONs, custom-delimited) accessible from any smart_open supported transport and produces JSON-formatted data following the Singer spec. This tap is developed for compatibility with Meltano.

How to use it

tap-spreadsheets-anywhere works together with any other Singer Target to move data from any smart_open supported transport to any target destination. smart_open supports a wide range of transport options out of the box, including:

Multiple individual files with the same schema can be configured & ingested into the same "Table" for processing.

Compression

smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened.

Configuration

The Meltano configuration for this tap must contain the key 'tables' which holds an array of json objects describing each set of targeted source files.

config:
  extractors:
  - name: tap-spreadsheets-anywhere
    namespace: tap_spreadsheets_anywhere
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    executable: tap-spreadsheets-anywhere
    capabilities:
    - catalog
    - discover
    - state
    config:
      tables: []

To run this tap directly from the CLI, a config.json file must be supplied which holds the 'tables' array. A sample config file is available here sample_config.json and a description of the required/optional fields declared within it follow. The configuration is also captured in tables_config_util.py as a voluptuous-based configuration for validation purposes.

{
    "tables": [
        {
            "path": "s3://my-s3-bucket",
            "name": "target_table_name",
            "pattern": "subfolder/common_prefix.*",
            "start_date": "2017-05-01T00:00:00Z",
            "key_properties": [],
            "format": "csv",
            "delimiter": "|",
            "quotechar": '"',
            "universal_newlines": false,
            "skip_initial": 0,
            "sample_rate": 10,
            "max_sampling_read": 2000,
            "max_sampled_files": 3,
            "prefer_number_vs_integer": true,
            "prefer_schema_as_string": true,
            "selected": true,

            // for any field in the table, you can hardcode the json schema datatype to override
            // the schema infered through discovery mode. 
            // *Note Meltano users* - the scheam override support delivered in Meltano v1.41.1 is more robust
            //  and should be preferred to this tap-specific override functionality.  
            "schema_overrides": {
                "id": {
                    "type": ["null", "integer"],
                },
                // if you want the tap to enforce that a field is not nullable, you can do it like so:
                "first_name": {
                    "type": "string",
                }
            }
        },
        {
            "path": "sftp://username:password@host//path/file",
            "name": "another_table_name",
            "pattern": "subdir/.*User.*",
            "start_date": "2017-05-01T00:00:00Z",
            "key_properties": ["id"],
            "format": "excel", 
            // you must specify the worksheet name to pull from in your xls(x) file.
            "worksheet_name": "Names"
        }
    ]
}

Each object in the 'tables' array describes one or more CSV or Excel spreadsheet files that adhere to the same schema and are meant to be tapped as the source for a Singer-based data flow.

Automatic Config Generation

This is an experimental feature used to crawl a path and generate a config block for every file encountered. An intended use-case is where source files are organized in subdirectories by intended target table. This mode will generate a config block for each subdirectory and for each file format within it. The following example config file will crawl the s3 bucket my-example-bucket and produce config blocks for each folder under it where source files are detected.

{
    "tables": [
        {
            "crawl_config": true,
            "path": "s3://my-example-bucket",
            "pattern": ".*"
        }
    ]
}

Typically this mode will be used when there are many streams to be configured and processed. Therefore, generating the catalog independently is generally helpful.

meltano invoke --dump=catalog tap-spreadsheets-anywhere > my-catalog.json
meltano elt --catalog=my-catalog.json --job_id=my-job-state tap-spreadsheets-anywhere any-loader

JSON support

JSON files are expected to parse as a root-level array of objects where each object is a set of flat key-value pairs.

[
    { "name": "row one", "key": 42},
    { "name": "row two", "key": 43}
]

JSONL (JSON Lines) support

JSONL files are expected to parse as one object per line, where each row in a file is a set of key-value pairs.

{ "name": "row one", "key": 42}
{ "name": "row two", "key": 43}

Authentication and Credentials

This tap authenticates with target systems as described in the smart_open documentation here.

State

This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.

While state is maintained, only new files will be processed from subsequent runs.

Install and Run outside of Meltano

First, make sure Python 3 is installed on your system. Then, execute create_virtualenv.sh to create a local venv and install the necessary dependencies. If you are executing this tap outside of Meltano then you will need to supply the config.json file yourself. A sample configuration is available here sample_config.json You can invoke this tap directly with:

python -m tap_spreadsheets_anywhere --config config.json

History:

Copyright © 2020 Eric Simmerman