This is a Singer tap that reads data from spreadsheet files (CSVs, Excel, JSONs, custom-delimited) accessible from any smart_open supported transport and produces JSON-formatted data following the Singer spec. This tap is developed for compatibility with Meltano.
tap-spreadsheets-anywhere
works together with any other Singer Target to move data from any smart_open supported transport to any target destination. smart_open supports a wide range of transport options out of the box, including:
D
drive use "file://d:/data/subfolder"Multiple individual files with the same schema can be configured & ingested into the same "Table" for processing.
smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened.
The Meltano configuration for this tap must contain the key 'tables' which holds an array of json objects describing each set of targeted source files.
config:
extractors:
- name: tap-spreadsheets-anywhere
namespace: tap_spreadsheets_anywhere
pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
executable: tap-spreadsheets-anywhere
capabilities:
- catalog
- discover
- state
config:
tables: []
To run this tap directly from the CLI, a config.json file must be supplied which holds the 'tables' array.
A sample config file is available here sample_config.json and a description of the required/optional fields declared within it follow.
The configuration is also captured in tables_config_util.py as a voluptuous
-based configuration for validation purposes.
{
"tables": [
{
"path": "s3://my-s3-bucket",
"name": "target_table_name",
"pattern": "subfolder/common_prefix.*",
"start_date": "2017-05-01T00:00:00Z",
"key_properties": [],
"format": "csv",
"delimiter": "|",
"quotechar": '"',
"universal_newlines": false,
"skip_initial": 0,
"sample_rate": 10,
"max_sampling_read": 2000,
"max_sampled_files": 3,
"prefer_number_vs_integer": true,
"prefer_schema_as_string": true,
"selected": true,
// for any field in the table, you can hardcode the json schema datatype to override
// the schema infered through discovery mode.
// *Note Meltano users* - the scheam override support delivered in Meltano v1.41.1 is more robust
// and should be preferred to this tap-specific override functionality.
"schema_overrides": {
"id": {
"type": ["null", "integer"],
},
// if you want the tap to enforce that a field is not nullable, you can do it like so:
"first_name": {
"type": "string",
}
}
},
{
"path": "sftp://username:password@host//path/file",
"name": "another_table_name",
"pattern": "subdir/.*User.*",
"start_date": "2017-05-01T00:00:00Z",
"key_properties": ["id"],
"format": "excel",
// you must specify the worksheet name to pull from in your xls(x) file.
"worksheet_name": "Names"
}
]
}
Each object in the 'tables' array describes one or more CSV or Excel spreadsheet files that adhere to the same schema and are meant to be tapped as the source for a Singer-based data flow.
number
be used anyway so that floats are not considered errors? The default is false but true can help in situations where floats only appear rarely in sources and may not be detected through discovery sampling.This is an experimental feature used to crawl a path and generate a config block for every file encountered. An intended use-case is where source files are organized in subdirectories by intended target table. This mode will generate a config block for each subdirectory and for each file format within it. The following example config file will crawl the s3 bucket my-example-bucket and produce config blocks for each folder under it where source files are detected.
{
"tables": [
{
"crawl_config": true,
"path": "s3://my-example-bucket",
"pattern": ".*"
}
]
}
Typically this mode will be used when there are many streams to be configured and processed. Therefore, generating the catalog independently is generally helpful.
meltano invoke --dump=catalog tap-spreadsheets-anywhere > my-catalog.json
meltano elt --catalog=my-catalog.json --job_id=my-job-state tap-spreadsheets-anywhere any-loader
JSON files are expected to parse as a root-level array of objects where each object is a set of flat key-value pairs.
[
{ "name": "row one", "key": 42},
{ "name": "row two", "key": 43}
]
JSONL files are expected to parse as one object per line, where each row in a file is a set of key-value pairs.
{ "name": "row one", "key": 42}
{ "name": "row two", "key": 43}
This tap authenticates with target systems as described in the smart_open documentation here.
This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.
While state is maintained, only new files will be processed from subsequent runs.
First, make sure Python 3 is installed on your system. Then, execute create_virtualenv.sh
to create a local venv and install the necessary dependencies. If you are executing this tap outside of Meltano then you will need to supply the config.json file yourself. A sample configuration is available here sample_config.json
You can invoke this tap directly with:
python -m tap_spreadsheets_anywhere --config config.json
History:
Copyright © 2020 Eric Simmerman