anelendata / tap-bigquery

Singer.io tap for extracting data from BigQuery tables
Apache License 2.0

When available, use the state.json bookmark instead of the --start_datetime argument #12

Closed xarasco closed 4 years ago

xarasco commented 4 years ago

--start_datetime is a required argument.

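Currently, when it is provided, it overrides the last_update bookmark value from state.json. In the run below, the query starts from 2020-01-01 even though the bookmark in state.json is 2020-09-01: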

{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
    SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}

If --start_datetime continues to be required, the value in state.json should override the --start_datetime argument value.
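
To make the requested precedence concrete, here is a minimal sketch of "bookmark wins, command-line value is the fallback". It is not tap-bigquery's actual code: the function name resolve_start_datetime is hypothetical, and the state layout is assumed to match the state.json shown above (bookmarks -> stream name -> last_update).

```python
import json
from typing import Optional


def resolve_start_datetime(cli_start: str, state: Optional[dict], stream: str) -> str:
    """Hypothetical helper: prefer the stream's bookmark over the CLI value.

    Falls back to cli_start when there is no state, no entry for the
    stream, or no last_update key in that entry.
    """
    bookmarks = (state or {}).get("bookmarks", {})
    bookmark = bookmarks.get(stream, {}).get("last_update")
    return bookmark or cli_start


# State copied from the issue; the CLI value mirrors the start seen in the query.
state = json.loads(
    '{"currently_syncing": "system_checker", "bookmarks": '
    '{"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}'
)
chosen = resolve_start_datetime("2020-01-01T00:00:00", state, "system_checker")
print(chosen)
```

With this ordering, the first run (no state.json yet) would use --start_datetime, and later runs would resume from the bookmark.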

xarasco commented 4 years ago

Closing this issue. If start_datetime becomes an optional argument, I think it would be better for --start_datetime to override the state.json file.
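
The alternative described in this comment (an optional argument that wins when given) could look roughly like the sketch below. It assumes argparse-style parsing and a hypothetical helper pick_start; neither is confirmed to match how tap-bigquery handles its arguments.

```python
import argparse
from typing import Optional


def pick_start(cli_start: Optional[str], bookmark: Optional[str]) -> str:
    """Hypothetical helper: an explicit CLI value wins, else the bookmark."""
    if cli_start is not None:
        return cli_start
    if bookmark is not None:
        return bookmark
    raise ValueError("no --start_datetime given and no bookmark in state")


parser = argparse.ArgumentParser()
# Optional here (default None), unlike the required argument described in the issue.
parser.add_argument("--start_datetime", default=None)

bookmark = "2020-09-01T13:33:44.712431+00:00"  # last_update value from the issue's state.json

# Explicit lists stand in for real command lines so the sketch runs anywhere.
with_flag = parser.parse_args(["--start_datetime", "2020-01-01T00:00:00"])
without_flag = parser.parse_args([])

print(pick_start(with_flag.start_datetime, bookmark))
print(pick_start(without_flag.start_datetime, bookmark))
```

Under this policy, omitting the flag resumes from the bookmark, while passing it forces a re-sync from the given point.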