datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0
9.76k stars 2.88k forks source link

Snowflake Dataset LastUpdatedTimestamp from operational_data_for_time_window #8843

Closed drew-vz closed 10 months ago

drew-vz commented 1 year ago

When include_usage_stats: True and self.config.include_operational_stats: True in a Snowflake ingestion it runs the logic forthis self._get_snowflake_history() -> self._make_operations_query() -> SnowflakeQuery.operational_data_for_time_window , which in turn emits the operation_aspect containing a new LastUpdatedTimestamp. The issue is that the query for this hits the Account Usage which is proven to have sometimes significant lag. The query is also seemingly much more complex than it needs to be. This lag is causing the LastUpdated in the UI to be far behind what is should be. End users are concerned their data is stale. We ingest right after we finish loading to the snowflake tables. So the lastUpdatedTimestamp should reflect that.

To Reproduce When include_usage_stats: True and self.config.include_operational_stats: True in a Snowflake ingestion

Expected behavior We would expect the OperationAspect emited by the get_snowflake_history query result to show when that table was last updated, not a lagged value that is misrepresentative.

We have had to implement a custom pipeline that emits the Operation Aspect from our ETL and set include_operational_stats: False ourselves so it is up to date, as the snowflake ingestion was not doing so properly

This is the query we are concerned about... why cant it just hit information schema, and get the last_updated_at, which is much more reliable... SELECT -- access_history.query_id, -- only for debugging purposes access_history.query_start_time AS "QUERY_START_TIME", query_history.query_text AS "QUERY_TEXT", query_history.query_type AS "QUERY_TYPE", query_history.rows_inserted AS "ROWS_INSERTED", query_history.rows_updated AS "ROWS_UPDATED", query_history.rows_deleted AS "ROWS_DELETED", access_history.base_objects_accessed AS "BASE_OBJECTS_ACCESSED", access_history.direct_objects_accessed AS "DIRECT_OBJECTS_ACCESSED", -- when dealing with views, direct objects will show the view while base will show the underlying table access_history.objects_modified AS "OBJECTS_MODIFIED", -- query_history.execution_status, -- not really necessary, but should equal "SUCCESS" -- query_history.warehouse_name, access_history.user_name AS "USER_NAME", users.first_name AS "FIRST_NAME", users.last_name AS "LAST_NAME", users.display_name AS "DISPLAY_NAME", users.email AS "EMAIL", query_history.role_name AS "ROLE_NAME" FROM snowflake.account_usage.access_history access_history LEFT JOIN ( SELECT * FROM snowflake.account_usage.query_history WHERE query_history.start_time >= to_timestamp_ltz({start_time_millis}, 3) AND query_history.start_time < to_timestamp_ltz({end_time_millis}, 3) ) query_history ON access_history.query_id = query_history.query_id LEFT JOIN snowflake.account_usage.users users ON access_history.user_name = users.name WHERE query_start_time >= to_timestamp_ltz({start_time_millis}, 3) AND query_start_time < to_timestamp_ltz({end_time_millis}, 3) AND access_history.objects_modified is not null AND ARRAY_SIZE(access_history.objects_modified) > 0 ORDER BY query_start_time DESC

HHQuraishi commented 1 year ago

Can you please suggest some beginner friendly issues.

drew-vz commented 1 year ago

@HHQuraishi Im sorry Im not sure I follow what you mean.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 10 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.