Closed moltar closed 6 months ago
We apologize for the late response on this.
You are right that Athena is not a required component for using the Solution, and loading data from S3 directly into QuickSight is completely valid. There are a few reasons an organization may want to use Athena. One is restricting data access for certain groups through Lake Formation, where individuals in these groups use Athena as their means of viewing the data in S3 before building out any reporting. In addition, since the data lake is open and allows for adding new data sources beyond AMC, the data residing in S3 will not always have been crunched ahead of time. New data sources may need additional querying before being pulled downstream.
These are just a couple of examples of where Athena may be used.
Best, Andrew
Adding to this:
I found it very helpful to work with Athena views instead of querying S3 directly before visualizing data.
Adding an example below. With this approach, the Athena view will always select the latest file in case there are multiple files for the same date range (that is, the same time_window_start + time_window_end combination).
Furthermore, it removes rows filtered due to the aggregation threshold from the result (assuming you named this column aggthr_filtered).
CREATE OR REPLACE VIEW "select_latest_data_from_workflowtable" AS
WITH
-- Identify only the latest file for identical dates
select_latest_file AS (
    SELECT time_window_start, time_window_end, file_last_modified
    FROM (
        SELECT time_window_start, time_window_end, file_last_modified,
            ROW_NUMBER() OVER (PARTITION BY time_window_start, time_window_end ORDER BY file_last_modified DESC) AS rn
        FROM workflowtable_adhoc
    ) ranked
    WHERE rn = 1
)
-- Keep only rows from the latest file, ignoring rows filtered by the aggregation threshold
SELECT desired_table.*
FROM workflowtable_adhoc desired_table
INNER JOIN select_latest_file slf
    ON desired_table.file_last_modified = slf.file_last_modified
WHERE desired_table.aggthr_filtered = false
@davidbeckonline Thank you so much. You read my mind. We actually have this exact issue in the backlog to use Athena views to pick only the latest executions. Thanks for providing the working solution! 🎉
Thinking about this a little more, I am wondering whether the provided Athena query could be improved by truncating the timestamps to the date. With this setup, it should also catch scenarios where a workflow ran multiple times for the same date range but the users selected different time zones.
The view below assumes that your workflow adds the parameters BUILT_IN_PARAMETER('TIME_WINDOW_START') as time_window_start and BUILT_IN_PARAMETER('TIME_WINDOW_END') as time_window_end as part of the final SELECT. Furthermore, it assumes that you add "filteredMetricsDiscriminatorColumn": "aggthr_filtered" to your workflow.
The AMC Insights on AWS solution will automatically add the file_last_modified info.
CREATE OR REPLACE VIEW "workflow_table__view" AS
WITH
-- Define the main table reference
desired_aioa_table AS (
    SELECT *
    -- UPDATE: replace with the name of your own table
    FROM workflow_table_adoc
),
-- Identify only the latest file for identical dates
select_latest_file AS (
    SELECT
        SUBSTRING(time_window_start, 1, 10) AS truncated_time_window_start,
        SUBSTRING(time_window_end, 1, 10) AS truncated_time_window_end,
        file_last_modified
    FROM (
        SELECT
            time_window_start,
            time_window_end,
            file_last_modified,
            ROW_NUMBER() OVER (
                PARTITION BY SUBSTRING(time_window_start, 1, 10),
                             SUBSTRING(time_window_end, 1, 10)
                ORDER BY file_last_modified DESC
            ) AS rn
        FROM desired_aioa_table
    ) ranked
    WHERE rn = 1
)
-- Get the data from the main table, ignoring rows filtered by the aggregation threshold
SELECT desired_table.*
FROM desired_aioa_table desired_table
INNER JOIN select_latest_file slf
    ON desired_table.file_last_modified = slf.file_last_modified
WHERE desired_table.aggthr_filtered = false;
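For anyone who wants to sanity-check the deduplication logic before creating the view, here is a minimal local sketch. It is not Athena: it uses Python's built-in sqlite3 module as a stand-in (SQLite 3.25 or newer is needed for window functions), with SUBSTR in place of SUBSTRING. The table name workflow_table, the impressions column, and the sample timestamps are all made up for illustration. The sample data has two runs for the same date range submitted with different time zone offsets, and the newer file also contains a row flagged by the aggregation threshold.

```python
import sqlite3

# Hypothetical sample rows, not real AMC output.
# Columns: time_window_start, time_window_end, file_last_modified,
#          aggthr_filtered (0/1), impressions
ROWS = [
    # Older run, submitted with a -05:00 offset
    ("2024-01-01T00:00:00-05:00", "2024-01-08T00:00:00-05:00", "2024-01-09 10:00:00", 0, 100),
    # Newer run for the same dates, submitted in UTC (two rows in the same file)
    ("2024-01-01T00:00:00Z", "2024-01-08T00:00:00Z", "2024-01-10 10:00:00", 0, 120),
    ("2024-01-01T00:00:00Z", "2024-01-08T00:00:00Z", "2024-01-10 10:00:00", 1, 0),
    # A different date range with a single run
    ("2024-01-08T00:00:00Z", "2024-01-15T00:00:00Z", "2024-01-16 10:00:00", 0, 90),
]

# Same shape as the view above: rank files per truncated date range,
# keep the newest one, then drop rows hit by the aggregation threshold.
QUERY = """
WITH select_latest_file AS (
    SELECT file_last_modified
    FROM (
        SELECT
            file_last_modified,
            ROW_NUMBER() OVER (
                PARTITION BY SUBSTR(time_window_start, 1, 10),
                             SUBSTR(time_window_end, 1, 10)
                ORDER BY file_last_modified DESC
            ) AS rn
        FROM workflow_table
    ) ranked
    WHERE rn = 1
)
SELECT t.impressions
FROM workflow_table t
INNER JOIN select_latest_file slf
    ON t.file_last_modified = slf.file_last_modified
WHERE t.aggthr_filtered = 0
ORDER BY t.time_window_start
"""


def latest_impressions(rows):
    """Load rows into an in-memory SQLite table and run the dedup query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflow_table ("
        "time_window_start TEXT, time_window_end TEXT, "
        "file_last_modified TEXT, aggthr_filtered INTEGER, impressions INTEGER)"
    )
    conn.executemany("INSERT INTO workflow_table VALUES (?, ?, ?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return result
```

With the sample rows, latest_impressions(ROWS) returns [120, 90]: the older -05:00 run (100) is dropped because it lands in the same truncated date range as the newer UTC run, and the threshold-filtered row is excluded. If you partition on the full timestamps instead, as in the first view, the -05:00 run would be kept as well.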
Hi, sorry for opening an issue, but there are no discussions enabled, nor a comment section on the solutions page.
I'm trying to understand the value Athena provides in this solution given that the data coming from AMC is not truly queryable, and most likely has already been crunched on the AMC side.
Is it purely for the Glue catalogue capability?
In our solution, we went with direct S3 -> QuickSight ingestion, and I am wondering if maybe I am missing a key step here.
Thanks.