Auestion about Athena - Githubissues

moltar commented 10 months ago

Hi, sorry for opening an issue, but there are no discussions enabled, nor a comment section on the solutions page.

I'm trying to understand the value Athena provides in this solution given that the data coming from AMC is not truly queryable, and most likely has already been crunched at the AMC side.

Is it purely for the Glue catalogue capability?

In our solution, we went direct S3 -> QuickSight ingestion, and am wondering if maybe I am missing a key step here.

Thanks.

andrewmarriott-aws commented 3 months ago

We apologize for the late response on this.

You are right that Athena is not a requisite component for using the Solution and running S3 directly into QuickSight is completely valid. There are a few reasons an organization may want to use Athena. One would be restricting data access for certain groups through Lake Formation, where individuals in these groups would use Athena as their means of viewing the data in S3 before building out any reporting. In addition, since the data lake is open and allows for adding new data sources outside of just AMC, it is not always the case that the data residing in S3 will be crunched ahead of time. New data sources may need additional querying before being pulled downstream.

These are just a couple examples of where Athena may be used.

Best, Andrew

davidbeckonline commented 3 months ago

Adding to this:

I found it very helpful to work with Athena Views instead of direct query to S3 before visualizing data.

Adding an example below. With this approach, the Athena view will always select the latest file in case there are multiple files for the same date range (or "time_window_start + time_window_end combination").

Furthermore, it removes results filtered due to the aggregation threshold from the result (assuming you named this column aggthr_filtered).

CREATE OR REPLACE VIEW "select_latest_data_from_workflowtable" AS 

With
--identify only the latest file for identical dates
select_latest_file as (
    SELECT time_window_start, time_window_end, file_last_modified
    FROM (
        SELECT time_window_start, time_window_end, file_last_modified,
               ROW_NUMBER() OVER (PARTITION BY time_window_start, time_window_end ORDER BY file_last_modified DESC) AS rn
        FROM workflowtable_adhoc
    ) ranked
    WHERE rn = 1)

SELECT desired_table.*
FROM workflowtable_adhoc desired_table
    INNER JOIN select_latest_file slf
        on desired_table.file_last_modified = slf.file_last_modified
    WHERE (aggthr_filtered = false)

moltar commented 3 months ago

@davidbeckonline Thank you so much. You read my mind. We actually have this exact issue in the backlog to use Athena views to pick only the latest executions. Thanks for providing the working solution! 🎉

davidbeckonline commented 3 months ago

Thinking about this a little bit more, I am wondering whether the provided Athena query could be improved by reducing the timestamp info to the date. With this setup, it should also catch scenarios when a workflow ran multiple times for the same date range, but the users selected different time zones.

The function below assumes that the workflow you are running is taking advantage of adding parameters BUILT_IN_PARAMETER('TIME_WINDOW_START') as time_window_start and BUILT_IN_PARAMETER('TIME_WINDOW_END') as time_window_end as part of the final SELECT. Furthermore, this function assumes that you add "filteredMetricsDiscriminatorColumn": "aggthr_filtered" to your workflow.

The AMC Insights on AWS solution will automatically add the file_last_modified info.

CREATE OR REPLACE VIEW "workflow_table__view" AS 

WITH
-- Define the main table reference
desired_aioa_table AS (
    SELECT *
    -- UPDATE
    FROM workflow_table_adoc
),

-- Identify only the latest file for identical dates
select_latest_file AS (
    SELECT 
        SUBSTRING(time_window_start, 1, 10) AS truncated_time_window_start, 
        SUBSTRING(time_window_end, 1, 10) AS truncated_time_window_end, 
        file_last_modified
    FROM (
        SELECT 
            time_window_start, 
            time_window_end, 
            file_last_modified,
            ROW_NUMBER() OVER (
                PARTITION BY SUBSTRING(time_window_start, 1, 10), 
                             SUBSTRING(time_window_end, 1, 10) 
                ORDER BY file_last_modified DESC
            ) AS rn
        FROM desired_aioa_table
    ) ranked
    WHERE rn = 1
)

-- get the data from main table whereby we ignore data filtered by aggregation threshold
SELECT desired_table.*
FROM desired_aioa_table desired_table
INNER JOIN select_latest_file slf
    ON desired_table.file_last_modified = slf.file_last_modified
WHERE desired_table.aggthr_filtered = false;

aws-solutions / amazon-marketing-cloud-insights-on-aws

Auestion about Athena #4