cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 13 forks source link

Investigate / replace GTFS RT urls that fail to download #868

Closed machow closed 2 years ago

machow commented 2 years ago

cc @evansiroky @hunterowens @holly-g , from pairing with @Nkdiaz -- here's a query that pulls out distinct error messages, with columns for itp id and url number.

Note the following:

Here are itp ids and error counts for them

row itp_id url_number n_counts
1 183 0 2  
2 194 0 1  
3 200 0 2  
4 203 0 1  
5 217 0 26088  
6 235 0 26088  
7 257 0 26088  
8 269 0 45  
9 279 0 3  
10 281 0 2  
11 295 0 3  
12 350 0 1  
13 4 0 26088  
14 45 0 1  
15 98 0 17155

Here is the query we used, which contains full messages + urls:

WITH 
download_issues AS (
    SELECT
        textPayload,
        timestamp,
        REGEXP_EXTRACT(textPayload, "INFO:/gtfs-rt-archive.py:fetcher ([0-9]+)") AS itp_id,
        REGEXP_EXTRACT(textPayload, "INFO:/gtfs-rt-archive.py:fetcher [0-9]+/([0-9]+)") AS url_number
    -- note that we've moved the logs to gtfs_rt_logs.stdout, since the table name can't be changed
    -- but using this table for now, since it holds full data for Dec 14th
    FROM `cal-itp-data-infra.gtfs_rt.stdout`
    WHERE textPayload LIKE "%error fetching url%"
)

SELECT
    textPayload,
    itp_id,
    url_number,
    COUNT(*) AS n_counts
FROM download_issues
WHERE 
    DATE(timestamp) BETWEEN "2021-12-13" AND "2021-12-15"
GROUP BY 1, 2, 3
machow commented 2 years ago

@nkdiaz let's put a table of all the errors caught by this sql code into gtfs_rt_views, to make a table named gtfs_rt_extraction_errors. Rather than counts, let's just keep the raw rows, so people can drilldown from summaries in metabase.

machow commented 2 years ago

We tested out two tables in metabase --

Count of all errors by day (named GTFS RT || Feed Extraction Errors by Day)

Count of distinct errors by day (named GTFS RT || Distinct Feed Extraction Errors by Day)

Nkdiaz commented 2 years ago

GTFSRT-extraction-by-day GTFSRT-distinct-extraction-by-day

evansiroky commented 2 years ago

This issue is somewhat of a duplicate to #536 which has been resolved. As a byproduct of this analysis, #879 was created, so this can be closed.