Open csuyat-dot opened 3 weeks ago
As discussed during weekly meeting, this work is important and urgent. So I moving to our current iteration to work on it.
The URL for the xlsx file changes every month, so we will need someone to get the new URL and update it.
For new month_year columns on new files, currently we just need to add them to the yml file. Depending on the path we use for the URL change, this is just part of the changes on PR... see any new column and add them.
If we want it to be automatic, I could remove the columns definition, so any new columns would be imported. It would impact on the order of the columns that gets a little messy on External and Staging tables, also we cannot pre-define the typecast for each column the process created the typecast based on the data.
@evansiroky @csuyat-dot Would love to hear if you have any thoughts or preferences.
Cc @vevetron for comments too
Some notes:
DAG runs: schedule_interval: "0 10 *" # 10am UTC every day We don't need it to run everyday, so I think we can change it to once a week. Generating less data on the bucket and less time on the daily external table update.
Weird difference between NTD ID and Legacy NTD ID after the scrape process If NTD ID set to string on yml file it converts to "1.0" instead of "00001" like in I see on the xlsx file. Then to format we need to cast to NUMERIC and then to INT: format("%05d", CAST(CAST(ntd_id AS NUMERIC) AS INT64))
If NTD ID set to NUMERIC on yml file it converts correctly to "1" and we can use only one CAST AS INT64.
The Legacy NTD works correctly, converting to "0001", maybe because there are some cases with string values, then the conversion works correctly. I think the problem is on excel conversion, will create an issue for that and for date types that I noticed a problem when working on Agency Information.
Even though the information bellow come from the Complete Monthly Ridership with adjustments and estimates
file, it contains yearly information:
Cc @vevetron for comments too
Thank you! I was going to add him and end up forgetting to tag him.
@csuyat-dot @vevetron I have some questions about Mode
...
The Intermediate tables are currently returning only records with the following modes:
AG
, AR
, CB
, CC
, CR
, DR
, FB
, HR
, IP
, LR
, MB
, MG
, MO
, OR
*, PB
, RB
, SR
, TB
, TR
, VP
, YR
The dim_monthly_ridership_with_adjustments is created from those intermediate tables, but it has the rules for modes:
AG
, AR
, CB
, CC
, CR
, FB
, HR
, IP
, IP
, LR
, MB
, MG
, MO
, RB
, SR
, TB
, TR
, YR
) the service_type = Fixed Route
DR
, DT
, VP
, JT
, PB
) the service_type
= Demand Response
service_type = Unknown
QUESTION 1 For the mode OR
the service_type = Unknown
.
There is no description for OR
on ntd_modes_to_full_names.csv, so maybe this is the reason, but I want to confirm.
QUESTION 2 DT
and JT
are described as Demand Response
on dim_monthly_ridership_with_adjustments
but since those two are not part of the list on the intermediate tables they will never show up.
Also there is no records on external tables with those modes. So can we remove DT
and JT
from that Case
clause?
The existing modes on Complete Monthly Ridership with adjustments and estimates
are:
AG
, AR
, CB
, CC
, CR
, DR
, FB
, HR
, IP
, LR
, MB
, MG
, MO
, OR
, PB
, RB
, SR
, TB
, TR
, VP
, YR
QUESTION 3 There column year is used for anything?
We don't have this same column on the new external monthly tables.
Looks like this column year
just displays 2024 that is the year of the file imported. So could be removed or getting the year from other available dates like _dt
(date the data was imported).
@erikamov here are my thoughts, though im not too familiar with modes
Q1: there was an old thread on Slack regarding OR
. Nobody could totally confirm what OR means, but we are guessing it means Other
. When looking at the Complete Monthly Ridership with adjustments and estimates
excel report, Mode: OR
is accompanied by the column 3 mode
and Other
. So maybe change Unknown
to Other
to be more consistent?
Q2: im personally unsure about removing/keeping DT
and JT
. My gut feeling is to keep both in the rare chance an agency does use that mode.
Q3: i dont use that year
column for my report site. I am unsure if year
is used anywhere else, but others can weigh in on it.
Note for future references:
External monthly ridership tables have some records with empty NTD_ID for modes: Reduced Reporters
, Rolling 12-Month Sum
, Rolling 12-Month Sum with Reduced and Rural Reporter Estimates
, Rolling 12-Month Sum with Reduced Reporters
, Rural Reporters
, Subtotal with Reduced and Rural Reporter Estimates
, Subtotal with Reduced Reporters
, Total
.
They are lines from the summary added to the bottom of each monthly data worksheet inside the Complete Monthly Ridership with adjustments and estimates
xlsx file for estimated monthly industry totals for Rural reporters.
So I am adding filters to staging tables to only return records where NTD_ID IS NOT NULL
.
Intermediate and Mart tables are based on those staging tables so they will be filtering out too.
Remake completed:
dim_monthly_NTD_ridership_with_adjustments
was renamed to dim_monthly_ridership_with_adjustments
dim_monthly_ridership_with_adjustments
is now using the new NTD ridership tablesRemaining work:
sync_ntd_data_xlsx
schedule to once a week instead of every daySchedule changed and dropped tables. Waiting to check if the sync_ntd_data_xlsx DAG will run only next Monday.
User story / feature request
As a data analyst, I want
mart_ntd.dim_monthly_ntd_ridership_with_adjustments
remade as a new table that uses data fromexternal_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt
mart_ntd.dim_monthly_ntd_ridership_with_adjustments
data is used in the Monthly NTD Ridership Report. Previously, this table was manually updated when FTA publishes new data from the complete monthly ridership report. However, the FTA report is in a wide format (a new column is added every month).dim_monthly_ntd_ridership_with_adjustment
converted the FTA data to a long format which works well with the report site scripts.The new
external_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt
ingest data in initial the wide format, which is not compatible with the report site scripts.Acceptance Criteria
A model as close as possible to initial dim_monthly_ntd_ridership_with_adjustments model where each rows is an agency's UPT for that month-year
Notes