cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 13 forks source link

Remake mart_ntd.dim_monthly_ntd_ridership_with_adjustments with new data #3519

Open csuyat-dot opened 3 weeks ago

csuyat-dot commented 3 weeks ago

User story / feature request

As a data analyst, I want mart_ntd.dim_monthly_ntd_ridership_with_adjustments remade as a new table that uses data from external_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt

mart_ntd.dim_monthly_ntd_ridership_with_adjustments data is used in the Monthly NTD Ridership Report. Previously, this table was manually updated when FTA publishes new data from the complete monthly ridership report. However, the FTA report is in a wide format (a new column is added every month). dim_monthly_ntd_ridership_with_adjustment converted the FTA data to a long format which works well with the report site scripts.

The new external_ntd__ridership.historical__complete_monthly_ridership_with_adjustments_and_estimates__upt ingest data in initial the wide format, which is not compatible with the report site scripts.

Acceptance Criteria

A model as close as possible to initial dim_monthly_ntd_ridership_with_adjustments model where each rows is an agency's UPT for that month-year

Notes

erikamov commented 2 weeks ago

As discussed during weekly meeting, this work is important and urgent. So I moving to our current iteration to work on it.

erikamov commented 2 weeks ago

The URL for the xlsx file changes every month, so we will need someone to get the new URL and update it.

For new month_year columns on new files, currently we just need to add them to the yml file. Depending on the path we use for the URL change, this is just part of the changes on PR... see any new column and add them.

If we want it to be automatic, I could remove the columns definition, so any new columns would be imported. It would impact on the order of the columns that gets a little messy on External and Staging tables, also we cannot pre-define the typecast for each column the process created the typecast based on the data.

@evansiroky @csuyat-dot Would love to hear if you have any thoughts or preferences.

evansiroky commented 2 weeks ago

Cc @vevetron for comments too

erikamov commented 2 weeks ago

Some notes:

erikamov commented 2 weeks ago

Cc @vevetron for comments too

Thank you! I was going to add him and end up forgetting to tag him.

erikamov commented 2 weeks ago

@csuyat-dot @vevetron I have some questions about Mode...

The Intermediate tables are currently returning only records with the following modes:

AG, AR, CB, CC, CR, DR, FB, HR, IP, LR, MB, MG, MO, OR*, PB, RB, SR, TB, TR, VP, YR

The dim_monthly_ridership_with_adjustments is created from those intermediate tables, but it has the rules for modes:

QUESTION 1 For the mode OR the service_type = Unknown. There is no description for OR on ntd_modes_to_full_names.csv, so maybe this is the reason, but I want to confirm.

QUESTION 2 DT and JT are described as Demand Response on dim_monthly_ridership_with_adjustments but since those two are not part of the list on the intermediate tables they will never show up. Also there is no records on external tables with those modes. So can we remove DT and JT from that Case clause?

The existing modes on Complete Monthly Ridership with adjustments and estimates are: AG, AR, CB, CC, CR, DR, FB, HR, IP, LR, MB, MG, MO, OR, PB, RB, SR, TB, TR, VP, YR

QUESTION 3 There column year is used for anything? We don't have this same column on the new external monthly tables. Looks like this column year just displays 2024 that is the year of the file imported. So could be removed or getting the year from other available dates like _dt (date the data was imported).

csuyat-dot commented 2 weeks ago

@erikamov here are my thoughts, though im not too familiar with modes

Q1: there was an old thread on Slack regarding OR. Nobody could totally confirm what OR means, but we are guessing it means Other. When looking at the Complete Monthly Ridership with adjustments and estimates excel report, Mode: OR is accompanied by the column 3 mode and Other. So maybe change Unknown to Other to be more consistent?

Q2: im personally unsure about removing/keeping DT and JT. My gut feeling is to keep both in the rare chance an agency does use that mode.

Q3: i dont use that year column for my report site. I am unsure if year is used anywhere else, but others can weigh in on it.

erikamov commented 1 week ago

Note for future references:

External monthly ridership tables have some records with empty NTD_ID for modes: Reduced Reporters, Rolling 12-Month Sum, Rolling 12-Month Sum with Reduced and Rural Reporter Estimates, Rolling 12-Month Sum with Reduced Reporters, Rural Reporters, Subtotal with Reduced and Rural Reporter Estimates, Subtotal with Reduced Reporters, Total.

They are lines from the summary added to the bottom of each monthly data worksheet inside the Complete Monthly Ridership with adjustments and estimates xlsx file for estimated monthly industry totals for Rural reporters.

Image

So I am adding filters to staging tables to only return records where NTD_ID IS NOT NULL. Intermediate and Mart tables are based on those staging tables so they will be filtering out too.

erikamov commented 1 week ago

Remake completed:

Remaining work:

erikamov commented 3 days ago

Schedule changed and dropped tables. Waiting to check if the sync_ntd_data_xlsx DAG will run only next Monday.