Open tiffanychu90 opened 1 day ago
Thanks for the thorough writeup! My first impression is that going back to the last cached date would be preferable but happy to help brainstorm more.
Stuff like stops/routes are relatively static, and it seems better to have complete data for an operator minus, perhaps, the most recent service change vs. no/incomplete data... For RT, maybe better to have it go a few months stale vs. running an off-cycle date?
Perhaps as part of this tooling we can add a separate alert/reporting mechanism if we have nothing for an operator for, say, 6 months?
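A rough sketch of what that staleness alert could look like, in case it helps the brainstorm. Everything here is an assumption for illustration: the operator names, the dates, the `last_available` mapping, and the helper names are hypothetical, not existing pipeline code.

```python
from datetime import date

# Hypothetical input: the last analysis date with data for each operator.
last_available = {
    "Operator A Schedule": date(2024, 3, 13),
    "Operator B Schedule": date(2024, 9, 18),
}


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from earlier to later."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not complete yet
    return months


def stale_operators(last_available, analysis_date, max_months=6):
    """Operators whose newest data is at least max_months older than analysis_date."""
    return sorted(
        name
        for name, last_date in last_available.items()
        if months_between(last_date, analysis_date) >= max_months
    )


flagged = stale_operators(last_available, date(2024, 9, 18))
print(flagged)
```

The 6-month threshold is just the number floated above; `max_months` is left as a parameter so it can be tuned once we see how far back the patching actually goes.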
@edasmalchi: Ok! Let me try to get this for Sep open data + yaml produced to track what's there, and we can iterate from there? I'm curious how many operators we'll be patching and how far back, but hopefully this means Sep's hqta data will definitely have Long Beach.
Where does your feature apply?
Select from the below, and be sure to affix the appropriate label to this issue (e.g. `dataset`, `jupyterhub`, `metabase`, `analysis.calitp.org`)

Is your feature request related to a problem? Please describe.
Our single day snapshots that support our analytics pipeline can be subject to missing operators. This is expected, as day to day, feeds can be missing for a short period and come back soon thereafter. For users, this can prove to be frustrating as operators appear and disappear.

Describe the solution you'd like
We'll keep our analytics pipeline as is, pulling the single day and running it through. Except, let's add 2 things to help us fill in the blanks:
- `schedule_gtfs_dataset_name` and (last available) `analysis_date`. Use this to check whether we're missing anyone, and if we are, we can pull from an earlier cached date of the processed results.
- `dataset_name_date`, and now we'd have a version that is `dataset_name_date(patched)`
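
To make the two pieces above concrete, here is a minimal sketch of the check-and-patch step using plain dictionaries. It is only an illustration under stated assumptions: the operator names, dates, the shape of `cached_results`, and the function `patch_single_day` are all hypothetical, and the real pipeline would presumably work on the cached processed outputs rather than in-memory dicts.

```python
from datetime import date

# Hypothetical tracking table: schedule_gtfs_dataset_name -> last available analysis_date.
last_available = {
    "Operator A Schedule": date(2024, 8, 14),
    "Operator B Schedule": date(2024, 9, 18),
}

# Hypothetical cache of processed results: analysis_date -> {operator: rows}.
cached_results = {
    date(2024, 8, 14): {
        "Operator A Schedule": ["stop_1", "stop_2"],
        "Operator B Schedule": ["stop_9"],
    },
    date(2024, 9, 18): {
        "Operator B Schedule": ["stop_9", "stop_10"],
    },
}


def patch_single_day(analysis_date, cached_results, last_available):
    """Return (patched results, {operator: date used}) for one analysis date."""
    current = cached_results.get(analysis_date, {})
    patched = dict(current)  # copy, so the cached single-day snapshot is untouched
    patched_from = {}
    for name, last_date in last_available.items():
        if name in current:
            continue  # operator is present in the single-day snapshot
        earlier = cached_results.get(last_date, {})
        # Only backfill from a date on or before the one being patched.
        if last_date <= analysis_date and name in earlier:
            patched[name] = earlier[name]
            patched_from[name] = last_date
    return patched, patched_from


patched, patched_from = patch_single_day(
    date(2024, 9, 18), cached_results, last_available
)
print(sorted(patched))
print(patched_from)
```

Returning `patched_from` alongside the data is one way to keep a record of which operators were filled in and from which date, so the patched version stays distinguishable from the original single-day pull.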

Describe alternatives you've considered
We want to consider the following points:
- `shared_utils/rt_dates` as the list of all dates we support with all the intermediate outputs in `gtfs_analytics_data.yaml` saved.
- `gtfs_analytics_data.yml` data catalog is to know which dates are fully supported across all the analytics work, and that we can combine all those sources easily for a given day

Additional context