cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Change calitp_extracted_at/calitp_deleted_at to timestamps (rather than dates) #1042

Closed lauriemerrell closed 2 years ago

lauriemerrell commented 2 years ago

I think we should introduce the concept of timestamping when each static GTFS feed was downloaded in the daily pipeline. Since we download the RT data more frequently, validation should switch over to whatever static feed was available in our downloaded archive at that point in time. The daily downloads occur at midnight UTC each night. So, for example, if we're validating Feb 3 RT data, we would use the static data downloaded on Feb 2 up until midnight UTC, at which point we would switch to validating with the Feb 3 static data.

This strategy should set us up well for any future plans to ingest the static GTFS data at even more frequent intervals, and it begins to follow a streaming pattern more than a batch-processing pattern.

Originally posted by @evansiroky in https://github.com/cal-itp/data-infra/issues/902#issuecomment-1031831564

I think to accomplish the goal that Evan outlines above, we would need to change calitp_extracted_at and calitp_deleted_at to be timestamps, rather than dates, throughout the GTFS schedule pipeline.

This will be a nontrivial effort, but as far as I can see it is the correct way to handle this.
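
As a rough illustration of the lookup this would enable, here is a minimal Python sketch, not the pipeline's actual code. It assumes each static feed version carries calitp_extracted_at and calitp_deleted_at as timezone-aware UTC timestamps forming a half-open validity interval, with None meaning "still current". The StaticFeedVersion class, the static_version_for helper, and the version labels are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence

UTC = timezone.utc


@dataclass(frozen=True)
class StaticFeedVersion:
    """Hypothetical record for one downloaded static feed version."""
    label: str
    calitp_extracted_at: datetime           # when this version was downloaded
    calitp_deleted_at: Optional[datetime]   # when it was superseded; None = still current


def static_version_for(
    versions: Sequence[StaticFeedVersion], rt_time: datetime
) -> Optional[StaticFeedVersion]:
    """Return the version whose [extracted_at, deleted_at) interval contains rt_time."""
    for version in versions:
        started = version.calitp_extracted_at <= rt_time
        not_ended = (
            version.calitp_deleted_at is None
            or rt_time < version.calitp_deleted_at
        )
        if started and not_ended:
            return version
    return None  # no static feed had been downloaded yet at rt_time


# Assumed example data: two daily downloads taken at midnight UTC.
versions = [
    StaticFeedVersion(
        "download-2022-02-02",
        datetime(2022, 2, 2, tzinfo=UTC),
        datetime(2022, 2, 3, tzinfo=UTC),
    ),
    StaticFeedVersion(
        "download-2022-02-03",
        datetime(2022, 2, 3, tzinfo=UTC),
        None,
    ),
]

# An RT message just before midnight UTC is validated against the Feb 2 download.
before_switch = static_version_for(versions, datetime(2022, 2, 2, 23, 59, tzinfo=UTC))
# An RT message at or after midnight UTC switches to the Feb 3 download.
after_switch = static_version_for(versions, datetime(2022, 2, 3, 0, 0, tzinfo=UTC))
```

With date-only columns the two RT timestamps above could not be separated by anything finer than a calendar day, which is why the columns would need to become timestamps if downloads ever happen more than once a day.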

evansiroky commented 2 years ago

To be clear, this is a very long-term idea that is not yet ready for prime time. We are still planning on maintaining and supporting the extracted_at / deleted_at paradigm for the foreseeable future. In a very limited number of cases, typically when we are creating new code and DAGs, we can try to be mindful of orienting the code around timestamps, but that should not be the focus at this time. To avoid any further confusion, I'm going to edit my comment to remove the bigger-picture idea and close this issue until a later time when a larger refactor is needed, if it ever is.