cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: Amtrak feed seems difficult to filter to a specific service date #545

Open edasmalchi opened 2 years ago

edasmalchi commented 2 years ago

Describe the bug: When attempting to filter for Thruway bus weekday service in CA, I seem to get way too many trips from the Amtrak GTFS feed. Each trip seems to have its own service_id, which may be complicating things.

To Reproduce: run the following query:

# Imports assumed from the Cal-ITP analyst setup; they were not part of the original snippet
from calitp.tables import tbl
from siuba import _, filter, inner_join, collect

df = (tbl.views.gtfs_schedule_fact_daily_service()
    >> filter(_.calitp_itp_id == 13)
    >> filter(_.service_date == '2021-10-28')  # a Thursday
    >> inner_join(_, tbl.gtfs_schedule.trips() >> filter(_.calitp_itp_id == 13), on = 'service_id')
    >> filter(_.route_id == '37329')  # this route_id seems to encompass many thruway bus trips in CA
    >> collect())
df

This Siuba code returns a df with what seems to be far too many trips for a weekday.

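Also see this notebook (https://github.com/cal-itp/data-analyses/blob/amtrak-thruway-validators/thruway_bus_validators/thruway_validators.ipynb), which provides additional context, including a table showing individual stops served hundreds of times by Thruway buses when in reality the number of daily trips is far lower.

Expected behavior: An Amtrak feed filtered to actual weekday trips only.

Additional context: See the discussion on the analysis issue (cal-itp/data-analyses#179).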

hunterowens commented 2 years ago

Just thinking aloud here: they might have a very weird calendar implementation, since services are often multi-day (i.e., a train that leaves LA takes 2-3 days to get to Seattle). Maybe @e-lo has thoughts?
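For reference, here is a minimal sketch of how the GTFS spec resolves which service_ids run on a given date: weekday flags and date ranges from calendar.txt, then per-date additions (exception_type 1) and removals (exception_type 2) from calendar_dates.txt. The function names and the sample rows are hypothetical and are not taken from the Amtrak feed or the warehouse tables; this only illustrates the rule, not how the daily service view is actually built.

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_gtfs_date(value):
    """Parse a GTFS YYYYMMDD string into a date."""
    return date(int(value[:4]), int(value[4:6]), int(value[6:8]))


def active_service_ids(calendar, calendar_dates, service_date):
    """Return the set of service_ids that run on service_date."""
    weekday = WEEKDAYS[service_date.weekday()]
    active = set()
    # Regular weekly pattern from calendar.txt (date range is inclusive)
    for row in calendar:
        start = parse_gtfs_date(row["start_date"])
        end = parse_gtfs_date(row["end_date"])
        if start <= service_date <= end and row[weekday] == "1":
            active.add(row["service_id"])
    # Exceptions from calendar_dates.txt: 1 adds service, 2 removes it
    for row in calendar_dates:
        if parse_gtfs_date(row["date"]) != service_date:
            continue
        if row["exception_type"] == "1":
            active.add(row["service_id"])
        elif row["exception_type"] == "2":
            active.discard(row["service_id"])
    return active


def weekly(service_id, days, start, end):
    """Build a hypothetical calendar.txt row running on the given days."""
    row = {"service_id": service_id, "start_date": start, "end_date": end}
    row.update({d: "1" if d in days else "0" for d in WEEKDAYS})
    return row


# Made-up sample data for illustration only
calendar = [
    weekly("A", WEEKDAYS[:5], "20211001", "20211231"),  # weekdays, in range
    weekly("B", WEEKDAYS[5:], "20211001", "20211231"),  # weekends only
    weekly("C", WEEKDAYS[:5], "20210901", "20211015"),  # weekdays, expired
    weekly("E", WEEKDAYS[:5], "20211001", "20211231"),  # removed below
]
calendar_dates = [
    {"service_id": "D", "date": "20211028", "exception_type": "1"},
    {"service_id": "E", "date": "20211028", "exception_type": "2"},
]

thursday_services = active_service_ids(calendar, calendar_dates, date(2021, 10, 28))
print(sorted(thursday_services))
```

If a feed gives every trip its own service_id, the same rule still applies; there are simply many more service_ids to resolve per date.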

e-lo commented 2 years ago

From your notebook:

> 380 daily trips for route_id 37329 seems like an error. Also that ID seems to serve many routes?

I'd agree. Can you map the shapes associated with it? I think they sometimes use route_ids for lots of service and as a way of separating what service is provided by which bus operator...

evansiroky commented 2 years ago

What @e-lo said seems to make sense. In looking at the data, it appears that almost every California-based thruway service is coded as route 37329. So, if you ignore the same route issue, I don't think this is a bug.
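Since most California Thruway service apparently shares one route_id, one way to tell the underlying services apart might be to group trips by their ordered stop pattern. The sketch below assumes stop_times-style rows with trip_id, stop_id, and stop_sequence fields; the function name and the sample rows are hypothetical and do not come from the Amtrak feed.

```python
from collections import defaultdict


def trips_by_stop_pattern(stop_times):
    """Group trip_ids by the ordered tuple of stop_ids each trip serves."""
    visits_by_trip = defaultdict(list)
    for row in stop_times:
        visits_by_trip[row["trip_id"]].append(
            (int(row["stop_sequence"]), row["stop_id"])
        )
    patterns = defaultdict(list)
    for trip_id, visits in visits_by_trip.items():
        # Sort by stop_sequence so row order in the input does not matter
        pattern = tuple(stop_id for _, stop_id in sorted(visits))
        patterns[pattern].append(trip_id)
    return {pattern: sorted(trip_ids) for pattern, trip_ids in patterns.items()}


# Made-up rows for illustration; the stop pairings are not real itineraries
stop_times = [
    {"trip_id": "t1", "stop_id": "LAX", "stop_sequence": "2"},
    {"trip_id": "t1", "stop_id": "BFD", "stop_sequence": "1"},
    {"trip_id": "t2", "stop_id": "BFD", "stop_sequence": "1"},
    {"trip_id": "t2", "stop_id": "LAX", "stop_sequence": "2"},
    {"trip_id": "t3", "stop_id": "SJC", "stop_sequence": "1"},
    {"trip_id": "t3", "stop_id": "SLV", "stop_sequence": "2"},
]

patterns = trips_by_stop_pattern(stop_times)
for pattern, trip_ids in patterns.items():
    print(" -> ".join(pattern), trip_ids)
```

Each distinct pattern would then stand in for what a rider thinks of as a separate route, even though the feed codes them all under the same route_id.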

edasmalchi commented 2 years ago

> From your notebook:
>
> > 380 daily trips for route_id 37329 seems like an error. Also that ID seems to serve many routes?
>
> I'd agree. Can you map the shapes associated with it? I think they sometimes use route_ids for lots of service and as a way of separating what service is provided by which bus operator...

There appears to be no associated shape_id for route 37329 trips. I did map stops and it looks like... statewide Thruway bus service.

[Screenshot, 2021-10-22: map of stops served by route 37329 trips]

> What @e-lo said seems to make sense. In looking at the data, it appears that almost every California-based thruway service is coded as route 37329. So, if you ignore the same route issue, I don't think this is a bug.

Agree that coding all this service as 37329 isn't necessarily a bug. The unreasonably high trip count when I attempt to filter for Thursday service does seem to be one, though. While some stops seem reasonable (21 trips/day at LA, 32 trips/day at Bakersfield), I don't think, say, 92 trips/day at Solvang or 274 trips/day at San Jose reflects reality.

route_id | stop_id | n
-- | -- | --
37329 | SJC | 274
37329 | SLV | 92
37329 | BFD | 32
37329 | LAX | 21
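One thing that might be worth ruling out is whether counts like these include repeated rows for the same trip, for example if a join matches a trip more than once. This is only a possibility, not a diagnosis of the query above. A minimal sketch, assuming joined rows that carry stop_id and trip_id (the function name and sample rows are hypothetical), compares the raw row count with the distinct trip count per stop:

```python
from collections import defaultdict


def stop_visit_summary(rows):
    """Per stop_id, return (joined row count, distinct trip_id count)."""
    row_counts = defaultdict(int)
    trips = defaultdict(set)
    for row in rows:
        row_counts[row["stop_id"]] += 1
        trips[row["stop_id"]].add(row["trip_id"])
    return {stop: (row_counts[stop], len(trips[stop])) for stop in row_counts}


# Made-up joined rows: trip t1 appears twice at SJC, as a fanned-out join would produce
rows = [
    {"stop_id": "SJC", "trip_id": "t1"},
    {"stop_id": "SJC", "trip_id": "t1"},
    {"stop_id": "SJC", "trip_id": "t2"},
    {"stop_id": "LAX", "trip_id": "t3"},
]

summary = stop_visit_summary(rows)
for stop, (n_rows, n_trips) in summary.items():
    flag = "  <- rows exceed distinct trips" if n_rows > n_trips else ""
    print(stop, n_rows, n_trips, flag)
```

If the two numbers match for every stop, the high counts reflect distinct trips in the feed rather than duplicated rows.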