UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/

Crash with specified date #31

Closed polettif closed 4 years ago

polettif commented 4 years ago

Running the following code with a fairly simple feed crashes RStudio:

# devtools::install_github("ATFutures/gtfs-router")
library(gtfsrouter)
download.file("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip", "routing.zip")
gtfs = extract_gtfs("routing.zip")
timetbl <- gtfs_timetable(gtfs, date = 20181001)


I can't really tell where the issue is, transfers.txt looks like this:

from_stop_id,to_stop_id,transfer_type,min_transfer_time
stop1a,stop1b,2,10
stop1b,stop1a,2,10
stop3a,stop3b,2,10
stop3b,stop3a,2,10
stop8a,stop8b,2,10
stop8b,stop8a,2,10

Or maybe it's an issue with the date. I don't know how dates are extracted in gtfsrouter. As I understand the gtfs_timetable documentation, the date parameter is only applied to calendar_dates.txt, since it says:

Some systems do not specify days of the week within their 'calendar' table; rather they provide full timetables for specified calendar dates via a 'calendar_date' table.

Compare that with tidytransit's approach, where a date_service_table is calculated to see which services (and thus trips) run on which date.

# install.packages("tidytransit")
library(tidytransit)
gtfs = read_gtfs("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip")
gtfs$calendar
#> # A tibble: 4 x 10
#>   service_id monday tuesday wednesday thursday friday saturday sunday start_date
#>   <chr>       <int>   <int>     <int>    <int>  <int>    <int>  <int> <date>    
#> 1 WEEK            1       1         1        1      1        0      0 2018-10-01
#> 2 EXPR            0       0         0        0      0        0      0 2018-10-01
#> 3 WEND            0       0         0        0      0        1      1 2018-10-01
#> 4 EMPT            1       0         1        1      0        1      0 2018-10-01
#> # … with 1 more variable: end_date <date>
gtfs$calendar_dates
#> # A tibble: 5 x 3
#>   service_id date       exception_type
#>   <chr>      <date>              <int>
#> 1 WEEK       2018-10-06              2
#> 2 WEEK       2018-10-07              2
#> 3 EXPR       2018-10-05              1
#> 4 EMPT       2018-10-02              1
#> 5 EMPT       2018-10-01              2

gtfs <- set_date_service_table(gtfs)
gtfs$.$date_service_table
#> # A tibble: 586 x 2
#>    date       service_id
#>    <date>     <chr>     
#>  1 2018-10-01 WEEK      
#>  2 2018-10-02 WEEK      
#>  3 2018-10-03 WEEK      
#>  4 2018-10-03 EMPT      
#>  5 2018-10-04 WEEK      
#>  6 2018-10-04 EMPT      
#>  7 2018-10-05 WEEK      
#>  8 2018-10-06 WEND      
#>  9 2018-10-06 EMPT      
#> 10 2018-10-07 WEND      
#> # … with 576 more rows
filtered_stop_times = filter_stop_times(gtfs, "2018-10-01", 0, 24*3600)

Created on 2020-06-25 by the reprex package (v0.3.0)

mpadge commented 4 years ago

Thanks @polettif - I can confirm that the system meltdown is reproducible. Will fix ASAP.

mpadge commented 4 years ago

Thanks @polettif - that just happened because the only services listed in the trips table are for service_id = "WEEK", but that date filters to service_id == "EMPT", giving an empty trip table. This was not anticipated, but was passed through to C++ code which expected some kind of non-NULL object, and so crashed.

library (gtfsrouter)
if (!file.exists ("routing.zip"))
    download.file("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip",
                  "routing.zip")
gtfs = extract_gtfs("routing.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

gtfs$calendar_dates
#>    service_id     date exception_type
#> 1:       WEEK 20181006              2
#> 2:       WEEK 20181007              2
#> 3:       EXPR 20181005              1
#> 4:       EMPT 20181002              1
#> 5:       EMPT 20181001              2 ### <=== This is the date entered in the call below
gtfs$trips # only has trips for 'service_id == "WEEK"'
#>    route_id service_id trip_id
#> 1:    lineA       WEEK routeA1
#> 2:    lineA       WEEK routeA2
#> 3:    lineB       WEEK  routeB
#> 4:    lineC       WEEK  routeC
#> 5:    lineD       WEEK routeD1
#> 6:    lineD       WEEK routeD2

timetbl <- gtfs_timetable(gtfs, date = 20181001)
#> Error in filter_by_date(gtfs_cp, date): The date restricts service_ids to [EMPT] yet there are not trips for those service_ids
timetbl <- gtfs_timetable(gtfs, date = 20181006) # but that works

Created on 2020-08-13 by the reprex package (v0.3.0)
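
The failure mode described above (an empty trip table reaching compiled code) can be guarded against at the R level before anything is handed on. The following is only a minimal sketch of that idea, not gtfsrouter's actual implementation: `filter_trips_checked` is a hypothetical helper, its error message is made up for illustration, and the toy `trips` table copies the one printed above.

```r
# Toy copy of the trips table printed above.
trips <- data.frame(
    route_id   = c("lineA", "lineA", "lineB", "lineC", "lineD", "lineD"),
    service_id = "WEEK",
    trip_id    = c("routeA1", "routeA2", "routeB", "routeC", "routeD1", "routeD2"),
    stringsAsFactors = FALSE
)

# Hypothetical helper (not part of gtfsrouter): keep only trips belonging to
# the given service_ids, and fail with an R error if nothing is left, so that
# an empty table is never passed on to lower-level code.
filter_trips_checked <- function(trips, service_ids) {
    keep <- trips$service_id %in% service_ids
    if (!any(keep)) {
        stop("No trips found for service_ids [",
             paste(service_ids, collapse = ", "), "]", call. = FALSE)
    }
    trips[keep, , drop = FALSE]
}

# All six trips belong to WEEK.
nrow(filter_trips_checked(trips, "WEEK"))

# EMPT has no trips, so this raises an ordinary R error; capture its message.
res <- tryCatch(filter_trips_checked(trips, "EMPT"),
                error = function(e) conditionMessage(e))
res
```

The point of the design is simply that the check happens in R, where a failure surfaces as a catchable error rather than a crashed session.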

polettif commented 4 years ago

Well, EMPT doesn't run on Monday 2018-10-01, since it's removed with exception_type 2 in calendar_dates.txt. However, WEEK runs on that date (as defined in calendar.txt), and it includes all trips in the feed:

library(tidytransit)

g = read_gtfs("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip")

g$calendar
#> # A tibble: 4 x 10
#>   service_id monday tuesday wednesday thursday friday saturday sunday start_date
#>   <chr>       <int>   <int>     <int>    <int>  <int>    <int>  <int> <date>    
#> 1 WEEK            1       1         1        1      1        0      0 2018-10-01
#> 2 EXPR            0       0         0        0      0        0      0 2018-10-01
#> 3 WEND            0       0         0        0      0        1      1 2018-10-01
#> 4 EMPT            1       0         1        1      0        1      0 2018-10-01
#> # … with 1 more variable: end_date <date>

g$trips
#> # A tibble: 6 x 3
#>   route_id service_id trip_id
#>   <chr>    <chr>      <chr>  
#> 1 lineA    WEEK       routeA1
#> 2 lineA    WEEK       routeA2
#> 3 lineB    WEEK       routeB 
#> 4 lineC    WEEK       routeC 
#> 5 lineD    WEEK       routeD1
#> 6 lineD    WEEK       routeD2

I don't know how services and dates are handled in gtfsrouter but IMO there's no way around creating a table from calendar and calendar_dates that links dates and service_ids. set_date_service_table does this for tidytransit and is used in filter_stop_times:

library(tidytransit)

g = read_gtfs("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip")
g <- set_date_service_table(g)
stop_times = filter_stop_times(g, "2018-10-01", 0, 24*3600)

head(stop_times[,1:5])
#>    trip_id arrival_time departure_time stop_id stop_sequence
#> 1: routeA1     07:00:00       07:00:00  stop1a             1
#> 2: routeA1     07:04:00       07:05:00   stop2             2
#> 3: routeA1     07:11:00       07:12:00  stop3a             3
#> 4: routeA1     07:40:00       07:40:00   stop4             4
#> 5: routeA2     07:05:00       07:05:00  stop1a             1
#> 6: routeA2     07:09:00       07:10:00   stop2             2

Created on 2020-08-13 by the reprex package (v0.3.0)
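
The linking of dates and service_ids described above can be sketched in base R. This is an illustration only, neither tidytransit's nor gtfsrouter's code: `services_on_date` is a hypothetical helper, and `end_date` is an assumed value because the printed calendar truncates that column. Regular services come from the weekday columns of calendar, then calendar_dates adds services (exception_type 1) and removes them (exception_type 2).

```r
# Toy copies of the tables shown in this thread.
calendar <- data.frame(
    service_id = c("WEEK", "EXPR", "WEND", "EMPT"),
    monday     = c(1, 0, 0, 1),
    tuesday    = c(1, 0, 0, 0),
    wednesday  = c(1, 0, 0, 1),
    thursday   = c(1, 0, 0, 1),
    friday     = c(1, 0, 0, 0),
    saturday   = c(0, 0, 1, 1),
    sunday     = c(0, 0, 1, 0),
    start_date = as.Date("2018-10-01"),
    end_date   = as.Date("2018-10-07"),  # ASSUMED: not visible in the output above
    stringsAsFactors = FALSE
)
calendar_dates <- data.frame(
    service_id     = c("WEEK", "WEEK", "EXPR", "EMPT", "EMPT"),
    date           = as.Date(c("2018-10-06", "2018-10-07", "2018-10-05",
                               "2018-10-02", "2018-10-01")),
    exception_type = c(2, 2, 1, 1, 2),
    stringsAsFactors = FALSE
)

# Hypothetical helper: which service_ids run on a given date?
services_on_date <- function(calendar, calendar_dates, date) {
    date <- as.Date(date)
    day_cols <- c("sunday", "monday", "tuesday", "wednesday",
                  "thursday", "friday", "saturday")
    day_col <- day_cols[as.POSIXlt(date)$wday + 1]  # wday: 0 = Sunday

    # 1. Regular services: weekday flag set and date within the validity range.
    regular <- calendar$service_id[
        calendar[[day_col]] == 1 &
        calendar$start_date <= date & calendar$end_date >= date
    ]

    # 2. Exceptions for this date: type 1 adds a service, type 2 removes it.
    on_date <- calendar_dates[calendar_dates$date == date, ]
    added   <- on_date$service_id[on_date$exception_type == 1]
    removed <- on_date$service_id[on_date$exception_type == 2]

    sort(setdiff(union(regular, added), removed))
}

services_on_date(calendar, calendar_dates, "2018-10-01")
#> [1] "WEEK"
services_on_date(calendar, calendar_dates, "2018-10-06")
#> [1] "EMPT" "WEND"
```

On 2018-10-01 both WEEK and EMPT are scheduled for Mondays, but the type 2 entry removes EMPT, leaving only WEEK, which is the service that carries all six trips.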

mpadge commented 4 years ago

Oh, that's easy - just required processing the 2 different exception_type values. Above commit now does that, with the following result:

library(gtfsrouter)
if (!file.exists ("routing.zip"))
        download.file("https://github.com/polettif/gtfs-test-feeds/raw/master/zip/routing.zip", "routing.zip")
gtfs = extract_gtfs("routing.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
timetbl <- gtfs_timetable(gtfs, date = 20181001)
head (timetbl$timetable)
#>    departure_station arrival_station departure_time arrival_time trip_id
#> 1:                 2               4          25200        25440       1
#> 2:                 4               6          25500        25860       1
#> 3:                 2               4          25500        25740       2
#> 4:                 2               9          25800        26100       3
#> 5:                 4               6          25800        26160       2
#> 6:                 6               8          25920        27600       1

Created on 2020-08-13 by the reprex package (v0.3.0)

Do you know of any feeds which use exception_type = 1 in calendar_dates? I'm not sure that would be appropriately handled by the current code, but it's hard to know without an actual example of how that could be used. My guess is that the flag can only be meaningfully used to remove a bunch of services via exception_type = 2, and then add back some specific ones via exception_type = 1. (Your example code is just a toy, and does not have entries in the trips table for the service_id values you've got in calendar and calendar_dates - real feeds which use those must have corresponding trips entries.)

polettif commented 4 years ago

Do you know of any feeds which use exception_type = 1 in calendar_dates?

This is an example: https://transitfeeds.com/p/reseau-de-transport-de-la-capitale/40 There are some feeds that only have calendar_dates, with all the dates specified, and no calendar. I haven't worked with one personally, but issues came up in another project ([1], [2]). These feeds normally use exception_type=1.

(Your example code is just a toy, and does not have entries in the trips table for the service_id values you've got in calendar and calendar_dates - real feeds which use those must have corresponding trips entries.)

You're absolutely right, I missed that having no trips for EXPR, WEND and EMPT leads to an invalid feed. However, I'd prefer to call it "test" instead of "toy" ;) I don't want to tell you how to implement date handling (it might have sounded that way, sorry); I just want to highlight possible pitfalls.

mpadge commented 4 years ago

All good - I really appreciate your help, and shall check out that example feed ASAP. Thanks for suggesting it! (And yeah, "test" is better than "toy" - sorry about my sloppy terminology there.) I'll re-open this issue to ensure the code appropriately handles all possible calendar <-> calendar_dates combinations.

mpadge commented 4 years ago

Thanks @polettif, that example seems to all work as expected with calendar_dates:

library(gtfsrouter)
gtfs <- extract_gtfs ("./rtc-gtfs.zip") # Quebec
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress =
#> FALSE): Found and resolved improper quoting in first 100 rows. If the fields are
#> not quoted (e.g. field separator does not appear within any field), try quote=""
#> to avoid this warning.
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress =
#> FALSE): Detected 1 column names but the data has 2 columns (i.e. invalid file).
#> Added 1 extra default column name for the first column which is guessed to be
#> row names or an index. Use setnames() afterwards if this guess is not correct,
#> or fix the file write command that created the file to create a valid file.
#> ✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

gtfs$calendar_dates [gtfs$calendar_dates$date == 20200201, ]
#>                  service_id     date exception_type
#> 1: 20200511multiint-0000010 20200201              1
# that gives the service_id for that calendar date

gtfs <- gtfs_timetable (gtfs) # errors as expected
#> Error: This appears to be a GTFS feed which uses a 'calendar_dates' table instead of 'calendar'.
#> Please first construct timetable for a particular date using 'gtfs_timetable(gtfs, date = <date>)'
#> See https://developers.google.com/transit/gtfs/reference/#calendar_datestxt for details.

gtfs <- gtfs_timetable (gtfs, date = 20200201)
gtfs$timetable
#>         departure_station arrival_station departure_time arrival_time trip_id
#>      1:               194            1910          18420        18480    2483
#>      2:              1910             217          18480        18480    2483
#>      3:               217             218          18480        18540    2483
#>      4:               218             219          18540        18600    2483
#>      5:               219             220          18600        18600    2483
#>     ---                                                                      
#> 124909:              4327            4329         101340       101460    1255
#> 124910:              4329            4331         101460       101520    1255
#> 124911:              4331            4104         101520       101580    1255
#> 124912:              4104            4105         101580       101640    1255
#> 124913:              4105            4189         101640       101700    1255
# timetable works

gtfs$trip_ids
#>                                trip_ids
#>    1: 66951554-20200511multiint-0000010
#>    2: 66950989-20200511multiint-0000010
#>    3: 66951027-20200511multiint-0000010
#>    4: 66951047-20200511multiint-0000010
#>    5: 66951011-20200511multiint-0000010
#>   ---                                  
#> 2527: 66950141-20200511multiint-0000010
#> 2528: 66951006-20200511multiint-0000010
#> 2529: 67140259-20200511multiint-0000010
#> 2530: 66952176-20200511multiint-0000010
#> 2531: 66951501-20200511multiint-0000010
# all trip_ids are of the specified service given above

Created on 2020-08-17 by the reprex package (v0.3.0)

I think that suffices to close this issue for now.