UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/
80 stars 17 forks source link

extract_gtfs guessing column types? #74

Closed dhersz closed 3 years ago

dhersz commented 3 years ago

Hello @mpadge, I decided that opening this issue would be better than to keep this topic alive in our email exchange, which would make for some "crossed" conversation.

After you mentioned that I wasn't making some necessary conversions when reading the GTFS files (a very fair comment, since I indeed don't convert times to seconds and don't handle dates as integers, like you do here) I noticed that using waldo::compare() led to some weird comparisons between the objects generated by our reading functions. I think that's because some columns do not have their data types specified in extract_gtfs(). I'll paste the gist I produced below:

P.S. I'm not sure if this behaviour is desired. If it is, please ignore this issue.

For this demonstration I’ll use a dataset included in gtfstools.

    library(gtfstools)
    library(gtfsrouter)

    data_path <- system.file("extdata/poa_gtfs.zip", package = "gtfstools")

    gttools <- read_gtfs(data_path)
    grouter <- extract_gtfs(data_path)

    waldo::compare(gttools, grouter, x_arg = "gttools", y_arg = "grouter")

    ## `class(gttools)`: "dt_gtfs"       
    ## `class(grouter)`: "gtfs"    "list"
    ## 
    ## `attr(gttools, 'validation_result')` is an S3 object of class <data.table/data.frame>
    ## `attr(grouter, 'validation_result')` is absent
    ## 
    ## `attr(gttools, 'filtered')` is absent
    ## `attr(grouter, 'filtered')` is a logical vector (FALSE)
    ## 
    ## `gttools$agency$agency_phone` is a character vector ('156')
    ## `grouter$agency$agency_phone` is an integer vector (156)
    ## 
    ## `gttools$calendar$start_date` is an S3 object of class <Date>
    ## `grouter$calendar$start_date` is an integer vector (20190415, 20190415, 20190415, 20190415, 20190415, ...)
    ## 
    ## `gttools$calendar$end_date` is an S3 object of class <Date>
    ## `grouter$calendar$end_date` is an integer vector (20190715, 20190715, 20190715, 20190715, 20190715, ...)
    ## 
    ## `gttools$calendar_dates$date` is an S3 object of class <Date>
    ## `grouter$calendar_dates$date` is an integer vector (20190419, 20190419, 20190419, 20190419, 20190419, ...)
    ## 
    ## `gttools$feed_info$feed_start_date` is an S3 object of class <Date>
    ## `grouter$feed_info$feed_start_date` is an integer vector (20190411)
    ## 
    ## `gttools$feed_info$feed_end_date` is an S3 object of class <Date>
    ## `grouter$feed_info$feed_end_date` is an integer vector (20190711)
    ## 
    ## `gttools$routes$route_desc` is a character vector ('', '', '', '', '', ...)
    ## `grouter$routes$route_desc` is a logical vector (NA, NA, NA, NA, NA, ...)
    ## 
    ## And 12 more differences ...

We can see that routes$route_desc is read as a character vector full of empty entrances (i.e. "") by read_gtfs(), while it’s read as a logical vector full of NAs by extract_gtfs(). agency$agency_phone is also read as a character by read_gtfs(), while it’s read as an integer by extract_gtfs(). According to the official GTFS reference, both this fields should be read as strings.

In fact, if I change these fields just a tad bit, extract_gtfs() reads it differently, which leads me to believe that it’s using fread() (or whichever function used to read these files) to guess the fields' types, instead of explicitly setting how they should be read.

To exemplify it, I’ll manually edit the poa_gtfs.zip file:

As you can see below, using extract_gtfs() to import the GTFS leads to different column types:

    poa2_path <- "./poa_gtfs2.zip"
    poa3_path <- "./poa_gtfs3.zip"

    grouter2 <- extract_gtfs(poa2_path)
    grouter3 <- extract_gtfs(poa3_path)

    grouter2$routes$route_desc[1:5]

    ## [1] "oi" ""   ""   ""   ""

    class(grouter2$routes$route_desc)

    ## [1] "character"

    grouter3$routes$route_desc[1:5]

    ## [1]  3 NA NA NA NA

    class(grouter3$routes$route_desc)

    ## [1] "integer"

    grouter$routes$route_desc[1:5]

    ## [1] NA NA NA NA NA

    class(grouter$routes$route_desc)

    ## [1] "logical"

Using gtfstools::read_gtfs() leads to results a bit more consistent, because I explicitly specify this column to be read as character:

    gttools2 <- read_gtfs(poa2_path)
    gttools3 <- read_gtfs(poa3_path)

    gttools2$routes$route_desc[1:5]

    ## [1] "oi" ""   ""   ""   ""

    class(gttools2$routes$route_desc)

    ## [1] "character"

    gttools3$routes$route_desc[1:5]

    ## [1] "3" ""  ""  ""  ""

    class(gttools3$routes$route_desc)

    ## [1] "character"

    gttools$routes$route_desc[1:5]

    ## [1] "" "" "" "" ""

    class(gttools$routes$route_desc)

    ## [1] "character"

Cheers!

mpadge commented 3 years ago

Thanks @dhersz - great that these conversations are continuing in issues, where they can be better focussed and structured, and most importantly they are publicly available. Great catch on your part there, especially given that I claimed in emails that your software wasn't confirming and implied mine was - it actually was gtfsrouter that wasn't conforming. That commit fixes that by adding a new file which ensures that all column types in all GTFS tables conform to the actual standards. (This is obviously imperfect, because R really doesn't do types as well as many other languages, but it's the best that can be done here.) You or any of the r-transit folk should feel free to copy that file straight across for your own usage. Direct link to file containing GTFS column definitions.

Ping @tbuckl, @polettif

mpadge commented 3 years ago

@dhersz Times are still ultimately translated to integers in gtfsrouter, which contradicts the standards. We should probably all ensure that our software documents any points at which column types diverge from standards, because that'll be the best way to understand exactly where representations obtained from our various packages actually differ. I'll re-open this until I've done that.

mpadge commented 3 years ago

Documentation now states:

#' @note Column types in each table of the returned object conform to GTFS
#' standards (\url{https://developers.google.com/transit/gtfs/reference}),
#' except that "Time" fields in the "stop_times" table are converted to integer
#' values, rather than as character or "Time" objects ("HH:MM:SS"). These can be
#' converted back to comply with GTFS standards by applying the `hms::hms()`
#' function to the two time columns of the "stop_times" table.

Obviously we should all do things like that, but perhaps even more usefully, and a question for all (@dhersz, @tbuckl, @polettif, @rafapereirabr):

That standards are encoded in the file I linked to above, so we can all use that, and then, for example, gtfsrouter would have a gtfs_standard() function that does the conversions described above. There is no underlying difference in between data.table and other representations (like from tidytransit) - all are stored in memory in the same way, and lists of columns, so this should work pretty seamlessly (data.table would stay data.table, same for data.frame and whatever else).

polettif commented 3 years ago

I think a "implementation standard" is a good idea, especially since your gtfs-reference-fields.R is much easier to read than tidytransit's spec.R.

The standard is easy to implement if we keep types to characters and integer/numeric. However, those are not really usable for dates and time columns. tidytransit converts dates as Date objects and there's set_hms_times that can be used on a feed. Integers are more suitable for calcuations (raptor uses them instead of hms) but for a user they're hard to understand and work with. Is this the main point of discussion, how to handle times and dates?

mpadge commented 3 years ago

Yeah @polettif, I would imagine that would be the main point of discussion. The types in the standard are very clear, but R knows nothing about ENUM types, and has so many different ways of handling Date representations, which is all the more reason why having such standards is important.

More importantly: this is obviously not the best place for such a discussion. I'd suggest initially a wiki, as well as using github Discussions. I would also suggest that r-transit would be the obvious go-to point for both of those. Do you want to open up discussions there, and maybe start a wiki on standards for representing GTFS feeds in R? Or even start a new docs-only repo specifying the standards in a README? Whatever seems best to you. If you agree to hosting it there, I'll be happy to contribute once something is up. Thanks!

dhersz commented 3 years ago

@mpadge Great job with the gtfs-reference-fields. I do something similar in gtfstools using a internal object (gtfs_metadata), which is heavily based on tidytransit's spec.R but organized a bit differently, but yours is much easier to read, as @polettif said. The main difference between this 3 objects/functions is that tidytransit's and gtfstools' also set which columns and files are required/optional, but that's a quick addition to gtfs-reference-fields if so you wish.

I like the idea of agreeing on standards, that could make interoperability between packages much easier. I agree with @polettif that it seems natural to keep fields as characters and integers, but perhaps creating a github discussion is a better way to structure this conversation.

rafapereirabr commented 3 years ago

Hi all. Following the original motivation behind the GTFS standard, I like the idea of having a common standard for representing GTFS feeds in R. The gtfs_reference_fields by @mpadge is a great contribution in this direction. Thanks, Mark!

Nonetheless, I also understand that some packages might deviate a bit from the stardand in order to get better performance. In our case, read_gtfs{gtfs2gps} would deviate from the standard in the following aspects:

  1. we only read a few columns that are essential for the purposes of the package
  2. we read all columns as as.character
  3. finally, we convert the start_time and end_time columns to data.table::as.ITime

I understand that gtfs2gps is a peculiar package with a very specific focus, but I'd be glad to discuss how our package can better integrate with the GTFS community in R. Ping to my co-developers @pedro-andrade-inpe and @Joaobazzo.

ps. Thanks @dhersz and @mpadge for kicking off the discussion. I'm glad to keep the discussion here or elsewhere.

mpadge commented 3 years ago

Thanks for chipping in @rafapereirabr! Deviations are naturally welcome; the point of an agreed-upon standard is to provide a common reference against which you can very easily explain those deviations. We just need a home for the standards ... In the meantime I'm going to re-open this because of the issue raised by @dhersz that the spec as written here does not contain the types ("optional", "required", ...), which it should have. The following commit will re-arrange slightly to add those types to the return object.