Closed dhersz closed 3 years ago
Thanks @dhersz - great that these conversations are continuing in issues, where they can be better focussed and structured, and most importantly they are publicly available. Great catch on your part there, especially given that I claimed in emails that your software wasn't confirming and implied mine was - it actually was gtfsrouter
that wasn't conforming. That commit fixes that by adding a new file which ensures that all column types in all GTFS tables conform to the actual standards. (This is obviously imperfect, because R really doesn't do types as well as many other languages, but it's the best that can be done here.) You or any of the r-transit
folk should feel free to copy that file straight across for your own usage. Direct link to file containing GTFS column definitions.
Ping @tbuckl, @polettif
@dhersz Times are still ultimately translated to integers in gtfsrouter
, which contradicts the standards. We should probably all ensure that our software documents any points at which column types diverge from standards, because that'll be the best way to understand exactly where representations obtained from our various packages actually differ. I'll re-open this until I've done that.
Documentation now states:
#' @note Column types in each table of the returned object conform to GTFS
#' standards (\url{https://developers.google.com/transit/gtfs/reference}),
#' except that "Time" fields in the "stop_times" table are converted to integer
#' values, rather than as character or "Time" objects ("HH:MM:SS"). These can be
#' converted back to comply with GTFS standards by applying the `hms::hms()`
#' function to the two time columns of the "stop_times" table.
Obviously we should all do things like that, but perhaps even more usefully, and a question for all (@dhersz, @tbuckl, @polettif, @rafapereirabr):
That standards are encoded in the file I linked to above, so we can all use that, and then, for example, gtfsrouter
would have a gtfs_standard()
function that does the conversions described above. There is no underlying difference in between data.table
and other representations (like from tidytransit
) - all are stored in memory in the same way, and lists of columns, so this should work pretty seamlessly (data.table
would stay data.table
, same for data.frame
and whatever else).
I think a "implementation standard" is a good idea, especially since your gtfs-reference-fields.R is much easier to read than tidytransit's spec.R.
The standard is easy to implement if we keep types to characters and integer/numeric. However, those are not really usable for dates and time columns. tidytransit converts dates as Date
objects and there's set_hms_times that can be used on a feed. Integers are more suitable for calcuations (raptor uses them instead of hms) but for a user they're hard to understand and work with. Is this the main point of discussion, how to handle times and dates?
Yeah @polettif, I would imagine that would be the main point of discussion. The types in the standard are very clear, but R knows nothing about ENUM
types, and has so many different ways of handling Date representations, which is all the more reason why having such standards is important.
More importantly: this is obviously not the best place for such a discussion. I'd suggest initially a wiki, as well as using github Discussions. I would also suggest that r-transit
would be the obvious go-to point for both of those. Do you want to open up discussions there, and maybe start a wiki on standards for representing GTFS feeds in R? Or even start a new docs-only repo specifying the standards in a README? Whatever seems best to you. If you agree to hosting it there, I'll be happy to contribute once something is up. Thanks!
@mpadge Great job with the gtfs-reference-fields. I do something similar in gtfstools using a internal object (gtfs_metadata), which is heavily based on tidytransit's spec.R but organized a bit differently, but yours is much easier to read, as @polettif said. The main difference between this 3 objects/functions is that tidytransit's and gtfstools' also set which columns and files are required/optional, but that's a quick addition to gtfs-reference-fields if so you wish.
I like the idea of agreeing on standards, that could make interoperability between packages much easier. I agree with @polettif that it seems natural to keep fields as characters and integers, but perhaps creating a github discussion is a better way to structure this conversation.
Hi all. Following the original motivation behind the GTFS standard, I like the idea of having a common standard for representing GTFS feeds in R. The gtfs_reference_fields
by @mpadge is a great contribution in this direction. Thanks, Mark!
Nonetheless, I also understand that some packages might deviate a bit from the stardand in order to get better performance. In our case, read_gtfs{gtfs2gps}
would deviate from the standard in the following aspects:
as.character
start_time
and end_time
columns to data.table::as.ITime
I understand that gtfs2gps
is a peculiar package with a very specific focus, but I'd be glad to discuss how our package can better integrate with the GTFS community in R. Ping to my co-developers @pedro-andrade-inpe and @Joaobazzo.
ps. Thanks @dhersz and @mpadge for kicking off the discussion. I'm glad to keep the discussion here or elsewhere.
Thanks for chipping in @rafapereirabr! Deviations are naturally welcome; the point of an agreed-upon standard is to provide a common reference against which you can very easily explain those deviations. We just need a home for the standards ... In the meantime I'm going to re-open this because of the issue raised by @dhersz that the spec as written here does not contain the types ("optional", "required", ...), which it should have. The following commit will re-arrange slightly to add those types to the return object.
Hello @mpadge, I decided that opening this issue would be better than to keep this topic alive in our email exchange, which would make for some "crossed" conversation.
After you mentioned that I wasn't making some necessary conversions when reading the GTFS files (a very fair comment, since I indeed don't convert times to seconds and don't handle dates as integers, like you do here) I noticed that using waldo::compare() led to some weird comparisons between the objects generated by our reading functions. I think that's because some columns do not have their data types specified in
extract_gtfs()
. I'll paste the gist I produced below:P.S. I'm not sure if this behaviour is desired. If it is, please ignore this issue.
For this demonstration I’ll use a dataset included in
gtfstools
.We can see that
routes$route_desc
is read as a character vector full of empty entrances (i.e. "") byread_gtfs()
, while it’s read as a logical vector full of NAs byextract_gtfs()
.agency$agency_phone
is also read as a character byread_gtfs()
, while it’s read as an integer byextract_gtfs()
. According to the official GTFS reference, both this fields should be read as strings.In fact, if I change these fields just a tad bit,
extract_gtfs()
reads it differently, which leads me to believe that it’s usingfread()
(or whichever function used to read these files) to guess the fields' types, instead of explicitly setting how they should be read.To exemplify it, I’ll manually edit the
poa_gtfs.zip
file:poa_gtfs2
has an “oi” in the first row of route_desc, and nothing else.poa_gtfs3
has a 3 in this same row.As you can see below, using
extract_gtfs()
to import the GTFS leads to different column types:Using
gtfstools::read_gtfs()
leads to results a bit more consistent, because I explicitly specify this column to be read as character:Cheers!