UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/

error handling - column headings #54

Closed: dcooley closed this issue 2 years ago

dcooley commented 3 years ago

Working with Melbourne's transit feed is not always straightforward, and this morning I ran into a particular issue that tripped me up for a few hours: one of the column headings contains a non-ASCII symbol.

So my request is: would it be worth adding a 'clean column names'-type function when extracting the feed, to try and handle this?

Here's the example

library(data.table)
library(gtfsrouter)

url <- "http://data.ptv.vic.gov.au/downloads/gtfs.zip"

temp_dir <- tempdir()
download.file(url, destfile = paste0(temp_dir, "/gtfs.zip") )
list.files( temp_dir )

unzip( paste0(temp_dir, "/gtfs.zip"), exdir = temp_dir )

## Melbourne GTFS is saved in folders numbered 1:11 for different modes
## <sigh>
list.files( temp_dir )
## 3 == tram
list.files(paste0(temp_dir, "/3"))
gtfs_tram <- extract_gtfs( paste0(temp_dir, "/3/google_transit.zip") )
# unlink(temp_dir)

I found this issue when running gtfsrouter:::join_service_id_to_stops(), as it gave me all NAs for trip_id

tram_stop_service <- gtfsrouter:::join_service_id_to_stops( gtfs_tram )
unique( tram_stop_service$services )
# [1] NA

So, diving into the stop_times object, I found the issue

dt_stoptimes <- copy( gtfs_tram$stop_times )
dput( dt_stoptimes[1, ] )

Copying this output back into my script highlighted the rogue character

[Screenshot: dput() output with the rogue character visible in the column name]
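One quick way to spot the offender programmatically (just an illustration, using base R functions) is to check the column names for non-ASCII bytes:

nms <- names( gtfs_tram$stop_times )
nms[ grepl( "[^ -~]", nms, useBytes = TRUE ) ]  # names containing non-ASCII bytes
charToRaw( nms[1] )                             # inspect the raw bytes of a suspect name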

This is fixed by the iconv() command.

setnames(
  gtfs_tram$stop_times
  , names( gtfs_tram$stop_times )
  , iconv( names( gtfs_tram$stop_times ), "latin1", "ASCII", sub = "" )
)

tram_stop_service <- gtfsrouter:::join_service_id_to_stops( gtfs_tram )
head(tram_stop_service )

# stop_id services
# 1:   19725    T0+a6
# 2:   19372    T0+a6
# 3:   19371    T0+a6
# 4:   19370    T0+a6
# 5:   19369    T0+a6
# 6:   18184    T0+a6

Is something like this worth adding to gtfsrouter?
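(By "something like this" I mean roughly the loop below, applied over every table in the feed; just a sketch, assuming the gtfs object behaves like a plain list of data.tables.)

## Sketch only: strip non-ASCII bytes from the column names of every table in the feed
for ( tbl in names( gtfs_tram ) ) {
  setnames(
    gtfs_tram[[ tbl ]]
    , names( gtfs_tram[[ tbl ]] )
    , iconv( names( gtfs_tram[[ tbl ]] ), "latin1", "ASCII", sub = "" )
  )
}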

mpadge commented 3 years ago

Yeah, absolutely worth adding, thanks @dcooley! Feel free to PR if you want, otherwise I'll get around to it within the next couple of weeks, I'd reckon. Melbourne is interesting because of all of the separate feeds from separate providers with no transfers (buses to trams, trams to trains, ...). I'd love somebody to take the new gtfs_transfer_table() function for a spin on that feed to see how it performs. Just in case that happens to entice you ... :smile:

dcooley commented 3 years ago

Yeah, I was trying gtfs_transfer_table() yesterday, but it wouldn't find any transfers. I dug into the code and kept finding that this line

nbs[which(!nbs %in% service_stops)]

returned 0 results on every iteration. I didn't go any further into it, but I did read through the other issue to see if I could understand the logic a bit better. I'll see if I can find some more time for it this weekend.
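(My reading of that line, illustrated with made-up stop ids, is that it keeps only the neighbours not already assigned to the service, so it comes back empty whenever every neighbour is already in service_stops.)

## Toy illustration with made-up stop ids
nbs <- c( "19725", "19372", "19371" )
service_stops <- c( "19725", "19372", "19371", "18184" )
nbs[ which( !nbs %in% service_stops ) ]
## character(0)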

mpadge commented 3 years ago

Or just put a wee reprex here, and I'll dig in with you

mpadge commented 3 years ago

Update: I can't reproduce your initial buggy behaviour. I don't get any rogue characters at all, and everything is read just fine. I'm sure bugs will arise in this regard, but they are more likely to come from non-UTF (or whatever) encodings in the names of agencies, routes, headsigns, and stops. So if something like your code were to be incorporated, I suspect it would need to be a comprehensive rectification of all potential encoding problems. A naive solution would be to iconv() all column names and all "name" columns, so that ability could be exposed through a helper function using latin1 as the default from-encoding, while allowing any user-specified encoding. Whaddayareckon?
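Something along these lines is roughly what I have in mind; a rough sketch only, nothing like this exists in the package yet, and the function name, the choice of name-like columns, and the defaults are all placeholders:

## Rough sketch -- function name, column selection, and defaults are placeholders
clean_gtfs_encoding <- function( gtfs, from = "latin1" ) {
  for ( tbl in names( gtfs ) ) {
    ## rectify the column names themselves
    setnames(
      gtfs[[ tbl ]]
      , names( gtfs[[ tbl ]] )
      , iconv( names( gtfs[[ tbl ]] ), from, "ASCII", sub = "" )
    )
    ## rectify values in name-like columns (stop_name, route_long_name, trip_headsign, ...)
    txt_cols <- grep( "name$|headsign$", names( gtfs[[ tbl ]] ), value = TRUE )
    for ( col in txt_cols ) {
      set( gtfs[[ tbl ]], j = col, value = iconv( gtfs[[ tbl ]][[ col ]], from, "UTF-8", sub = "" ) )
    }
  }
  invisible( gtfs )
}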

mpadge commented 3 years ago

This should hopefully all be addressed via #84, but I'll keep the issue open to make sure everything is okay.