ITSLeeds / UK2GTFS

Convert UK transport data (TransXchange / ATOC CIF) to GTFS format in R
https://itsleeds.github.io/UK2GTFS/
GNU General Public License v3.0

Invalid gtfs for latest ATOC - NA values and unknown stop_ids. #42

Open h-durham opened 2 years ago

h-durham commented 2 years ago

Hello, revisiting this after a while. I'm getting an invalid gtfs for the latest ATOC data.

Data was downloaded 23 Dec, filename is gtfs_ttis222.zip.

Are these easy to remedy?

> gtfs_validate_internal(gtfs)
Warning messages:
1: In gtfs_validate_internal(gtfs) : NA values in stops
2: In gtfs_validate_internal(gtfs) : Unknown stop_id in stop_times

In particular, we have the following NA values:

> gtfs$stops[rowSums(is.na(gtfs$stops)) > 0,]
     stop_id stop_code                    stop_name stop_lat stop_lon
3183 BSPSBUS      <NA> Bishops Lydeard Lydeard Arms 51.05540 -3.18876
3185  ABHLJN      <NA>           Abbeyhill Junction 55.95542 -3.17036
3207  DUNROD      <NA>                       Dunrod 55.91825 -4.84359
3568  NWTLWJ      <NA>               NEWTON WEST JN 55.81802 -4.14688
3617 SEVT730      <NA>     SEVERN TUNNEL SIG NT1730 51.58257 -2.80170
5030 CREWPLP      <NA>     CREWE UP & DN POTTERY LP 53.07933 -2.41846
5039 CREWUML      <NA>     CREWE UP MANCHESTER LOOP 53.09331 -2.43398
5182 CWLRSSJ      <NA>            COWLAIRS SOUTH JN 55.88094 -4.23906
5849 HETNLJN      <NA>              HEATON LODGE JN 53.67960 -1.71774
6131 DONCLCJ      <NA>            LOVERSALL CARR JN 53.48415 -1.07357
6388   HOLME      <NA>                    HOLME JN. 52.47127 -0.23738
6504 HORBRYJ      <NA>                   HORBURY JN 53.65918 -1.53116
7111 HAMBLEJ      <NA>            HAMBLETON EAST JN 53.77646 -1.14664
7112 HAMBLNJ      <NA>           HAMBLETON NORTH JN 53.78117 -1.15901
7230  EUSKJN      <NA>                 EAST USK JN. 51.58452 -2.96298
7262 HAUGHDJ      <NA>                 HAUGHHEAD JN 55.76989 -4.01281
7382 RTHGNEJ      <NA>           RUTHERGLEN EAST JN 55.82844 -4.19624
7457 STAN201      <NA>   STANSTED AIRPORT SIG L1201 51.88514  0.25569
7458 STANCLJ      <NA>     STANSTED COOPERS LANE JN 51.88661  0.25793
7940  STSNJN      <NA>                   STENSON JN 52.86534 -1.53425
8174 SWANSLW      <NA>            SWANSEA LOOP WEST 51.63791 -3.94316
8266 MRRYTNL      <NA>                ALLANTON LOOP 55.76693 -3.99792
8613 SHEETSJ      <NA>              SHEET STORES JN 52.88237 -1.27414
8897 SKELTON      <NA>           SKELTON JN. (YORK) 53.97117 -1.12073
9121   TRENT      <NA>                TRENT EAST JN 52.88501 -1.26475
9298   SOHAM      <NA>                        SOHAM 52.33420  0.32798
9304  SOKEJN      <NA>                    STOKE JN. 52.83906 -0.58012
9324  NWSTLP      <NA>                NEWSTEAD LOOP 53.07001 -1.22182
9939 WATSTJN      <NA>        Water Street Junction 53.47732 -2.25949
9954 NLRT478      <NA>    Northallerton Signal Y478 54.34730 -1.43854
9955  OXEN45      <NA>        Oxenholme Signal CE45 54.28966 -2.73614

And the following missing stop_ids:

> gtfs$stop_times[!(gtfs$stop_times$stop_id %in% gtfs$stops$stop_id),]
        trip_id arrival_time departure_time stop_id stop_sequence pickup_type drop_off_type
1155238   61637     16:45:00       16:49:00 SOHA491             4           0           0
1341949   58910     12:45:00       12:49:00 SOHA491             4           0           0
2476719   51599     31:45:00       32:37:00 WMBYEFR            17           0           0
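
Both checks above can be reproduced with base R alone. The following is a minimal sketch on toy data: the data frames are made up for illustration (only the column names follow GTFS, and EXAMPLE / EX is a hypothetical stop), and base R has no built-in "not in" operator, so the sketch negates %in% instead.

```r
# Toy GTFS tables; the values are illustrative, not taken from the real feed.
stops <- data.frame(
  stop_id   = c("SOHAM", "TRENT", "EXAMPLE"),
  stop_code = c(NA, NA, "EX"),
  stringsAsFactors = FALSE
)
stop_times <- data.frame(
  trip_id = c(1, 1, 2),
  stop_id = c("SOHAM", "SOHA491", "EXAMPLE"),
  stringsAsFactors = FALSE
)

# Rows of stops that contain an NA in any column.
stops_with_na <- stops[rowSums(is.na(stops)) > 0, ]

# Rows of stop_times whose stop_id has no match in stops.
orphan_stop_times <- stop_times[!(stop_times$stop_id %in% stops$stop_id), ]

print(stops_with_na)
print(orphan_stop_times)
```

On real data, `stops` and `stop_times` would be `gtfs$stops` and `gtfs$stop_times`.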

Thank you!

h-durham commented 2 years ago

Upon running the outputted gtfs.zip through OTP, I get

15:26:38.766 ERROR (OTPMain.java:46) An uncaught error occurred inside OTP: io error: entityType=org.onebusaway.gtfs.model.StopTime path=stop_times.txt lineNumber=1155239
org.onebusaway.csv_entities.exceptions.CsvEntityIOException: io error: entityType=org.onebusaway.gtfs.model.StopTime path=stop_times.txt lineNumber=1155239
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:161) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:108) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.module.GtfsModule.loadBundle(GtfsModule.java:239) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.module.GtfsModule.buildGraph(GtfsModule.java:130) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.GraphBuilder.run(GraphBuilder.java:80) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.standalone.OTPMain.startOTPServer(OTPMain.java:123) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.standalone.OTPMain.main(OTPMain.java:39) ~[otp-2.0.0-shaded.jar:1.1]
Caused by: org.onebusaway.gtfs.serialization.EntityReferenceNotFoundException: entity reference not found: type=org.onebusaway.gtfs.model.Stop id=SOHA491
    at org.onebusaway.gtfs.serialization.GtfsReader.getAgencyForEntity(GtfsReader.java:211) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.GtfsReader$GtfsReaderContextImpl.getAgencyForEntity(GtfsReader.java:302) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.mappings.EntityFieldMappingImpl$ConverterImpl.convert(EntityFieldMappingImpl.java:104) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.mappings.EntityFieldMappingImpl.translateFromCSVToObject(EntityFieldMappingImpl.java:61) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.IndividualCsvEntityReader.readEntity(IndividualCsvEntityReader.java:131) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.IndividualCsvEntityReader.handleLine(IndividualCsvEntityReader.java:98) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:157) ~[otp-2.0.0-shaded.jar:1.1]
    ... 8 common frames omitted

Line 1155239 of stop_times.txt reads:

61637,16:45:00,16:49:00,SOHA491,4,0,0

h-durham commented 2 years ago

Running gtfs_force_valid on this gtfs object does fix this error, but OTP then complains as follows:

12:55:17.555 ERROR (OTPMain.java:46) An uncaught error occurred inside OTP: io error: entityType=org.onebusaway.gtfs.model.Transfer path=transfers.txt lineNumber=2
org.onebusaway.csv_entities.exceptions.CsvEntityIOException: io error: entityType=org.onebusaway.gtfs.model.Transfer path=transfers.txt lineNumber=2
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:161) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:108) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.module.GtfsModule.loadBundle(GtfsModule.java:239) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.module.GtfsModule.buildGraph(GtfsModule.java:130) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.graph_builder.GraphBuilder.run(GraphBuilder.java:80) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.standalone.OTPMain.startOTPServer(OTPMain.java:123) ~[otp-2.0.0-shaded.jar:1.1]
    at org.opentripplanner.standalone.OTPMain.main(OTPMain.java:39) ~[otp-2.0.0-shaded.jar:1.1]
Caused by: org.onebusaway.gtfs.serialization.EntityReferenceNotFoundException: entity reference not found: type=org.onebusaway.gtfs.model.Stop id=ASHFKI
    at org.onebusaway.gtfs.serialization.GtfsReader.getAgencyForEntity(GtfsReader.java:211) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.GtfsReader$GtfsReaderContextImpl.getAgencyForEntity(GtfsReader.java:302) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.mappings.EntityFieldMappingImpl$ConverterImpl.convert(EntityFieldMappingImpl.java:104) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.gtfs.serialization.mappings.EntityFieldMappingImpl.translateFromCSVToObject(EntityFieldMappingImpl.java:61) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.IndividualCsvEntityReader.readEntity(IndividualCsvEntityReader.java:131) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.IndividualCsvEntityReader.handleLine(IndividualCsvEntityReader.java:98) ~[otp-2.0.0-shaded.jar:1.1]
    at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:157) ~[otp-2.0.0-shaded.jar:1.1]
    ... 8 common frames omitted

I then ran gtfs_clean on the validated gtfs object; this resulted in the same error.

Noticing that the stop_id=ASHFKI is indeed present in data/tiplocs.rda, I forced an update with remotes::install_github("ITSleeds/UK2GTFS@ed0fd418d90b837fb689bd154cf40a0b95912be5") (the latest commit available) and the ASHFKI issue persists after running gtfs_force_valid again.

I note that this latter issue seems similar to https://github.com/ITSLeeds/UK2GTFS/issues/29.

danieljuschus commented 2 years ago

I have noticed the unknown stop_ids as well. I suppose SOHA491 should be SOHAM and WMBYEFR should be WMBY. I hope there is a better solution than changing those afterwards in stop_times.txt.

h-durham commented 2 years ago

Well, I finally got this working but it took a few extra steps.

To summarize what is surely not the most efficient way:

  1. Clone the repo and, in R/atoc.R, comment out this line so that all stops are kept. Build, install and import this local version by doing this. If you want to change the name of this library, change the Package: field in your local DESCRIPTION file.
  2. Use atoc2gtfs with standard args.
  3. Use gtfs_force_valid to get rid of the SOHA491 error.
  4. Sort out duplicates in gtfs$stops$stop_id. The ones I removed have these (stop_id, stop_code) pairs: (PYECRNR, PYE), (SESABUS, ZBU) and (ESJLEDS, XES). (I just removed these afterwards in stops.txt, as I am new to R.)
  5. Some transfers involve stop_ids that don't exist. Remove them with gtfs$transfers <- gtfs$transfers[gtfs$transfers$to_stop_id %in% gtfs$stops$stop_id,] and gtfs$transfers <- gtfs$transfers[gtfs$transfers$from_stop_id %in% gtfs$stops$stop_id,]
  6. Export with gtfs_write.
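
Steps 4 and 5 can both be done in R rather than by editing stops.txt by hand. This is a minimal base R sketch on a toy list standing in for the gtfs object; the rows are illustrative, and keeping the first row of each duplicated stop_id is an assumption, since the thread does not say which duplicate was kept.

```r
# Toy stand-in for the gtfs object (illustrative values only).
gtfs <- list(
  stops = data.frame(
    stop_id   = c("PYECRNR", "PYECRNR", "SOHAM"),
    stop_code = c("PYE", "PYE", NA),
    stringsAsFactors = FALSE
  ),
  transfers = data.frame(
    from_stop_id = c("SOHAM", "ASHFKI", "SOHAM"),
    to_stop_id   = c("PYECRNR", "SOHAM", "ASHFKI"),
    stringsAsFactors = FALSE
  )
)

# Step 4: keep only the first row for each stop_id (assumption: first wins).
gtfs$stops <- gtfs$stops[!duplicated(gtfs$stops$stop_id), ]

# Step 5: keep transfers whose two ends both exist in stops.
known <- gtfs$stops$stop_id
keep <- gtfs$transfers$from_stop_id %in% known &
  gtfs$transfers$to_stop_id %in% known
gtfs$transfers <- gtfs$transfers[keep, ]

print(gtfs$stops)
print(gtfs$transfers)
```

Filtering on both ends in one pass gives the same result as the two separate assignments in step 5.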

mem48 commented 2 years ago

This can happen when ATOC add new TIPLOCs that are not in the package database.

The CIF files contain the locations of TIPLOCs, but they are often woefully inaccurate, so by default UK2GTFS uses an internal database, which you can inspect with:

library(UK2GTFS)
head(as.data.frame(tiplocs))

If you want to draw the TIPLOC locations from the CIF file you can use locations = "file" in atoc2gtfs or you can provide your own sf data frame of points.

I periodically update the database so I'll have a look and see if new locations are required.

mem48 commented 2 years ago

Also, a missing stop_code is quite common with the ATOC data and is not a problem, as stop_code is optional in the GTFS spec.

mem48 commented 2 years ago

I've pushed an update that will now check and pull in any missing tiplocs with a warning.

I've also added 10 new tiplocs to the database:

      stop_id stop_code                  stop_name stop_lat stop_lon
113   ACTONTN       ZAT                 ACTON TOWN 51.50273 -0.28114
1236   BRENTX       BCZ           BRENT CROSS WEST 51.56847 -0.22671
1564   BSTMNR       ZBM               BOSTON MANOR 51.49529 -0.32608
3485   ELINTN       ELT                EAST LINTON 55.98421 -2.66029
7020    MSBTN       MBT               MARSH BARTON 50.70419 -3.52228
8169   PTWYPR       PRI      PORTWAY PARK AND RIDE 51.48902 -2.68984
8427  RESTSTN       RSN                     RESTON 55.85015 -2.19483
10312 TOTNSSR       XSC   SALCOMBE SHADYCOMBE ROAD 50.71209 -3.78883
11585 ABARASQ       AER                  ABERAERON 52.24265 -4.25842
11685 CATZ016       LPD LUTON AIRPORT PARKWAY DART 51.87302 -0.39489
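
To illustrate the idea of pulling in missing TIPLOCs with a warning, here is a minimal base R sketch. It is not the package's actual implementation: `add_missing_stops` is a hypothetical helper, `db` stands in for the internal database, `fallback` stands in for locations read from another source such as the CIF file, and the fallback coordinates are made up.

```r
# Hypothetical helper, not part of UK2GTFS.
# db:       preferred locations (stands in for the internal database)
# fallback: locations from another source (e.g. parsed from the CIF file)
# needed:   stop_ids referenced by the timetable
add_missing_stops <- function(db, fallback, needed) {
  missing_ids <- setdiff(needed, db$stop_id)
  if (length(missing_ids) == 0) {
    return(db)
  }
  extra <- fallback[fallback$stop_id %in% missing_ids, , drop = FALSE]
  warning(
    "Adding ", nrow(extra), " stop(s) missing from the database: ",
    paste(extra$stop_id, collapse = ", ")
  )
  rbind(db, extra)
}

db <- data.frame(
  stop_id = "SOHAM", stop_lat = 52.33420, stop_lon = 0.32798,
  stringsAsFactors = FALSE
)
# Made-up fallback coordinates, for illustration only.
fallback <- data.frame(
  stop_id  = c("SOHAM", "SOHA491"),
  stop_lat = c(52.3, 52.3),
  stop_lon = c(0.3, 0.3),
  stringsAsFactors = FALSE
)

# Emits a warning naming SOHA491 and returns both stops.
stops <- add_missing_stops(db, fallback, c("SOHAM", "SOHA491"))
print(stops)
```

Stops already in `db` keep their database coordinates; only IDs absent from it are taken from the fallback.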

h-durham commented 2 years ago

Also missing stop_code is quite common with the ATOC data and is not a problem as they are optional in the GTFS spec

Noted. Not good of OTP to complain, then!

I haven't tested again, but I presume your last commit has fixed the ASHFKI issue? That was a stop_id that was present in the included tiplocs data frame but was prematurely filtered out, with this step:

  # remove any unused stops
  stops <- stops[stops$stop_id %in% stop_times$stop_id, ]
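
To show why this filter can break transfers, here is a minimal base R sketch on toy data. The tables are illustrative, and the alternative at the end (also counting stops referenced by transfers as used) is an assumption about one possible fix, not a description of what the package does.

```r
# Toy tables (illustrative values; UNUSED is a hypothetical stop).
stops <- data.frame(
  stop_id   = c("SOHAM", "ASHFKI", "UNUSED"),
  stop_name = c("SOHAM", "A transfer-only stop", "Never referenced"),
  stringsAsFactors = FALSE
)
stop_times <- data.frame(
  trip_id = 1, stop_id = "SOHAM",
  stringsAsFactors = FALSE
)
transfers <- data.frame(
  from_stop_id = "ASHFKI", to_stop_id = "SOHAM",
  stringsAsFactors = FALSE
)

# The filter quoted above: keeps only stops referenced by stop_times.
filtered <- stops[stops$stop_id %in% stop_times$stop_id, ]

# transfers now refers to a stop that has been dropped.
dangling <- setdiff(
  c(transfers$from_stop_id, transfers$to_stop_id),
  filtered$stop_id
)
print(dangling)

# One possible alternative: treat stops referenced by transfers as used too.
used <- unique(c(
  stop_times$stop_id, transfers$from_stop_id, transfers$to_stop_id
))
kept <- stops[stops$stop_id %in% used, ]
print(kept)
```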

Thank you for maintaining the library!

mem48 commented 2 years ago

Let me know if you have any more problems. Unfortunately, missing or bad data is quite common, and the only fix is for people to report it.

The package now tracks over 10,000 tiplocs, but there are only about 2,500 stations in the UK, which gives you an idea of how many temporary or intermittent ones are used for things like bus replacement services.