ITSLeeds / UK2GTFS

Convert UK transport data (TransXchange / ATOC CIF) to GTFS format in R
https://itsleeds.github.io/UK2GTFS/
GNU General Public License v3.0
37 stars 13 forks source link

Merge changes from TfWM #58

Open oweno-tfwm opened 12 months ago

oweno-tfwm commented 12 months ago

Large number of changes in this fork - likely to result in some instability until they are bedded in.

1) Fix activity on LT records not being imported 2) Fix overlay rules not being applied correctly Accompanied by a reasonable number of unit tests for a range of overlay situations 3) Merge was only written for narrow use case of tables & fields expected in this product – so when it was used to merge a BODS extract with the output from UK2GTFS it resulted in tables and fields being thrown away. Now at a minimum maintains all input tables and fields, even if it doesn’t do any special processing on them 4) Nailed down data type definition rather than letting some of them being determined automatically, have done for all fields present in BODS GTFS, but because some of these are not present in CIF data, this generates a number of unit test warnings. 5) Write code also was for a specific use case of expected tables & fields, so when presented with merged BODS gtfs file threw data away, now writes all presented tables to it to the output. Have also changed the default options on this fn to the highest performing combination

This means you can do things like using it to convert CIF to CSV like this…

Read In each File

mca <- importMCA(
  file = "..\\RAIL-CIF\\RAIL-CIF.gz",
  silent = FALSE,
  working_timetable = FALSE,
  public_only = public_only
)
writeGtfs( mca, filename = "RAIL-CIF-CSV" )

6) Various existing filters within the code filter out services / calls for non-public use, but the control of these wasn’t exposed – now exposed up the call stack to be in control of caller. 7) Even after all that filtering some non-public services were leaking into output, (ECS moves) so since GTFS seems to be tolerant of extra fields, have exposed some extra fields, in the shape of routes.train_category, which then allows the consumer of the data to filter some more as they wish. 8) …and then in gtfs_clean() enhanced the ‘public only’ filtering to only retain train_category that are for public use (i.e. provides mechanism to filter out ECS moves) 9) Along a similar lines have added a trips.power_type field so it can be analysed which services are dirty diesel and which are lovely clean electric ones (I’ve not actually yet tried pumping these into OTP to see if it moans about the extra fields) 10) Added mode name onto trip long name so it’s “Bus from….” “Train from….” “Ship from…” 11) The new loadData functionality was getting triggered every time the package was loaded – including in every child thread when started. So have put in a fix to stop this getting activated when creating a child process.

 Still seems to be absolutely battering the API to get fresh data every time I build, think the caching may need a bit more work to look at some dates so it only ‘phones home’ at most once a month ?

12) Reworked the location load code to make it work with csv files, RDA files containing lat,long coordinates, and RDA files containing geometry objects. 13) Merged in the recent changes in main trunk (was a bit of a painful merge, think I’ve got it all, but may have made some errors) 14) Since this is a lot of change, some of which is likely to be breaking – version number uprev to 1.0

performance

15) Changed from data.frame to data.table in main atoc load / processing and merge. There is a high chance that this is a breaking change in some code paths – while I’ve got good confidence in the main NR code path because that’s the one I’m testing heavily – and also confidence in paths I’ve written units tests for (although they are mainly testing the success path and don’t probe edge cases) I’m relying on the unit tests for all the other code paths, so it’s quite likely something somewhere is broken by this change 16) About a dozen rounds of performance tuning – my initial cut after making all these changes didn’t perform well, so have done a lot of profiling and performance tuning on it to minimise time consuming malloc() and maximise vectorisation

So that’s a lot of change in one go, likely to be a bit of instability on code paths with low unit test coverage.

mem48 commented 12 months ago

Wow, thats quite the PR! THis may take some time to check.

oweno-tfwm commented 12 months ago

Yeah, probably easiest way is to review the unit tests first, then review the completely new file atoc_overlay.R , then pull down both branches and diff across them rather than the line-by-line view that the online diff tool tends to give, which is a bit myopic for changes on this scale.

CI build seems to be falling over on something related to update_data stuff, then doesn't seem to get any further into the unit tests in the github CI process, they work ok in R Studio.

_Warning in file(con, "w") : cannot open file '/home/runner/work/_temp/package/UK2GTFS/extdata/date.txt': No such file or directory Warning in value[3L] :

Running specific tests for package ‘UK2GTFS’ Running ‘testthat.R’ Error: Error: Failure in /home/runner/work/_temp/package/UK2GTFS/UK2GTFS-tests/testthat.Rout.fail_