Closed Robinlovelace closed 1 year ago
Aha the benchmark is convincing. Are you sure you're comparing with the most recent dev version of cyclestreets
? I guess so. Impressive! I wouldn't want to merge until the outputs are the same...
Heads-up @mem48 another way to speed it up: return less data. That's what the latest version batch()
does.
Are you sure you're comparing with the most recent dev version of cyclestreets?
I compared it against the current cyclestreets/master
I wouldn't want to merge until the outputs are the same
It is safe to merge as a separate function. But the same output would be ideal. Two main differences are
1) batch_read2 returns more columns. I don't think there is a performance gain in removing data. As my code pulls all the data from JSON in one step. I'd have to add code to remove columns which would add time. And can easily be done outside of the function. It makes more sence to me that a function that reads a file simply returns everything.
2) batch_read2 returns fewer rows. I think that batch_read is introducing a small number of duplicates
> summary(duplicated(r1$geometry))
Mode FALSE TRUE
logical 35798 169955
> summary(duplicated(r2$geometry))
Mode FALSE TRUE
logical 35798 169924
All the geometry in r1
are in r2
so there is no missing data. Duplicated segments come from different routes using the same roads. But notice the number of unique segments is the same, and only the number of duplicates is different.
I've checked and found some cases of duplicated routes in the batch_read results that are not duplicated in the batch_read2
s1 = r1[r1$route_number == 721,] #4 rows duplicated each segment
s2 = r2[r2$route_number == 721,] #2 rows no duplication
qtm(s1, lines.lwd = 4) + qtm(s2, lines.col = "red")
3) batch_read2 returns a tibble, not a data frame
Sounds good and happy to merge. I see this though, can you resolve conflicts in journey2.R and approve the PR?
@Robinlovelace I've fixed conflict and switched the batch()
function to using my new batch_read()
function. The old code is still there but commented out for now. I've also squeezed some extra performance out by switching to data.table
and stringi
these tricks could be used elsewhere in the package.
I've also fixed a bunch of existing warnings with package (undocumented params etc)
I can't merge this repo, not an admin?
Managed to merge :tada: catch up tomorrow.
@Robinlovelace I've updated this PR with a new batch_read2 function that is 6x times faster, as based on journey2 code.
Test file was 200 MB of commute routes around Edinburgh.
Outputs are not identical so need some checking.