Updates from Malcolm

mem48 commented 1 year ago

@Robinlovelace I've updated this PR with a new batch_read2 function that is about 6x faster, based on the journey2 code.

system.time({r1 = batch_read(file)})
   user  system elapsed 
 241.40    2.30  242.94 

system.time({r2 = batch_read2(file)})
   user  system elapsed                                                                                                                                        
  41.24    0.46   39.94

Test file was 200 MB of commute routes around Edinburgh.
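
As a quick check of the "about 6x" figure, the elapsed times printed above can be divided directly. This is only a sketch: the two numbers are copied from the output above, and time_elapsed() is a hypothetical helper (not part of the package) for repeating the measurement on other files.

```r
# Elapsed times in seconds, copied from the system.time() output above
elapsed_old <- 242.94  # batch_read
elapsed_new <- 39.94   # batch_read2

# Ratio of old to new elapsed time
speedup <- elapsed_old / elapsed_new
print(round(speedup, 2))

# Hypothetical helper: return only the elapsed component of system.time()
time_elapsed <- function(expr) {
  system.time(expr)[["elapsed"]]
}

# Toy usage on a cheap expression; a real run would pass batch_read(file) instead
toy_time <- time_elapsed(sum(seq_len(1e5)))
```

A single timed run on one file is a rough guide rather than a careful benchmark, so the ratio is best read as "roughly six times faster on this input".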

Outputs are not identical so need some checking.

> summary(r1)
 route_number           name             distances            time           busynance          elevations    
 Length:205753      Length:205753      Min.   :    1.0   Min.   :   1.00   Min.   :     1.0   Min.   : NA     
 Class :character   Class :character   1st Qu.:   39.0   1st Qu.:   8.00   1st Qu.:    78.0   1st Qu.: NA     
 Mode  :character   Mode  :character   Median :  107.0   Median :  22.00   Median :   218.0   Median : NA     
                                       Mean   :  242.3   Mean   :  42.51   Mean   :   800.3   Mean   :NaN     
                                       3rd Qu.:  280.0   3rd Qu.:  50.00   3rd Qu.:   635.0   3rd Qu.: NA     
                                       Max.   :11541.0   Max.   :1735.00   Max.   :266522.0   Max.   : NA     
                                                                                              NA's   :205753  
 start_longitude  start_latitude  finish_longitude finish_latitude crow_fly_distance    event               whence        
 Min.   :-3.706   Min.   :55.83   Min.   :-3.702   Min.   :55.85   Min.   :   10     Length:205753      Min.   :1.69e+09  
 1st Qu.:-3.447   1st Qu.:55.91   1st Qu.:-3.438   1st Qu.:55.91   1st Qu.: 1680     Class :character   1st Qu.:1.69e+09  
 Median :-3.279   Median :55.95   Median :-3.280   Median :55.94   Median : 3550     Mode  :character   Median :1.69e+09  
 Mean   :-3.321   Mean   :55.96   Mean   :-3.317   Mean   :55.96   Mean   : 4605                        Mean   :1.69e+09  
 3rd Qu.:-3.192   3rd Qu.:55.98   3rd Qu.:-3.197   3rd Qu.:55.97   3rd Qu.: 6169                        3rd Qu.:1.69e+09  
 Max.   :-3.083   Max.   :56.17   Max.   :-3.104   Max.   :56.17   Max.   :28973                        Max.   :1.69e+09  

     speed      itinerary     plan               note               length        quietness           west       
 Min.   :16   Min.   :0   Length:205753      Length:205753      Min.   :   10   Min.   :  1.00   Min.   :-3.716  
 1st Qu.:16   1st Qu.:0   Class :character   Class :character   1st Qu.: 2411   1st Qu.: 40.00   1st Qu.:-3.465  
 Median :16   Median :0   Mode  :character   Mode  :character   Median : 4787   Median : 60.00   Median :-3.295  
 Mean   :16   Mean   :0                                         Mean   : 6117   Mean   : 59.96   Mean   :-3.347  
 3rd Qu.:16   3rd Qu.:0                                         3rd Qu.: 8139   3rd Qu.: 80.00   3rd Qu.:-3.224  
 Max.   :16   Max.   :0                                         Max.   :50969   Max.   :100.00   Max.   :-3.105  

     south            east            north         leaving            arriving         grammesCO2saved    calories    
 Min.   :55.81   Min.   :-3.702   Min.   :55.85   Length:205753      Length:205753      Min.   :   2    Min.   :  0.0  
 1st Qu.:55.90   1st Qu.:-3.413   1st Qu.:55.93   Class :character   Class :character   1st Qu.: 449    1st Qu.: 37.0  
 Median :55.93   Median :-3.240   Median :55.96   Mode  :character   Mode  :character   Median : 892    Median : 84.0  
 Mean   :55.95   Mean   :-3.291   Mean   :55.98                                         Mean   :1140    Mean   :106.7  
 3rd Qu.:55.97   3rd Qu.:-3.174   3rd Qu.:55.98                                         3rd Qu.:1517    3rd Qu.:143.0  
 Max.   :56.16   Max.   :-3.083   Max.   :56.17                                         Max.   :9501    Max.   :782.0  

   edition          gradient_segment   elevation_change provisionName      gradient_smooth             geometry     
 Length:205753      Min.   :0.000000   Min.   :  0.00   Length:205753      Min.   :0.000000   LINESTRING   :205753  
 Class :character   1st Qu.:0.005952   1st Qu.:  1.00   Class :character   1st Qu.:0.005906   epsg:4326    :     0  
 Mode  :character   Median :0.017241   Median :  2.00   Mode  :character   Median :0.017013   +proj=long...:     0  
                    Mean   :0.024794   Mean   :  4.71                      Mean   :0.022349                         
                    3rd Qu.:0.033333   3rd Qu.:  6.00                      3rd Qu.:0.032258                         
                    Max.   :1.000000   Max.   :123.00                      Max.   :0.271186 

> summary(r2)
      id                 time           busynance          quietness      signalledJunctions signalledCrossings
 Length:205722      Min.   :   1.00   Min.   :     1.0   Min.   :  1.00   Min.   :0.00000    Min.   :0.0000    
 Class :character   1st Qu.:   8.00   1st Qu.:    78.0   1st Qu.: 40.00   1st Qu.:0.00000    1st Qu.:0.0000    
 Mode  :character   Median :  22.00   Median :   218.0   Median : 60.00   Median :0.00000    Median :0.0000    
                    Mean   :  42.51   Mean   :   800.4   Mean   : 59.96   Mean   :0.06367    Mean   :0.1532    
                    3rd Qu.:  50.00   3rd Qu.:   635.0   3rd Qu.: 80.00   3rd Qu.:0.00000    3rd Qu.:0.0000    
                    Max.   :1735.00   Max.   :266522.0   Max.   :100.00   Max.   :6.00000    Max.   :9.0000    

     name                walk          elevations          distances           type             legNumber    distance      
 Length:205722      Min.   :0.00000   Length:205722      Min.   :    1.0   Length:205722      Min.   :1   Min.   :    1.0  
 Class :character   1st Qu.:0.00000   Class :character   1st Qu.:   39.0   Class :character   1st Qu.:1   1st Qu.:   39.0  
 Mode  :character   Median :0.00000   Mode  :character   Median :  107.0   Mode  :character   Median :1   Median :  107.0  
                    Mean   :0.02978                      Mean   :  242.3                      Mean   :1   Mean   :  242.3  
                    3rd Qu.:0.00000                      3rd Qu.:  280.0                      3rd Qu.:1   3rd Qu.:  280.0  
                    Max.   :1.00000                      Max.   :11541.0                      Max.   :1   Max.   :11541.0  

      flow            turn            startBearing      color           provisionName               geometry     
 Min.   :0.00     Length:205722      Min.   :  0.0   Length:205722      Length:205722      LINESTRING   :205722  
 1st Qu.:0.00     Class :character   1st Qu.: 90.0   Class :character   Class :character   epsg:4326    :     0  
 Median :0.00     Mode  :character   Median :180.0   Mode  :character   Mode  :character   +proj=long...:     0  
 Mean   :0.02                        Mean   :184.5                                                               
 3rd Qu.:0.00                        3rd Qu.:270.0                                                               
 Max.   :1.00                        Max.   :359.0                                                               
 NA's   :191329                                                                                                  
    start              finish          start_longitude  start_latitude  finish_longitude finish_latitude crow_fly_distance
 Length:205722      Length:205722      Min.   :-3.706   Min.   :55.83   Min.   :-3.702   Min.   :55.85   Min.   :   10    
 Class :character   Class :character   1st Qu.:-3.447   1st Qu.:55.91   1st Qu.:-3.438   1st Qu.:55.91   1st Qu.: 1681    
 Mode  :character   Mode  :character   Median :-3.279   Median :55.95   Median :-3.280   Median :55.94   Median : 3551    
                                       Mean   :-3.321   Mean   :55.96   Mean   :-3.317   Mean   :55.96   Mean   : 4606    
                                       3rd Qu.:-3.192   3rd Qu.:55.98   3rd Qu.:-3.197   3rd Qu.:55.97   3rd Qu.: 6169    
                                       Max.   :-3.083   Max.   :56.17   Max.   :-3.104   Max.   :56.17   Max.   :28973    

    event              whence              speed      itinerary     plan               note               length     
 Length:205722      Length:205722      Min.   :16   Min.   :0   Length:205722      Length:205722      Min.   :   10  
 Class :character   Class :character   1st Qu.:16   1st Qu.:0   Class :character   Class :character   1st Qu.: 2413  
 Mode  :character   Mode  :character   Median :16   Median :0   Mode  :character   Mode  :character   Median : 4787  
                                       Mean   :16   Mean   :0                                         Mean   : 6118  
                                       3rd Qu.:16   3rd Qu.:0                                         3rd Qu.: 8139  
                                       Max.   :16   Max.   :0                                         Max.   :50969  

      west            south            east            north         leaving            arriving         grammesCO2saved
 Min.   :-3.716   Min.   :55.81   Min.   :-3.702   Min.   :55.85   Length:205722      Length:205722      Min.   :   2   
 1st Qu.:-3.465   1st Qu.:55.90   1st Qu.:-3.413   1st Qu.:55.93   Class :character   Class :character   1st Qu.: 450   
 Median :-3.295   Median :55.93   Median :-3.240   Median :55.96   Mode  :character   Mode  :character   Median : 892   
 Mean   :-3.347   Mean   :55.95   Mean   :-3.291   Mean   :55.98                                         Mean   :1141   
 3rd Qu.:-3.224   3rd Qu.:55.97   3rd Qu.:-3.174   3rd Qu.:55.98                                         3rd Qu.:1517   
 Max.   :-3.105   Max.   :56.16   Max.   :-3.083   Max.   :56.17                                         Max.   :9501   

    calories       edition          gradient_segment   elevation_change  gradient_smooth   
 Min.   :  0.0   Length:205722      Min.   :0.000000   Min.   :  0.000   Min.   :0.000000  
 1st Qu.: 37.0   Class :character   1st Qu.:0.005952   1st Qu.:  1.000   1st Qu.:0.005917  
 Median : 84.0   Mode  :character   Median :0.017241   Median :  2.000   Median :0.017043  
 Mean   :106.8                      Mean   :0.024796   Mean   :  4.711   Mean   :0.022375  
 3rd Qu.:143.0                      3rd Qu.:0.033333   3rd Qu.:  6.000   3rd Qu.:0.032258  
 Max.   :782.0                      Max.   :1.000000   Max.   :123.000   Max.   :0.271186
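
The two summaries above already show several structural differences: 205753 rows against 205722 (31 fewer), columns such as route_number that appear only in r1, columns such as id, walk and signalledJunctions that appear only in r2, and columns such as elevations and whence that change type. A minimal sketch of how such a comparison could be scripted follows. compare_outputs() is a hypothetical helper, and the two small data frames are toy stand-ins for r1 and r2, not real data.

```r
# Toy stand-ins for the two results (hypothetical values, only a few columns)
r1 <- data.frame(route_number = c("1", "1", "2"),
                 time = c(8, 22, 50),
                 elevations = NA_real_,          # numeric and all NA, as in summary(r1)
                 stringsAsFactors = FALSE)
r2 <- data.frame(id = c("1", "2"),
                 time = c(8, 50),
                 elevations = c("10,12", "3,4"), # character, as in summary(r2)
                 walk = c(0, 0),
                 stringsAsFactors = FALSE)

# Hypothetical helper: list column, type and row-count differences
compare_outputs <- function(old, new) {
  shared <- intersect(names(old), names(new))
  class_old <- vapply(shared, function(n) class(old[[n]])[1], character(1))
  class_new <- vapply(shared, function(n) class(new[[n]])[1], character(1))
  list(
    only_in_old   = setdiff(names(old), names(new)),
    only_in_new   = setdiff(names(new), names(old)),
    type_mismatch = shared[unname(class_old) != unname(class_new)],
    row_diff      = nrow(old) - nrow(new)
  )
}

cmp <- compare_outputs(r1, r2)
str(cmp)
```

Applied to the real objects, the same helper would list every column that needs renaming or type conversion before the outputs could be called identical.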

Robinlovelace commented 1 year ago

Aha the benchmark is convincing. Are you sure you're comparing with the most recent dev version of cyclestreets? I guess so. Impressive! I wouldn't want to merge until the outputs are the same...

Robinlovelace commented 1 year ago

Heads-up @mem48, another way to speed it up: return less data. That's what the latest version of batch() does.

mem48 commented 1 year ago

> Are you sure you're comparing with the most recent dev version of cyclestreets?

I compared it against the current cyclestreets/master

> I wouldn't want to merge until the outputs are the same

It is safe to merge as a separate function, but the same output would be ideal. The main differences are:

1) batch_read2 returns more columns. I don't think there is a performance gain in removing data, as my code pulls all the data from the JSON in one step. I'd have to add code to remove columns, which would add time, and that can easily be done outside of the function. It makes more sense to me that a function that reads a file simply returns everything.

2) batch_read2 returns fewer rows. I think that batch_read is introducing a small number of duplicates:

> summary(duplicated(r1$geometry))
   Mode   FALSE    TRUE 
logical   35798  169955 
> summary(duplicated(r2$geometry))
   Mode   FALSE    TRUE 
logical   35798  169924

All the geometries in r1 are in r2, so there is no missing data. Duplicated segments come from different routes using the same roads. But notice that the number of unique segments is the same; only the number of duplicates differs.

I've checked and found some cases of duplicated routes in the batch_read results that are not duplicated in the batch_read2 results:

library(tmap) # provides qtm()
s1 = r1[r1$route_number == 721,] # 4 rows: each segment is duplicated
s2 = r2[r2$route_number == 721,] # 2 rows: no duplication
qtm(s1, lines.lwd = 4) + qtm(s2, lines.col = "red")

3) batch_read2 returns a tibble, not a data frame
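
The duplicate counts quoted in point 2 differ by 169955 - 169924 = 31, which matches the 31-row difference between r1 and r2, while the number of unique geometries (35798) is the same in both. A minimal sketch of that check, plus the "drop columns outside the function" idea from point 1, is below. It assumes toy data: the geometry column is a plain character vector standing in for the real sf geometries, and count_dups() is a hypothetical helper.

```r
# Hypothetical stand-ins: geometry held as strings rather than sf geometries
r1 <- data.frame(
  route_number = c("721", "721", "721", "721", "9"),
  geometry = c("LINE a", "LINE b", "LINE a", "LINE b", "LINE a"),
  stringsAsFactors = FALSE
)
r2 <- data.frame(
  id = c("721", "721", "9"),
  geometry = c("LINE a", "LINE b", "LINE a"),
  walk = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count unique and repeated geometries
count_dups <- function(g) {
  d <- duplicated(g)
  c(unique = sum(!d), duplicated = sum(d))
}

dups1 <- count_dups(r1$geometry)
dups2 <- count_dups(r2$geometry)

# Every geometry in the old output should also exist in the new one
all_present <- all(r1$geometry %in% r2$geometry)

# The surplus duplicates should account for the whole row difference
extra_dups <- dups1[["duplicated"]] - dups2[["duplicated"]]
rows_explained <- extra_dups == nrow(r1) - nrow(r2)

# Point 1: trimming columns after reading, outside the reader function
keep <- c("id", "geometry")
r2_small <- r2[, intersect(keep, names(r2)), drop = FALSE]
```

If all_present and rows_explained are both TRUE on the real data, the missing rows are explained entirely by the extra duplicates rather than by lost segments.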

Robinlovelace commented 1 year ago

Sounds good and happy to merge. I see this though, can you resolve conflicts in journey2.R and approve the PR?

mem48 commented 1 year ago

@Robinlovelace I've fixed the conflict and switched the batch() function to use my new batch_read() function. The old code is still there but commented out for now. I've also squeezed out some extra performance by switching to data.table and stringi; these tricks could be used elsewhere in the package.

I've also fixed a bunch of existing warnings with the package (undocumented params etc.)
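
The comment above does not say which data.table or stringi calls were used, so the following is only an illustration of the kind of swap that is often made, not the package's actual code. It assumes both packages are installed and uses toy data: do.call(rbind, ...) is replaced with data.table::rbindlist(), and strsplit() with stringi::stri_split_fixed().

```r
library(data.table)
library(stringi)

# Toy per-route pieces (hypothetical values)
pieces <- list(
  data.frame(id = "1", coords = "-3.20,55.95 -3.21,55.96", stringsAsFactors = FALSE),
  data.frame(id = "2", coords = "-3.30,55.90 -3.31,55.91", stringsAsFactors = FALSE)
)

# Base R: bind the pieces row-wise, then split each string on spaces
base_bound <- do.call(rbind, pieces)
base_split <- strsplit(base_bound$coords, " ", fixed = TRUE)

# data.table and stringi equivalents of the same two steps
dt_bound <- rbindlist(pieces)
dt_split <- stri_split_fixed(dt_bound$coords, " ")

# Both approaches should give the same pieces
same_result <- identical(base_split, dt_split)
```

On two small data frames the difference is negligible; any gain would only show up when binding or splitting many thousands of elements.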

mem48 commented 1 year ago

I can't merge this PR. Am I not an admin?

Robinlovelace commented 1 year ago

Managed to merge :tada: catch up tomorrow.

cyclestreets / cyclestreets-r

Updates from Malcolm #61