Bug in code that generates the mode-split csv file

Siequnu commented 3 years ago

The development appears to be gaining several thousand new commuters in Go Active mode?

Robinlovelace commented 3 years ago

Woops. Yes will take a look.

Robinlovelace commented 3 years ago

There's also an issue for Allerton Bywater:

Peek 2021-03-07 16-55

Robinlovelace commented 3 years ago

Doing a sanity test just now on data from Allerton Bywater and it seems OK:

> (shift_totals = dutch_mode_totals - base_mode_totals)
 walk_godutch cycle_godutch drive_godutch 
           13            68           -81 
> sum(shift_totals)
[1] 0
> base_mode_totals = desire_lines_final %>%
+   sf::st_drop_geometry() %>%
+   select(matches("base")) %>% 
+   select(-matches("all|tri")) %>% 
+   colSums()
> dutch_mode_totals = desire_lines_final %>%
+   sf::st_drop_geometry() %>%
+   select(matches("dutch")) %>% 
+   select(-matches("all|tri")) %>% 
+   colSums()
> 
> (shift_totals = dutch_mode_totals - base_mode_totals)
 walk_godutch cycle_godutch drive_godutch 
           13            68           -81 
> sum(shift_totals)
[1] 0

Robinlovelace commented 3 years ago

Reproducible script suggests that the data in the desire-lines-many.geojson file is OK:

# Aim: check scenario results for https://github.com/cyipt/actdev/issues/129

library(tidyverse)

f = "https://github.com/cyipt/actdev/raw/main/data-small/allerton-bywater/desire-lines-many.geojson"
desire_lines_final = sf::read_sf(f)
desire_lines_final
#> Simple feature collection with 30 features and 12 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -1.55257 ymin: 53.72459 xmax: -1.349795 ymax: 53.802
#> geographic CRS: WGS 84
#> # A tibble: 30 x 13
#>    geo_code1        geo_code2 purpose all_base trimode_base walk_base cycle_base
#>    <chr>            <chr>     <chr>      <dbl>        <dbl>     <dbl>      <dbl>
#>  1 allerton-bywater E02002398 commute        2            2         0          0
#>  2 allerton-bywater E02002399 commute        7            6         0          0
#>  3 allerton-bywater E02002402 commute        6            4         0          0
#>  4 allerton-bywater E02002402 commute        6            5         0          0
#>  5 allerton-bywater E02002403 commute        8            5         0          0
#>  6 allerton-bywater E02002403 commute        8            7         1          0
#>  7 allerton-bywater E02002404 commute        4            2         0          0
#>  8 allerton-bywater E02002416 commute        7            5         0          0
#>  9 allerton-bywater E02002416 commute        8            6         1          0
#> 10 allerton-bywater E02002417 commute        6            5         1          0
#> # … with 20 more rows, and 6 more variables: drive_base <dbl>, length <dbl>,
#> #   walk_godutch <dbl>, cycle_godutch <dbl>, drive_godutch <dbl>,
#> #   geometry <LINESTRING [°]>

base_mode_totals = desire_lines_final %>%
  sf::st_drop_geometry() %>%
  select(matches("base")) %>% 
  select(-matches("all|tri")) %>% 
  colSums()
dutch_mode_totals = desire_lines_final %>%
  sf::st_drop_geometry() %>%
  select(matches("dutch")) %>% 
  select(-matches("all|tri")) %>% 
  colSums()

base_mode_totals
#>  walk_base cycle_base drive_base 
#>         32          4        268
dutch_mode_totals
#>  walk_godutch cycle_godutch drive_godutch 
#>            45            72           187

(shift_totals = dutch_mode_totals - base_mode_totals)
#>  walk_godutch cycle_godutch drive_godutch 
#>            13            68           -81
if(sum(shift_totals) != 0) stop("Mode totals do not add up.")

min_vals = sapply(desire_lines_final %>% sf::st_drop_geometry() %>% select_if(is.numeric), min)
if(any(min_vals < 0)) stop("Negative values detected")

^{Created on 2021-03-07 by the reprex package (v1.0.0)}

Robinlovelace commented 3 years ago

Raising the question: how are those graphs calculated @Siequnu ? It would help fix the data if you could point to the data in this repo that is spurious or at least the code in the UI repo that reads-it in. Clear something is amiss.

Robinlovelace commented 3 years ago

My guess: the graph can best be fixed with code in the UI codebase. Here is what I think caused it.

The attributes were as follows:

Source: https://github.com/cyipt/actdev/blob/14f72705226ba36ca519a5f528b4e887b30c46a8/data-small/allerton-bywater/desire-lines-many.geojson

Now the columns appear in different positions, with the unwanted 'pdrive' variable gone:

Source: https://github.com/cyipt/actdev/blob/main/data-small/allerton-bywater/desire-lines-many.geojson

This could be fixed on the data side, by adding in the previous column. But I suggest it's worth fixing it on the UI side, using variable names not positions as the basis of the values.

Caveat I'm not 100% sure that is the cause, just a hunch based on the quick diagnostic checks that can help identify the cause of issues like this. @mvl22 and @Siequnu let me know: have I missed something?

Robinlovelace commented 3 years ago

Just assigned everyone as it may be a team effort to resolve this one. Priority issue.

Siequnu commented 3 years ago

The chart is simply generated by parsing the mode-split.csv for each region.

In this case, Great-Kneighton (among others) gained some really odd values after this commit yesterday.

Siequnu commented 3 years ago

This could be fixed on the data side, by adding in the previous column. But I suggest it's worth fixing it on the UI side, using variable names not positions as the basis of the values.

Variable names are already being used.

Siequnu commented 3 years ago

Taking a close look at the example above, it appears that — among others — the Great Kneighton 6-10 band gained several spurious numbers. Here's how they match up — I'm comparing the pre- and post-commit 6-10 band data for Great Kneighton. Here's the link to the .csv file.

6-10, 6-10 <— the band
10, 10
212.23671, 212
161.16725300000002, 161
4.642678, 5
37.804664, 38
118.719909, 119
51.06945699999997, 51
1.326479, 5     <— !? This is where things start to go a bit 🍐-shaped!
67, 2470          <— 2470!? the corresponding key for this is cycle_goactive
89.326479, 3026*
53, 1153 *
2, 2
18, 18
56, 56
24, 24
1, 1
32, 1165 *
42, 1427 *
25, 544 *

I've added * to places where massive numbers have suddenly appeared.

As there are no numbers in the original that are as large as the new spurious data, it wouldn't follow that this is a change in order — in any case a change in order would have no effect, as I avoid using implied indexes by default.

It appears that in some cases, a lot more than rounding has happened? But in the CSV file the header keys remain the same and in the same order.

So parsing the file and querying the key cycle_goactive for this band (6-10), as I do, will return 2470 as opposed to 67 which is what it used to be before this commit.

@Robinlovelace I've shut the actdev-ui issue you opened as this definitely seems to a data issue, and can't be resolved by UI code.

mvl22 commented 3 years ago

But I suggest it's worth fixing it on the UI side, using variable names not positions as the basis of the values.

Perhaps norms are different in the R world, but frankly I would regard use of index positions, which are inherently unstable, when names exist, as pretty sloppy work in UI development. UIs need to be defensively coded, so we avoid that kind of thing.

Siequnu commented 3 years ago

Since we are on the topic, what do the percent fields mean? We aren't currently using these for data visualisation, but in the screenshot attached, almost none of the percentages for base and goactive actually add up to 100%. Data from the great-kneighton mode-split.csv file, but many inconsistencies of this kind throughout the mode-split sites. Some percentage blocks like the 3-6 band on go_active actually add up to 3138%? Screenshot 2021-03-08 at 11 34 39

Robinlovelace commented 3 years ago

Agreed, this is an issue with the mode split csv file and it looks like a bug introduced by @joeytalbot here: https://github.com/cyipt/actdev/commit/7d84d4cfc2359aea6097f3b75c820ba2129430f5

Robinlovelace commented 3 years ago

Good detective work. Will chat with Joey and fix it ASAP.

Robinlovelace commented 3 years ago

Discussed now with @joeytalbot who is looking into it.

joeytalbot commented 3 years ago

We've discovered the problem. It wasn't the rounding. But in the same commit I regenerated the mode split files, based on the latest updates to all the other scripts.

The recently introduced disaggregation code caused many of the origin and destination geo_codes to change, and this is what was causing the problem.

I think we've fixed this now, I'm just checking through to make sure.

joeytalbot commented 3 years ago

This is now fixed. https://github.com/cyipt/actdev/commit/38748a58703d9da542302038cfb7cd3cd9d3ff03

joeytalbot commented 3 years ago

Since we are on the topic, what do the percent fields mean? We aren't currently using these for data visualisation, but in the screenshot attached, almost none of the percentages for base and goactive actually add up to 100%. Data from the great-kneighton mode-split.csv file, but many inconsistencies of this kind throughout the mode-split sites. Some percentage blocks like the 3-6 band on go_active actually add up to 3138%?

The percentages should add up to something close to 100%. But some still seem to be a few percent off.

Robinlovelace commented 3 years ago

The new data seems to produce better results but it still seems like there are issues to me so re-opening.

Can you double check @joeytalbot ?

The GIF below looks better than the GIF above for the same place (Allerton Bywater) but still looks off to me.

Peek 2021-03-08 15-02

Robinlovelace commented 3 years ago

Heads-up @joeytalbot I've just double checked the input desire line data and cannot see any issue with this so guess it must be your mode split code. Let me know, happy to chat about solutions.

``` r library(tidyverse) f = "https://github.com/cyipt/actdev/raw/main/data-small/allerton-bywater/desire-lines-many.geojson" desire_lines_final = sf::read_sf(f) desire_lines_final #> Simple feature collection with 30 features and 12 fields #> geometry type: LINESTRING #> dimension: XY #> bbox: xmin: -1.55257 ymin: 53.72459 xmax: -1.349795 ymax: 53.802 #> geographic CRS: WGS 84 #> # A tibble: 30 x 13 #> geo_code1 geo_code2 purpose all_base trimode_base walk_base cycle_base #> #> 1 allerton-bywater E02002398 commute 2 2 0 0 #> 2 allerton-bywater E02002399 commute 7 6 0 0 #> 3 allerton-bywater E02002402 commute 6 4 0 0 #> 4 allerton-bywater E02002402 commute 6 5 0 0 #> 5 allerton-bywater E02002403 commute 8 5 0 0 #> 6 allerton-bywater E02002403 commute 8 7 1 0 #> 7 allerton-bywater E02002404 commute 4 2 0 0 #> 8 allerton-bywater E02002416 commute 7 5 0 0 #> 9 allerton-bywater E02002416 commute 8 6 1 0 #> 10 allerton-bywater E02002417 commute 6 5 1 0 #> # … with 20 more rows, and 6 more variables: drive_base , length , #> # walk_godutch , cycle_godutch , drive_godutch , #> # geometry base_mode_totals = desire_lines_final %>% sf::st_drop_geometry() %>% select(matches("base")) %>% select(-matches("all|tri")) %>% colSums() dutch_mode_totals = desire_lines_final %>% sf::st_drop_geometry() %>% select(matches("dutch")) %>% select(-matches("all|tri")) %>% colSums() base_mode_totals #> walk_base cycle_base drive_base #> 32 4 268 dutch_mode_totals #> walk_godutch cycle_godutch drive_godutch #> 45 72 187 (shift_totals = dutch_mode_totals - base_mode_totals) #> walk_godutch cycle_godutch drive_godutch #> 13 68 -81 if(sum(shift_totals) != 0) stop("Mode totals do not add up.") min_vals = sapply(desire_lines_final %>% sf::st_drop_geometry() %>% select_if(is.numeric), min) if(any(min_vals < 0)) stop("Negative values detected") ``` ^{Created on 2021-03-08 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)}

Robinlovelace commented 3 years ago

Reproducible example showing it's an issue with the mode-split.csv file:

``` r library(tidyverse) f = "https://github.com/cyipt/actdev/raw/main/data-small/allerton-bywater/mode-split.csv" desire_lines_final = readr::read_csv(f) #> #> ── Column specification ──────────────────────────────────────────────────────── #> cols( #> .default = col_double(), #> distance_band = col_character() #> ) #> ℹ Use `spec()` for the full column specifications. base_results = desire_lines_final %>% select(distance_band, matches("base"), -matches("perc")) %>% tidyr::pivot_longer(cols = matches("base"), names_to = "mode") g1 = ggplot(base_results) + geom_bar(aes(distance_band, value, fill = mode), stat = "identity") go_results = desire_lines_final %>% select(distance_band, matches("go"), -matches("perc")) %>% tidyr::pivot_longer(cols = matches("go"), names_to = "mode") g2 = ggplot(go_results) + geom_bar(aes(distance_band, value, fill = mode), stat = "identity") library(patchwork) g1 + g2 ``` ![](https://i.imgur.com/aAx0WVb.png) ^{Created on 2021-03-08 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)}

joeytalbot commented 3 years ago

I'm having a look

Robinlovelace commented 3 years ago

Cheers Joey.

joeytalbot commented 3 years ago

This is now fixed in a pull request. https://github.com/cyipt/actdev/pull/139 The desire line disaggregation resulted in changes to the lengths of desire lines, and this is what led to the problem with the mode split charts.

Rounding errors will still mean some percentages add up to 101% or 99%, but I think that's fine.

Robinlovelace commented 3 years ago

Looking good, this is fixed. But should walking be going down in Go Active? This may be a problem with our uptake model code.

Peek 2021-03-09 13-20

cyipt / actdev

Bug in code that generates the mode-split csv file #129