New data layer: accidents / collisions

dabreegster commented 4 years ago

http://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0

This could be a useful layer to visualize on top of the existing map

dabreegster commented 4 years ago

https://github.com/DataCircles/traffic_collisions_viz_team

Robinlovelace commented 4 years ago

I'm up for helping out in some way. Linked to this, I'm part of a project that has made most road casualties in Britain over the last 10+ years easy to access: https://github.com/ropensci/stats19

I am experienced with R and pretty new to Rust, but always up for learning new things.

dabreegster commented 4 years ago

Wow, stats19 is super thorough and well-documented! It seems you're well-versed in this space; are there any particular dataviz gaps that we could try to fill with A/B Street? The main advantage, versus overlaying with a typical road map background, could be to show road width and lane configuration, which might be relevant to understanding some of the accidents.

Some possible next steps, unordered:

- Import the Seattle data as a KML file. There's an example of downloading KML and converting to a slightly easier format here. abst has a KML viewer in the devtools, which would let us initially look at Seattle's data in context.
- Visualize the stats19 data. Any particular area in Britain you're initially interested in? We could import an area and do the same KML trick to start.
- Come up with a proper data format for accident data, if one doesn't already exist. Or this might not be worth it, if different agencies use very different methods for classifying accidents.
- Design and prototype a UI for visualizing the data. The existing KML viewer pretty much just lets you hover over a polygon and get a tooltip with the metadata. But maybe we'd also want some filters like

I can help mentor with whatever Rust is needed. Let me know if you have any particular goal in mind; this issue is quite under-specified right now!

Robinlovelace commented 4 years ago

Hi @dabreegster, great, thanks for suggestions, all sound reasonable and worthy to me. One quick comment on this:

> Come up with a proper data format for accident data, if one doesn't already exist. Or this might not be worth it, if different agencies use very different methods for classifying accidents.

I think that would indeed be useful and is something I've thought about. In the interests of modularity, I think a generalised data format (or perhaps better, a schema?) that various crash data types could be 'shoehorned' into would be useful. That is a big data abstraction task that is worth doing as a self-standing project IMO, allowing abstreet and other projects to build on it. No problem with such a project starting life as code/conventions in this project, where we can play around and split out the core logic in a language-agnostic way later down the line. (I was thinking of a generic road crash data R package, but I think the solution should be more generalised and language-agnostic than that, and a low-level language like Rust that can compile to a binary for any OS could be a good tool for the job.)

Regarding first thoughts on such a 'crash schema': the STATS19 data that the stats19 package provides actually has a pretty decent structure that I think could be the basis of a generic crash data file format/schema. The key tables and variables of this schema could look something like this:

- The crash table (called the 'accident' table in STATS19 data but renamed because road safety campaigners urge everyone to avoid the term 'accident' for good reason) with event-level data. Key variables:
  - Time (in a datetime type)
  - Location (lon/lat probably a good standard here)
  - N vehicles involved
  - N casualties
  - Severity (Slight, Serious, Fatal being good default levels from STATS19 IMO)
  - ... any other variables related to the crash event, ranging from road and light conditions to police/collection records and metadata such as the method of data collection, e.g. crowd-sourced (may be less reliable), or the police force or health body that collected the record
- The casualty data (info on the person killed/injured, e.g. age/sex/mode of travel they were using when hurt)
- Vehicle data (e.g. age/type/engine capacity of the vehicles involved)

In STATS19 these tables are linked by crash, casualty and vehicle id columns. A relational DB is probably overkill and just some key columns in the crash table will likely be sufficient for this issue for now, but thought it worth laying it all out.
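To make the three-table idea concrete, here is a rough sketch in Rust. Every type and field name is hypothetical and made up for illustration; this is not an existing format and not how the stats19 package stores things. It only shows crash, casualty and vehicle rows linked by a crash id column.

```rust
// Sketch only: every type and field name below is hypothetical.

/// Severity levels borrowed from STATS19, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Slight,
    Serious,
    Fatal,
}

/// One row per crash event, holding the key variables listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct Crash {
    pub crash_id: String,
    /// Kept as text here (e.g. ISO 8601); a real schema would use a datetime type.
    pub datetime: String,
    pub lon: f64,
    pub lat: f64,
    pub n_vehicles: u32,
    pub n_casualties: u32,
    pub severity: Severity,
}

/// One row per person killed or injured, linked to a crash by `crash_id`.
#[derive(Debug, Clone, PartialEq)]
pub struct Casualty {
    pub crash_id: String,
    pub casualty_id: u32,
    pub age: Option<u8>,
    pub mode_of_travel: String,
}

/// One row per vehicle involved, linked to a crash by `crash_id`.
#[derive(Debug, Clone, PartialEq)]
pub struct Vehicle {
    pub crash_id: String,
    pub vehicle_id: u32,
    pub vehicle_type: String,
}

/// Find the casualty rows belonging to one crash (a join on `crash_id`).
pub fn casualties_of<'a>(crash: &Crash, casualties: &'a [Casualty]) -> Vec<&'a Casualty> {
    casualties
        .iter()
        .filter(|c| c.crash_id == crash.crash_id)
        .collect()
}
```

For this issue, the `Crash` struct alone would probably be enough; the other two tables are only there to show how the links would work.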

Re visualisation, there are some nice images here in a Department for Transport project I'm leading showing our attempts to aggregate crash data to meaningful geographic entities (junctions and road sections) that could fit into the vis side: https://github.com/saferactive/saferactive/issues/36

I know the dangers associated with 'mission creep', so I agree that just getting the data in one place would be a good start. Re case study city, I think Leeds would be ideal, as shown in the reproducible example below (requires an up-to-date R installation with spatial libraries such as GDAL).

Robinlovelace commented 4 years ago

Here's a reproducible example in R in case of use/interest getting UK crashes for 2019 FYI:

```r
# Aim: get sub area of osm area
remotes::install_github("itsleeds/osmextract")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'osmextract' from a github remote, the SHA1 (6f1ab444) has not changed since last install.
#>   Use `force = TRUE` to force installation
sub_area_name = "leeds"
area_name = "west yorkshire"
area_osm_lines = osmextract::oe_get(place = area_name)
sub_area_bbox = tmaptools::geocode_OSM(q = sub_area_name, as.sf = TRUE)
sub_area_bbox$bbox
#> Geometry set for 1 feature
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: -1.800421 ymin: 53.69897 xmax: -1.290352 ymax: 53.94587
#> geographic CRS: WGS 84
#> POLYGON ((-1.800421 53.69897, -1.800421 53.7014...
sub_area_osm_lines = area_osm_lines[sub_area_bbox$bbox, ]
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
# mapview::mapview(sub_area_osm_lines) # takes ages...
sub_area_osm_main = dplyr::filter(sub_area_osm_lines, stringr::str_detect(highway, "pri|sec|trunk|round"))
mapview::mapview(sub_area_osm_main)
```

```r
devtools::session_info()$platform
#>  setting  value
#>  version  R version 4.0.3 (2020-10-10)
#>  os       Ubuntu 18.04.5 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_GB:en
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       Europe/London
#>  date     2020-11-02
```

```r
crashes = stats19::get_stats19(year = 2019, type = "accidents")
#> Files identified: DfTRoadSafety_Accidents_2019.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2019.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ~/stats19-data/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> Reading in:
#> /home/robin/stats19-data/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> date and time columns present, creating formatted datetime column
crashes_sf = stats19::format_sf(crashes, lonlat = TRUE)
#> 28 rows removed with no coordinates
crashes_sub_area = crashes_sf[sub_area_bbox$bbox, ]
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
names(crashes_sub_area)
#>  [1] "accident_index"
#>  [2] "location_easting_osgr"
#>  [3] "location_northing_osgr"
#>  [4] "police_force"
#>  [5] "accident_severity"
#>  [6] "number_of_vehicles"
#>  [7] "number_of_casualties"
#>  [8] "date"
#>  [9] "day_of_week"
#> [10] "time"
#> [11] "local_authority_district"
#> [12] "local_authority_highway"
#> [13] "first_road_class"
#> [14] "first_road_number"
#> [15] "road_type"
#> [16] "speed_limit"
#> [17] "junction_detail"
#> [18] "junction_control"
#> [19] "second_road_class"
#> [20] "second_road_number"
#> [21] "pedestrian_crossing_human_control"
#> [22] "pedestrian_crossing_physical_facilities"
#> [23] "light_conditions"
#> [24] "weather_conditions"
#> [25] "road_surface_conditions"
#> [26] "special_conditions_at_site"
#> [27] "carriageway_hazards"
#> [28] "urban_or_rural_area"
#> [29] "did_police_officer_attend_scene_of_accident"
#> [30] "lsoa_of_accident_location"
#> [31] "datetime"
#> [32] "geometry"
mapview::mapview(sub_area_osm_main) +
  mapview::mapview(crashes_sub_area["accident_severity"])
```

Created on 2020-11-02 by the reprex package (v0.3.0)

dabreegster commented 4 years ago

Sorry this response took so long.

About a common format: I think prototyping some things here could be useful. A relational DB feels like overkill -- the same vehicle or injured person shouldn't be in multiple rows, I'd guess. Even if it happens in reality, I'd be quite surprised from a privacy perspective if any agencies exposed a single identifier for a person or vehicle over time. Squeezing the data into a flattened table may not be the most natural fit; something like protocol buffers would be multi-language, expressible in an efficient binary format or a more readable text format, and allow for hierarchical data, enums, oneof, etc. There are different protobuf implementations floating around; I'm just advocating for something with a machine-readable schema, to get type safety in languages that make use of it.

Glancing at the Seattle dataset, the variables you listed seem reasonable. They list counts of injuries, serious injuries, and fatalities, so it's not immediately clear to me how to summarize those as a single severity.
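One possible rule, purely as a strawman: take the worst outcome recorded for the collision. The sketch below assumes the three Seattle counts are available as plain integers; the function and parameter names are made up, and this rule throws away the counts themselves, which may not be acceptable.

```rust
/// STATS19-style severity levels, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Slight,
    Serious,
    Fatal,
}

/// Collapse three per-collision counts into one severity by taking the worst
/// outcome. Returns `None` when nobody was hurt. Hypothetical rule, not
/// something either dataset defines.
pub fn worst_severity(injuries: u32, serious_injuries: u32, fatalities: u32) -> Option<Severity> {
    if fatalities > 0 {
        Some(Severity::Fatal)
    } else if serious_injuries > 0 {
        Some(Severity::Serious)
    } else if injuries > 0 {
        Some(Severity::Slight)
    } else {
        None
    }
}
```

A schema could also keep both: the raw counts where an agency provides them, plus a derived single severity for filtering and coloring.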

> Re visualisation, there are some nice images here in a Department for Transport project I'm leading showing our attempts to aggregate crash data to meaningful geographic entities (junctions and road sections) that could fit into the vis side:

The examples there seem like a good first step to build here. We'd snap all the raw accident data to the nearest road segment or intersection, then color by the number of casualties. We could add some filters for time of day, weather conditions, etc (whatever else is in the dataset), and recompute the heatmap when these change.
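A minimal sketch of that snap-and-count step, assuming everything is already in a planar coordinate system and each road is one straight segment. None of these names come from the A/B Street codebase; they are made up for illustration, and intersections are left out.

```rust
/// A straight road segment between two points in planar coordinates.
#[derive(Debug, Clone, Copy)]
pub struct Segment {
    pub a: (f64, f64),
    pub b: (f64, f64),
}

/// A crash reduced to what the heatmap needs. Hypothetical fields.
#[derive(Debug, Clone, Copy)]
pub struct CrashPoint {
    pub pos: (f64, f64),
    pub casualties: u32,
    pub hour: u8,
}

/// Squared distance from point `p` to the closest point on segment `s`.
fn dist_sq_to_segment(p: (f64, f64), s: &Segment) -> f64 {
    let (dx, dy) = (s.b.0 - s.a.0, s.b.1 - s.a.1);
    let len_sq = dx * dx + dy * dy;
    // Project p onto the segment's line, clamped so we stay between a and b.
    let t = if len_sq == 0.0 {
        0.0
    } else {
        (((p.0 - s.a.0) * dx + (p.1 - s.a.1) * dy) / len_sq).clamp(0.0, 1.0)
    };
    let (cx, cy) = (s.a.0 + t * dx, s.a.1 + t * dy);
    (p.0 - cx).powi(2) + (p.1 - cy).powi(2)
}

/// Snap every crash that passes `keep` to its nearest segment and sum
/// casualties per segment. Re-run with a different `keep` when filters change.
pub fn casualties_per_segment(
    segments: &[Segment],
    crashes: &[CrashPoint],
    keep: impl Fn(&CrashPoint) -> bool,
) -> Vec<u32> {
    let mut totals = vec![0; segments.len()];
    for crash in crashes {
        if !keep(crash) {
            continue;
        }
        let nearest = segments
            .iter()
            .enumerate()
            .min_by(|(_, x), (_, y)| {
                dist_sq_to_segment(crash.pos, x).total_cmp(&dist_sq_to_segment(crash.pos, y))
            })
            .map(|(i, _)| i);
        if let Some(i) = nearest {
            totals[i] += crash.casualties;
        }
    }
    totals
}
```

For example, `casualties_per_segment(&segments, &crashes, |c| c.hour < 12)` would give morning-only totals. The brute-force nearest-segment search is fine for a sketch; a real map would want a spatial index.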

> Here's a reproducible example in R in case of use/interest getting UK crashes for 2019 FYI

I haven't used R before, but I'm quite inspired by how easy it is to follow these examples. When I get some time tomorrow or Wednesday, I'll try exporting some of this data using R to KML or CSV or some other strawman format. Then we can start prototyping a UI to the dataviz described above -- I can get things started or let you work from examples, whatever you prefer.

dabreegster commented 4 years ago

I started poking around in https://github.com/dabreegster/abstreet/tree/leeds_accidents. All the code does so far is grab the 2019 csv file, extract lat/lon, time, and severity, and write it to a file that A/B Street can view. If you unzip the attached file and put it in data/input/leeds, then use the KML viewer in the main menu > devtools to open this file, you can zoom in and see the raw data.

[Screenshot from 2020-11-03 18-03-46]

Attachment: 2019_accidents.bin.zip
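For anyone following along, the extraction step looks roughly like the sketch below. This is not the code in the branch; it is a std-only illustration. It assumes the raw CSV has header columns named `Longitude`, `Latitude`, `Date`, `Time` and `Accident_Severity` (matched case-insensitively) and that no field contains a quoted comma. Real code would use a proper CSV parser.

```rust
/// The handful of fields pulled out of each CSV row. Hypothetical struct.
#[derive(Debug, Clone, PartialEq)]
pub struct RawCrash {
    pub lon: f64,
    pub lat: f64,
    pub date: String,
    pub time: String,
    pub severity_code: u8,
}

/// Parse CSV text into crashes, skipping rows with missing or unparsable
/// values (for instance, rows with no coordinates).
/// Assumes no quoted fields, so a plain split on ',' is enough.
pub fn parse_crashes(csv: &str) -> Vec<RawCrash> {
    let mut lines = csv.lines();
    let header: Vec<&str> = match lines.next() {
        Some(h) => h.split(',').map(str::trim).collect(),
        None => return Vec::new(),
    };
    // Find a column index by name, ignoring case. Column names are assumed.
    let col = |name: &str| header.iter().position(|h| h.eq_ignore_ascii_case(name));
    let (lon, lat, date, time, sev) = match (
        col("longitude"),
        col("latitude"),
        col("date"),
        col("time"),
        col("accident_severity"),
    ) {
        (Some(a), Some(b), Some(c), Some(d), Some(e)) => (a, b, c, d, e),
        _ => return Vec::new(),
    };
    lines
        .filter_map(|line| {
            let f: Vec<&str> = line.split(',').map(str::trim).collect();
            Some(RawCrash {
                lon: f.get(lon)?.parse().ok()?,
                lat: f.get(lat)?.parse().ok()?,
                date: f.get(date)?.to_string(),
                time: f.get(time)?.to_string(),
                severity_code: f.get(sev)?.parse().ok()?,
            })
        })
        .collect()
}
```

Looking columns up by header name rather than by position keeps this working if the column order changes between years.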

I'm going to make the built-in KML viewer handle .csv files directly, to do initial data exploration more conveniently.

dabreegster commented 4 years ago

Oh yeah, I'll mention that I tried to use the stats19 package and export the cleaned-up data to KML or GeoJSON, but I'm still struggling with writeOGR. Ultimately the importer pipeline here in abst could either read the CSV files directly and do all of the transformation into some prototype of a common format, or the pipeline could invoke an R script to do some of this instead.

Robinlovelace commented 4 years ago

> Oh yeah, I'll mention that I tried to use the stats19 package and export the cleaned-up data to KML or GeoJSON, but I'm still struggling with writeOGR. Ultimately the importer pipeline here in abst could either read the CSV files directly and do all of the transformation into some prototype of a common format, or the pipeline could invoke an R script to do some of this instead.

Great you're looking at integration. I think sf::write_sf() may be what you're looking for to export data as a kml, geojson or other standard geo data format with GDAL.

Robinlovelace commented 4 years ago

It's great to see crashes on the map though, fantastic progress, will give it a try when I get a spare moment.

dabreegster commented 4 years ago

Found http://seattlecollisions.timganter.io/collisions/sd/2010-11-12/ed/2020-11-12/m/1/nelat/47.72585823292033/nelng/-122.32766741974048/swlat/47.713053596999394/swlng/-122.34054202301196 through an SNG list, and wow! This dataviz rocks.

Robinlovelace commented 3 years ago

Very nice! What's an SNG list though?

dabreegster commented 3 years ago

http://seattlegreenways.org/, one of the local advocacy groups. Somebody on one of their mailing lists linked to that dataviz.

dabreegster commented 3 years ago

@Robinlovelace pointed out that the suspicion at https://github.com/dabreegster/abstreet/blob/82c1495cc4ead40ec59436f165e4462630eadb24/collisions/src/lib.rs#L65 is true -- the severity for stats19 data is backwards
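For reference, a sketch of the mapping, assuming "backwards" refers to the numeric coding: in the published STATS19 coding, 1 is Fatal, 2 is Serious and 3 is Slight, so a larger number means a less severe crash. The function name is made up, and this is not the actual fix in the linked file.

```rust
/// Severity levels, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Slight,
    Serious,
    Fatal,
}

/// Map a STATS19 accident_severity code to a level.
/// Note the direction: 1 = Fatal, 2 = Serious, 3 = Slight.
pub fn severity_from_stats19_code(code: u8) -> Option<Severity> {
    match code {
        1 => Some(Severity::Fatal),
        2 => Some(Severity::Serious),
        3 => Some(Severity::Slight),
        _ => None,
    }
}
```

Converting to an enum at import time means the rest of the code never has to remember which way the raw numbers run.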

Robinlovelace commented 3 years ago

Worth adding the data per region in clean form (e.g. after post-processing by the stats19 R package)?

In discussions around the SaferActive project I think pre-processing can add value to the data in various ways, e.g. by adding largest vehicle involved to casualty type to find out who hit who: https://www.saferactive.org/

I'm happy to do some of that post-processing.
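As a rough illustration of the "largest vehicle involved" step (the real post-processing would be done in R), the sketch below joins vehicle rows to casualty rows by crash id. The size ranking, type names and field names are all hypothetical. It is also a simplification: the largest vehicle in a crash may be the casualty's own.

```rust
use std::collections::HashMap;

/// Vehicle classes in an assumed size order, smallest first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum VehicleSize {
    Bicycle,
    Motorcycle,
    Car,
    Van,
    Hgv,
}

pub struct VehicleRow {
    pub crash_id: String,
    pub size: VehicleSize,
}

pub struct CasualtyRow {
    pub crash_id: String,
    pub mode: String,
}

/// For each crash id, the largest vehicle class that was involved.
pub fn largest_vehicle_per_crash(vehicles: &[VehicleRow]) -> HashMap<&str, VehicleSize> {
    let mut largest = HashMap::new();
    for v in vehicles {
        let entry = largest.entry(v.crash_id.as_str()).or_insert(v.size);
        if v.size > *entry {
            *entry = v.size;
        }
    }
    largest
}

/// Pair each casualty's mode of travel with the largest vehicle in the same
/// crash, or `None` if no vehicle rows exist for that crash.
pub fn who_hit_who<'a>(
    casualties: &'a [CasualtyRow],
    vehicles: &[VehicleRow],
) -> Vec<(&'a str, Option<VehicleSize>)> {
    let largest = largest_vehicle_per_crash(vehicles);
    casualties
        .iter()
        .map(|c| (c.mode.as_str(), largest.get(c.crash_id.as_str()).copied()))
        .collect()
}
```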

dabreegster commented 3 years ago

It would be great to consume a more cleaned-up dataset here! The collision schema and viewer are still just in the prototype stage, without any thought-out design or plans for what the tool should do.

(Although one recent feature request for the 15m tool that I hope to prioritize is to optionally show, and avoid walking on, roads with high pedestrian fatalities.)

a-b-street / abstreet

New data layer: accidents / collisions #87