itsleeds / tds

Transport Data Science
https://itsleeds.github.io/tds/
Creative Commons Zero v1.0 Universal
41 stars 46 forks source link

How to get data from government sources? #72

Closed Robinlovelace closed 3 years ago

Robinlovelace commented 3 years ago

Many sites provide government data, primarily https://data.gov.uk/

Transport data can be found here: https://data.gov.uk/search?filters%5Btopic%5D=Transport

Geographic data can be found here: https://data.gov.uk/search?filters%5Btopic%5D=Mapping

and here: https://geoportal.statistics.gov.uk/

See here for more info on places to get transport data: https://github.com/ITSLeeds/TDS/blob/master/coursework-template.md#datasets

Robinlovelace commented 3 years ago

Taking an example from waste management, the R code below shows how to get data into R:

# Aim: demonstrate how to get data from UK gov sources

library(tidyverse)

# First stage: search - I searched on Google for:
# https://www.google.com/search?q=recycling+points+leeds+geojson

# That took me here:
# https://datamillnorth.org/publisher/leedscitycouncil?format=geojsonhttps://datamillnorth.org/publisher/leedscitycouncil?format=geojson

# That linked me to:
# https://datamillnorth.org/dataset/bring-sites

u = "https://datamillnorth.org/download/bring-sites/97bd60ae-ced3-4ddc-996e-f1ebe5d21136/tonnage%20to%20Sept%2020.csv"
waste_sites = read_csv(u)
waste_sites
dim(waste_sites)
skimr::skim(waste_sites)
waste_sites_unique = waste_sites %>%
  # remove missing values: data cleaning
  filter(! is.na(Longitude)) %>% 
  group_by(`Site Name`, Longitude, Latitude) %>% 
  summarise(tonnes_glass = sum(`Apr 20 Glass Tonnage Kg`, `May 20 Glass Tonnage Kg`))

waste_sites_sf = sf::st_as_sf(waste_sites_unique, coords = c("Longitude", "Latitude"), crs = 4326)

mapview::mapview(waste_sites_sf)

# remove sites that are miles away
pct_regions = pct::pct_regions
waste_sites_clean = waste_sites_sf[pct_regions, ]

library(tmap)  
tm_shape(waste_sites_clean) +
  tm_dots(size = "tonnes_glass")
Robinlovelace commented 3 years ago

See result of running that code below.

``` r # Aim: demonstrate how to get data from UK gov sources library(tidyverse) # First stage: search - I searched on Google for: # https://www.google.com/search?q=recycling+points+leeds+geojson # That took me here: # https://datamillnorth.org/publisher/leedscitycouncil?format=geojsonhttps://datamillnorth.org/publisher/leedscitycouncil?format=geojson # That linked me to: # https://datamillnorth.org/dataset/bring-sites u = "https://datamillnorth.org/download/bring-sites/97bd60ae-ced3-4ddc-996e-f1ebe5d21136/tonnage%20to%20Sept%2020.csv" waste_sites = read_csv(u) #> Warning: Missing column names filled in: 'X29' [29], 'X30' [30], 'X31' [31], #> 'X32' [32], 'X33' [33], 'X34' [34], 'X35' [35], 'X36' [36], 'X37' [37], #> 'X38' [38], 'X39' [39], 'X40' [40], 'X41' [41], 'X42' [42], 'X43' [43], #> 'X44' [44], 'X45' [45], 'X46' [46], 'X47' [47], 'X48' [48], 'X49' [49], #> 'X50' [50], 'X51' [51], 'X52' [52], 'X53' [53], 'X54' [54], 'X55' [55], #> 'X56' [56], 'X57' [57], 'X58' [58], 'X59' [59] #> #> ── Column specification ──────────────────────────────────────────────────────── #> cols( #> .default = col_logical(), #> `Site Name` = col_character(), #> Address = col_character(), #> `Post Code` = col_character(), #> `Location Type` = col_character(), #> Ward = col_character(), #> Sticker = col_character(), #> `Postcode area` = col_character(), #> `Site Number` = col_character(), #> UPRN = col_character(), #> Latitude = col_double(), #> Longitude = col_double(), #> y = col_double(), #> x = col_double(), #> `Recyclables-Mixed Glass` = col_character(), #> `Recyclables-Paper` = col_character(), #> `Recyclables-Textiles` = col_character(), #> `Recyclables Small Electrical` = col_character(), #> Type = col_character(), #> `Tally of Sites` = col_double(), #> `No of Glass Banks` = col_double() #> # ... with 6 more columns #> ) #> ℹ Use `spec()` for the full column specifications. waste_sites #> # A tibble: 403 x 59 #> `Site Name` Address `Post Code` `Location Type` Ward Sticker `Postcode area` #> #> 1 Armley Con… Armley… LS12 3NU Social Club Arml… LS12 #> 2 Abbey Inn Pollar… LS13 1EQ Pub Bram… Y LS13 #> 3 Acorn Publ… Leeds … LS13 1NS Pub Bram… Y LS13 #> 4 Adel War M… Church… LS16 8DG Social Club Adel… Y LS16 #> 5 Adrian Soc… Belle … LS8 2NN Social Club Roun… Y LS8 #> 6 Adwalton C… Moorsi… BD11 1HG Sports Club Kipp… BD11 #> 7 Aireboroug… St Osw… LS20 9BT Leisure Centre Guis… Y LS20 #> 8 Allied Gla… Haigh … LS10 1NQ Business City… LS10 #> 9 Alwoodley … The Av… LS17 7NZ Social Club Alwo… Y LS17 #> 10 Anchor Inn Victor… WF10 2DF Pub Kipp… Y WF10 #> # … with 393 more rows, and 52 more variables: Site Number , UPRN , #> # Latitude , Longitude , y , x , #> # Recyclables-Mixed Glass , Recyclables-Paper , #> # Recyclables-Textiles , Recyclables Small Electrical , #> # Non LCC Banks , Type , Tally of Sites , Cleaning , #> # No of Glass Banks , Apr 20 Glass Tonnage Kg , #> # May 20 Glass Tonnage Kg , Jun 20 Glass Tonnage Kg , #> # Jul 20 Glass Tonnage Kg , Aug 20 Glass Tonnage Kg , #> # Sep 20 Glass Tonnage Kg , X29 , X30 , X31 , X32 , #> # X33 , X34 , X35 , X36 , X37 , X38 , #> # X39 , X40 , X41 , X42 , X43 , X44 , #> # X45 , X46 , X47 , X48 , X49 , X50 , #> # X51 , X52 , X53 , X54 , X55 , X56 , #> # X57 , X58 , X59 dim(waste_sites) #> [1] 403 59 skimr::skim(waste_sites) ```
Data summary
Name waste_sites
Number of rows 403
Number of columns 59
_______________________
Column type frequency:
character 14
logical 33
numeric 12
________________________
Group variables None
Data summary **Variable type: character**
skim_variable n_missing complete_rate min max empty n_unique whitespace
Site Name 0 1.00 5 56 0 332 0
Address 0 1.00 6 56 0 380 0
Post Code 1 1.00 7 8 0 380 0
Location Type 2 1.00 3 20 0 24 0
Ward 3 0.99 6 29 0 41 0
Sticker 153 0.62 1 1 0 2 0
Postcode area 10 0.98 3 4 0 30 0
Site Number 39 0.90 1 7 0 352 0
UPRN 17 0.96 3 8 0 378 0
Recyclables-Mixed Glass 24 0.94 1 1 0 2 0
Recyclables-Paper 350 0.13 1 1 0 1 0
Recyclables-Textiles 349 0.13 1 1 0 1 0
Recyclables Small Electrical 383 0.05 1 1 0 1 0
Type 8 0.98 5 64 0 23 0
**Variable type: logical**
skim_variable n_missing complete_rate mean count
Non LCC Banks 403 0 NaN :
Cleaning 403 0 NaN :
X29 403 0 NaN :
X30 403 0 NaN :
X31 403 0 NaN :
X32 403 0 NaN :
X33 403 0 NaN :
X34 403 0 NaN :
X35 403 0 NaN :
X36 403 0 NaN :
X37 403 0 NaN :
X38 403 0 NaN :
X39 403 0 NaN :
X40 403 0 NaN :
X41 403 0 NaN :
X42 403 0 NaN :
X43 403 0 NaN :
X44 403 0 NaN :
X45 403 0 NaN :
X46 403 0 NaN :
X47 403 0 NaN :
X48 403 0 NaN :
X49 403 0 NaN :
X50 403 0 NaN :
X51 403 0 NaN :
X52 403 0 NaN :
X53 403 0 NaN :
X54 403 0 NaN :
X55 403 0 NaN :
X56 403 0 NaN :
X57 403 0 NaN :
X58 403 0 NaN :
X59 403 0 NaN :
**Variable type: numeric**
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Latitude 4 0.99 53.81 0.05 53.72 53.78 53.81 53.83 53.94 ▃▇▇▂▂
Longitude 4 0.99 -1.52 0.29 -1.73 -1.61 -1.55 -1.48 1.59 ▇▁▁▁▁
y 4 0.99 435026.93 5276.03 425053.00 431561.50 434782.00 437156.50 449157.00 ▅▇▇▂▂
x 4 0.99 430223.87 6230.77 417833.00 426014.50 429440.00 434559.50 445532.00 ▃▇▇▃▃
Tally of Sites 33 0.92 1.01 0.16 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
No of Glass Banks 42 0.90 1.89 1.66 0.00 1.00 1.00 3.00 12.00 ▇▂▁▁▁
Apr 20 Glass Tonnage Kg 145 0.64 3886.02 6212.30 0.00 528.75 1562.00 3902.25 40130.00 ▇▁▁▁▁
May 20 Glass Tonnage Kg 145 0.64 4918.67 8277.63 0.00 619.50 1714.50 4443.25 55487.00 ▇▁▁▁▁
Jun 20 Glass Tonnage Kg 133 0.67 5172.92 8200.52 0.00 654.75 1898.50 5113.75 51871.00 ▇▁▁▁▁
Jul 20 Glass Tonnage Kg 133 0.67 4307.68 7254.31 0.00 628.64 1512.59 4572.34 50283.99 ▇▁▁▁▁
Aug 20 Glass Tonnage Kg 151 0.63 4014.84 6553.86 62.64 640.01 1444.75 3772.38 45626.10 ▇▁▁▁▁
Sep 20 Glass Tonnage Kg 97 0.76 3172.18 5501.54 0.00 506.72 1076.65 2512.98 37572.20 ▇▁▁▁▁
``` r waste_sites_unique = waste_sites %>% # remove missing values: data cleaning filter(! is.na(Longitude)) %>% group_by(`Site Name`, Longitude, Latitude) %>% summarise(tonnes_glass = sum(`Apr 20 Glass Tonnage Kg`, `May 20 Glass Tonnage Kg`)) #> `summarise()` has grouped output by 'Site Name', 'Longitude'. You can override using the `.groups` argument. waste_sites_sf = sf::st_as_sf(waste_sites_unique, coords = c("Longitude", "Latitude"), crs = 4326) mapview::mapview(waste_sites_sf) ``` ![](https://i.imgur.com/xJ3Ng8G.png) ``` r # remove sites that are miles away pct_regions = pct::pct_regions waste_sites_clean = waste_sites_sf[pct_regions, ] #> although coordinates are longitude/latitude, st_intersects assumes that they are planar #> although coordinates are longitude/latitude, st_intersects assumes that they are planar library(tmap) tm_shape(waste_sites_clean) + tm_dots(size = "tonnes_glass") #> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1 #> Legend labels were too wide. Therefore, legend.text.size has been set to 0.66. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger. #> The legend is too narrow to place all symbol sizes. ``` ![](https://i.imgur.com/ftHQlCK.png) Created on 2021-03-22 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)