Open PublicHealthDataGeek opened 2 years ago
The column name was changed to 'authority', which appears to work, but there is an outstanding issue of extracting other useful content. Potential examples include the date the notice was signed and the name and role of the signatory. However, these may have inconsistent locations in notices. See the code below for some examples: one gets the date and administrator, the other doesn't.
```r
library(rvest)

# Two example notices
url = "https://www.thegazette.co.uk/notice/3723277.html"
test = read_html(url)
url2 = "https://www.thegazette.co.uk/notice/3487301"
test2 = read_html(url2)

# Date signed: with and without the inner span
css_selector_date_signed = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1) > span:nth-child(1)"
css_selector_date_signed2 = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1)"
date_signed = test2 %>% html_element(css = css_selector_date_signed) %>% html_text2()
date_signed2 = test2 %>% html_element(css = css_selector_date_signed2) %>% html_text2()

# Administrator (third paragraph) and administrator name (second paragraph)
css_selector_administrator = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(3)"
administrator = test2 %>% html_element(css = css_selector_administrator) %>% html_text2()
css_selector_administrator_name = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(2)"
administrator_name = test2 %>% html_element(css = css_selector_administrator_name) %>% html_text2()
```
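Since the position of these fields seems to vary between notices, one option might be to try several candidate selectors in turn and fall back to `NA` when none match. The following is only a sketch: `extract_first()` is a hypothetical helper (not part of GazetteR), and the HTML and selectors are invented stand-ins rather than real Gazette markup.

```r
library(rvest)

# Hypothetical helper: return the text of the first element matched by any
# of the candidate selectors, or NA if none of them match.
extract_first <- function(page, selectors) {
  for (sel in selectors) {
    nodes <- html_elements(page, css = sel)
    if (length(nodes) > 0) {
      return(html_text2(nodes[[1]]))
    }
  }
  NA_character_
}

# Invented HTML standing in for a notice page (assumption, not real markup)
page <- read_html(
  '<div class="content"><div><p><span>12 March 2020</span></p><p>A. Person</p></div></div>'
)

# The first selector does not match, so the second one is used
date_signed <- extract_first(
  page,
  c("p.signed > span", "div.content p:nth-child(1) > span")
)

# No selector matches, so the result is NA rather than an error
role <- extract_first(page, "p.role")
```

Returning `NA` keeps the result usable as a column value when a notice simply does not contain the field.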
Not quite working as well as hoped. Possibly needs a sub-subtitle column and another authority column, or similar. However, looking at the linked data, some of these notices do not have 'authorisers', e.g. https://www.thegazette.co.uk/notice/3674552?view=linked-data
```r
library(GazetteR)
library(tidyverse)

environment = get_gazette_feed(categorycode = 18,
                               start_publish_date = "01/01/2020",
                               end_publish_date = "31/12/2020", tidy = TRUE) # searches the Gazette for Environmental Notices

clean_air_2020 = environment %>% filter(notice_code == 1801) # filters on notice code for clean air
clean_air_2020b = environment %>% filter(title == "Clean Air") # or can filter on title of Clean Air

clean_air_2020_notice_ids = clean_air_2020 %>%
  pull(notice_id) # pulls the unique notice ids from that dataframe

clean_air_2020_notice_contents = get_content(clean_air_2020_notice_ids, search_terms = "x")
# gets the content for the notices we are interested in
# search term can be particular words or combinations

# Can join this dataframe to the results of our get_gazette_feed so that the data is all in one dataframe
final_clean_air_2020 = left_join(clean_air_2020_notice_contents, clean_air_2020)
#> Joining, by = "notice_id"

# Show table
final_clean_air_2020[, 1:5] %>%
  knitr::kable(format = "markdown")
```
notice_id | pub_date | authority | subtitle | enabling_legislation |
---|---|---|---|---|
3674552 | 2020-11-12 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3670708 | 2020-11-06 | Scottish Government | Actions | CLEANER AIR FOR SCOTLAND 2 |
3582637 | 2020-06-24 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3582636 | 2020-06-24 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3582578 | 2020-06-24 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3582577 | 2020-06-24 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3511129 | 2020-03-04 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3511127 | 2020-03-04 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
3511128 | 2020-03-04 | NA | ENVIRONMENT ACT 1995 | DIRECTION |
Created on 2022-01-18 by the reprex package (v2.0.1)
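To see which notices come back without an authority, it may help to filter on `NA` and build the linked-data URL for each one so they can be checked by hand. This is a rough sketch: the rows are copied from the table above, the URL pattern follows the linked-data link quoted earlier, and the `linked_data_url` column name is made up for illustration.

```r
library(dplyr)

# A few rows copied from the table above
notices <- tibble(
  notice_id = c(3674552L, 3670708L, 3582637L),
  authority = c(NA, "Scottish Government", NA)
)

# Keep only notices with no authority and add a linked-data URL to inspect
missing_authority <- notices %>%
  filter(is.na(authority)) %>%
  mutate(linked_data_url = paste0(
    "https://www.thegazette.co.uk/notice/", notice_id, "?view=linked-data"
  ))
```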
For example, 3723277 returns a 'borough' of Guildford, which appears to be the location of the Highways England office. Initially it was thought that html_node("span") extracted the administrative agent, e.g. the borough, but clearly it does not. The function may need rewriting to reflect the diversity of potential HTML content in different notices.
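One possible direction for such a rewrite is to match on the text itself rather than on position in the page. The sketch below pulls anything that looks like a date out of all paragraphs. It assumes dates are written as day, month name, year (which may not hold for every notice), `find_candidate_dates()` is a hypothetical name, and the HTML is invented for illustration.

```r
library(rvest)

# Hypothetical helper: return every "day Month year" string found in the
# paragraphs of a page, or character(0) if there are none.
find_candidate_dates <- function(page) {
  paragraphs <- html_text2(html_elements(page, "p"))
  months <- paste(month.name, collapse = "|")
  pattern <- paste0("\\b\\d{1,2} (", months, ") \\d{4}\\b")
  hits <- regmatches(paragraphs, gregexpr(pattern, paragraphs, perl = TRUE))
  as.character(unlist(hits))
}

# Invented HTML standing in for a notice page (assumption, not real markup)
page <- read_html("<div><p>Some notice text.</p><p>Signed on 4 March 2020</p></div>")

candidate_dates <- find_candidate_dates(page)
```

This returns all candidates rather than choosing one, since deciding which date is the signing date would still need a rule per notice type.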