Check the web content that is extracted by get_notice_content

PublicHealthDataGeek commented 2 years ago

For example, notice 3723277 returns a 'borough' of Guildford, which appears to be the location of the Highways England office. Initially it was thought that html_node("span") extracted the administrative agent (e.g. the borough), but it clearly does not. The function may need rewriting to reflect the diversity of HTML content across different notices.

PublicHealthDataGeek commented 2 years ago

The column name has been changed to 'authority', which appears to work, but there is an outstanding issue of extracting other useful content. Potential examples include the date the notice was signed and the name and role of the signatory. However, these may be in inconsistent locations across notices. See the code below for some examples: one gets the date and administrator, the other doesn't.

library(rvest) # provides read_html(), html_element(), html_text2() and %>%

url = "https://www.thegazette.co.uk/notice/3723277.html"
test = read_html(url)
url2 = "https://www.thegazette.co.uk/notice/3487301"
test2 = read_html(url2)

# Date signed: the span inside the first paragraph, then the whole first paragraph
css_selector_date_signed = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1) > span:nth-child(1)"
css_selector_date_signed2 = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1)"
date_signed = test2 %>% html_element(css = css_selector_date_signed) %>% html_text2()
date_signed2 = test2 %>% html_element(css = css_selector_date_signed2) %>% html_text2()

# Role (third paragraph) and name (second paragraph) of the signatory
css_selector_administrator = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(3)"
administrator = test2 %>% html_element(css = css_selector_administrator) %>% html_text2()
css_selector_administrator_name = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(2)"
administrator_name = test2 %>% html_element(css = css_selector_administrator_name) %>% html_text2()
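
Because positional selectors like these only match notices with one particular layout, one option might be to try several candidate selectors and fall back to NA when none match. The following is only a rough sketch under that assumption: `extract_first_match()` is a hypothetical helper, not part of GazetteR, and the HTML and selectors are made up so the example runs offline.

```r
library(rvest)

# Hypothetical helper (not part of GazetteR): try each CSS selector in turn
# and return the first non-empty text, or NA if none of them match.
extract_first_match = function(page, selectors) {
  for (selector in selectors) {
    value = page %>% html_element(css = selector) %>% html_text2()
    if (!is.na(value) && nzchar(value)) {
      return(value)
    }
  }
  NA_character_
}

# Made-up HTML standing in for a notice page (assumption, not real Gazette markup)
page = minimal_html('
  <div class="content">
    <p><span>12 November 2020</span></p>
    <p>Name of signatory</p>
  </div>')

# The first selector does not match, so the second one is used
date_signed = extract_first_match(
  page,
  c("div.content > p.signed", "div.content > p:nth-child(1) > span")
)

# No selector matches, so the result is NA rather than an error
role = extract_first_match(page, c("div.content > p.role"))
```

Returning NA instead of failing means a notice with an unexpected layout still produces a row, which can then be inspected by hand.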

PublicHealthDataGeek commented 2 years ago

Not quite working as well as hoped. It may need a sub-subtitle column and another authority column, or similar. However, looking at the linked data, some of these notices do not have 'authorisers', e.g. https://www.thegazette.co.uk/notice/3674552?view=linked-data

library(GazetteR)
library(tidyverse)

environment = get_gazette_feed(categorycode = 18,
                     start_publish_date = "01/01/2020",
                     end_publish_date = "31/12/2020", tidy = TRUE) # searches the Gazette for Environmental Notices

clean_air_2020 = environment %>% filter(notice_code == 1801) # filters on notice code for clean air
clean_air_2020b = environment %>% filter(title == "Clean Air") # or can filter on title of Clean Air

clean_air_2020_notice_ids = clean_air_2020 %>%
  pull(notice_id) # pulls the unique notice ids from that dataframe

clean_air_2020_notice_contents = get_content(clean_air_2020_notice_ids, search_terms = "x") 
# gets the content for the notices we are interested in
# search term can be particular words or combinations

# Can join this dataframe to the results of our get_gazette_feed so that the data is all in one dataframe
final_clean_air_2020 = left_join(clean_air_2020_notice_contents, clean_air_2020)
#> Joining, by = "notice_id"

# Show table
final_clean_air_2020[, 1:5] %>%
  knitr::kable(format = "markdown")

| notice_id | pub_date   | authority           | subtitle             | enabling_legislation       |
|:----------|:-----------|:--------------------|:---------------------|:---------------------------|
| 3674552   | 2020-11-12 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3670708   | 2020-11-06 | Scottish Government | Actions              | CLEANER AIR FOR SCOTLAND 2 |
| 3582637   | 2020-06-24 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3582636   | 2020-06-24 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3582578   | 2020-06-24 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3582577   | 2020-06-24 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3511129   | 2020-03-04 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3511127   | 2020-03-04 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |
| 3511128   | 2020-03-04 | NA                  | ENVIRONMENT ACT 1995 | DIRECTION                  |

Created on 2022-01-18 by the reprex package (v2.0.1)
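
To see how many of these notices come back without an 'authority', the rows can be tallied. This is a minimal base R sketch, with the two relevant columns typed in by hand from the output of the reprex; the names `notices` and `missing_authority_ids` are only illustrative.

```r
# notice_id and authority columns copied by hand from the reprex output
notices = data.frame(
  notice_id = c(3674552, 3670708, 3582637, 3582636, 3582578,
                3582577, 3511129, 3511127, 3511128),
  authority = c(NA, "Scottish Government", NA, NA, NA,
                NA, NA, NA, NA)
)

# IDs of the notices where no authority was extracted
missing_authority_ids = notices$notice_id[is.na(notices$authority)]

# Number of notices with and without an authority
n_missing = length(missing_authority_ids)
n_present = nrow(notices) - n_missing
```

The resulting IDs could then be checked against the linked-data view of each notice to see whether an authoriser is absent from the source or simply not being picked up.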
