PublicHealthDataGeek / GazetteR

R package to extract data from the Gazette

Check the web content that is extracted by get_notice_content #3

Open PublicHealthDataGeek opened 2 years ago

PublicHealthDataGeek commented 2 years ago

For example, notice 3723277 returns a 'borough' of Guildford, which appears to be the location of the Highways England office. I initially thought that html_node("span") extracted the administrative agent (e.g. the borough), but it clearly does not. The function may need rewriting to reflect the diversity of HTML content in different notices.
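
A minimal sketch of why the first `span` may not be the field wanted: `html_element()` (the newer name for `html_node()` in rvest 1.0.0 and later) returns only the first match, so whichever `<span>` comes first in a notice is what gets labelled. The HTML below is invented for the example and is not real Gazette markup.

```r
library(rvest)

# Hypothetical notice fragment: the office location happens to come first
page <- read_html(
  "<div class='content'>
     <span>Office location</span>
     <span>Authority name</span>
   </div>"
)

# html_element() returns only the first matching node
first_span <- page %>% html_element("span") %>% html_text2()

# html_elements() returns every match, which helps when inspecting a notice
all_spans <- page %>% html_elements("span") %>% html_text2()

print(first_span)
print(all_spans)
```

Listing every span with `html_elements()` for a handful of notices may be a quick way to see how much the layout varies before rewriting the function.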

PublicHealthDataGeek commented 2 years ago

I changed the column name to 'authority', which appears to work, but there is an outstanding issue of extracting other useful content. Potential examples include the date the notice was signed and the name and role of the signatory. However, these may be in inconsistent locations across notices. See the code below for some examples: one gets the date and administrator, the other doesn't.

```r
library(rvest) # provides read_html(), html_element(), html_text2() and %>%

url = "https://www.thegazette.co.uk/notice/3723277.html"
test = read_html(url)
url2 = "https://www.thegazette.co.uk/notice/3487301"
test2 = read_html(url2)

# Date signed: the <span> inside the first paragraph, then the whole first paragraph
css_selector_date_signed = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1) > span:nth-child(1)"
css_selector_date_signed2 = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(1)"
date_signed = test2 %>% html_element(css = css_selector_date_signed) %>% html_text2()
date_signed2 = test2 %>% html_element(css = css_selector_date_signed2) %>% html_text2()

# Administrator (third paragraph) and administrator name (second paragraph)
css_selector_administrator = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(3)"
administrator = test2 %>% html_element(css = css_selector_administrator) %>% html_text2()
css_selector_administrator_name = "div.content:nth-child(3) > div:nth-child(5) > p:nth-child(2)"
administrator_name = test2 %>% html_element(css = css_selector_administrator_name) %>% html_text2()
```
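
Because the position of these fields seems to vary between notices, one possible approach is to try several candidate selectors and fall back to `NA` when none match. The sketch below assumes this pattern; `extract_first()` is a hypothetical helper (not part of GazetteR), and the HTML and the selectors `span.signed` and `p.date-signed` are invented for illustration rather than taken from real notices.

```r
library(rvest)

# Two hypothetical notice layouts (made-up HTML, not real Gazette markup)
notice_a <- read_html(
  "<div class='content'><p><span class='signed'>12 November 2020</span></p></div>"
)
notice_b <- read_html(
  "<div class='content'><p>No signature block in this notice</p></div>"
)

# Hypothetical helper: try each selector in turn and return the first
# non-empty text found, or NA if no selector matches
extract_first <- function(page, selectors) {
  for (sel in selectors) {
    txt <- page %>% html_element(css = sel) %>% html_text2()
    if (!is.na(txt) && nzchar(txt)) {
      return(txt)
    }
  }
  NA_character_
}

date_selectors <- c("span.signed", "p.date-signed")
date_a <- extract_first(notice_a, date_selectors)
date_b <- extract_first(notice_b, date_selectors)

print(date_a)
print(date_b)
```

This relies on `html_element()` returning a missing node, and `html_text2()` returning `NA` for it, when a selector finds nothing, so a notice without the field yields `NA` rather than an error.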
PublicHealthDataGeek commented 2 years ago

This is not working quite as well as hoped. It may need a sub-subtitle and another authority column, or similar. However, looking at the linked data, some of these notices do not have 'authorisers', e.g. https://www.thegazette.co.uk/notice/3674552?view=linked-data

library(GazetteR)
library(tidyverse)

# searches the Gazette for Environmental Notices
environment = get_gazette_feed(categorycode = 18,
                               start_publish_date = "01/01/2020",
                               end_publish_date = "31/12/2020",
                               tidy = TRUE)

clean_air_2020 = environment %>% filter(notice_code == 1801) # filters on notice code for clean air
clean_air_2020b = environment %>% filter(title == "Clean Air") # or can filter on title of Clean Air

clean_air_2020_notice_ids = clean_air_2020 %>%
  pull(notice_id) # pulls the unique notice ids from that dataframe

clean_air_2020_notice_contents = get_content(clean_air_2020_notice_ids, search_terms = "x") 
# gets the content for the notices we are interested in
# search term can be particular words or combinations

# Can join this dataframe to the results of our get_gazette_feed so that the data is all in one dataframe
final_clean_air_2020 = left_join(clean_air_2020_notice_contents, clean_air_2020)
#> Joining, by = "notice_id"

# Show table
final_clean_air_2020[, 1:5] %>%
  knitr::kable(format = "markdown")
notice_id pub_date authority subtitle enabling_legislation
3674552 2020-11-12 NA ENVIRONMENT ACT 1995 DIRECTION
3670708 2020-11-06 Scottish Government Actions CLEANER AIR FOR SCOTLAND 2
3582637 2020-06-24 NA ENVIRONMENT ACT 1995 DIRECTION
3582636 2020-06-24 NA ENVIRONMENT ACT 1995 DIRECTION
3582578 2020-06-24 NA ENVIRONMENT ACT 1995 DIRECTION
3582577 2020-06-24 NA ENVIRONMENT ACT 1995 DIRECTION
3511129 2020-03-04 NA ENVIRONMENT ACT 1995 DIRECTION
3511127 2020-03-04 NA ENVIRONMENT ACT 1995 DIRECTION
3511128 2020-03-04 NA ENVIRONMENT ACT 1995 DIRECTION

Created on 2022-01-18 by the reprex package (v2.0.1)