cssearcy / AYS-R-Coding-SPR-2020

Coding in R for Policy Analytics
https://cssearcy.github.io/AYS-R-Coding-SPR-2020/

Extracting PDF URLs from a webpage #30

Closed: aseyingbo1 closed this issue 3 years ago

aseyingbo1 commented 3 years ago

Hi everyone,

I am new to R and trying to extract PDF URLs from this website. My goal is to download the PDFs from 2017 to date.

https://dekalbcountyga.legistar.com/DepartmentDetail.aspx?ID=29350&GUID=A51C5572-654E-4DD9-A867-093BF2943C47&R=481ebb46-ce9d-4d0f-a1e2-a3e7895be9c2

Here is my code:

library(rvest)       # read_html(), html_nodes(), html_attr()
library(magrittr)    # pipe operator %>%

webpage_url <- "https://dekalbcountyga.legistar.com/DepartmentDetail.aspx?ID=29350&GUID=A51C5572-654E-4DD9-A867-093BF2943C47&R=481ebb46-ce9d-4d0f-a1e2-a3e7895be9c2"
webpage <- read_html(webpage_url)

# Selector found with SelectorGadget
link <- webpage %>%
  html_nodes("td:nth-child(8) a") %>%
  html_attr("href")

This returns NA NA. Could someone kindly assist?

Thank you.

@jamisoncrawford @lecy

jamisoncrawford commented 3 years ago

Hi @aseyingbo1!

Which PDF are you trying to extract? The link to the agenda?

You'll want to use the download URL itself so that it can be read into R directly. Right-click the link and select "Copy link address", like so:

[Screenshot: right-clicking the agenda link and selecting "Copy link address"]

Here is the URL to download the PDF:

https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA

A good package for this is pdftools. Try the following code to read the PDF into R. The result will be unformatted, somewhat messy plain text, so you'll need some string/text manipulation to clean it up!

install.packages("pdftools")
library(pdftools)

url <- "https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA"

# pdf_text() downloads the file and returns one character string per page
agenda <- pdf_text(url)
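
If it helps, here's a quick way to start inspecting the result (just a sketch): each element of agenda is one page, so splitting on newlines gives you line-by-line text to clean up.

pages <- strsplit(agenda, "\n")   # list with one character vector of lines per page
head(pages[[1]])                  # first few lines of the first page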

Let me know if this works for you!

aseyingbo1 commented 3 years ago

Thank you very much @jamisoncrawford. I was trying to extract the 'Minute Summary' PDFs from 2017 to date. I could do it for a single PDF by copying and pasting the PDF URL and then using the download.file function in base R, but I struggled to locate the target node for all of the hrefs.
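
For reference, the single-PDF version looks roughly like this (the destination file name is just a placeholder):

url <- "https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA"
download.file(url, destfile = "minutes.pdf", mode = "wb")   # mode = "wb" keeps the PDF binary intact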

What I would love to do is extract all the PDF URLs into a data frame, but it seems web scraping an .aspx site presents some unique challenges that I am having trouble with.
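
Roughly, what I am aiming for is something like this (the "View.ashx" pattern is just a guess based on the agenda link above), but I have not managed to pull the hrefs out of the page:

all_links <- webpage %>%
  html_nodes("a") %>%     # grab every link on the page
  html_attr("href")

pdf_links <- data.frame(href = all_links[grepl("View.ashx", all_links, fixed = TRUE)])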

Thanks for your help.