Closed aseyingbo1 closed 3 years ago
Hi @aseyingbo1!
Which PDF are you trying to extract? The link to the agenda?
You'll want to use the download URL itself so that it can be read directly into R. Right-click on the link and select "Copy link address".
Here is the URL to download the PDF:
https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA
A good package for this is pdftools. Try the following code to read the PDF into R. The result will be unformatted, plain text, and somewhat messy, so you'll need some string/text manipulation to clean it up:
```r
install.packages("pdftools")
library(pdftools)

url    <- "https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA"
agenda <- pdf_text(url)   # returns one character string per page
```
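For the cleanup step, a minimal base-R sketch might look like the following. The sample string below is a hypothetical stand-in for `pdf_text()` output, not real agenda text:

```r
# pdf_text() returns one long string per page; split it into lines and tidy.
raw   <- "MEETING AGENDA\n\n  Item 1.   Call to order\n  Item 2.   Roll call\n"  # stand-in text
lines <- unlist(strsplit(raw, "\n"))   # one element per line
lines <- trimws(lines)                 # strip leading/trailing whitespace
lines <- lines[lines != ""]            # drop blank lines
lines <- gsub("\\s{2,}", " ", lines)   # collapse runs of spaces
```

From here, `grepl()` or the stringr package can pick out the agenda items you care about.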
Let me know if this works for you!
Thank you very much @jamisoncrawford. I was trying to extract the 'Minute Summary' PDFs from 2017 to date. I could do it for a single PDF by copying and pasting the PDF URL and then using the download.file function in base R, but I struggled to locate the target node for all the hrefs.
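For reference, the single-PDF step with base R might look like this (the destination filename is hypothetical, and the download is wrapped in tryCatch in case the network is unavailable):

```r
# Download one PDF with base R (URL from the reply above).
url  <- "https://dekalbcountyga.legistar.com/View.ashx?M=A&ID=751271&GUID=31C18E46-AF97-4F40-B497-1FFA1CBE55FA"
dest <- file.path(tempdir(), "minute_summary.pdf")   # hypothetical destination
ok   <- tryCatch({
  download.file(url, destfile = dest, mode = "wb")   # "wb" avoids corrupting binaries on Windows
  TRUE
}, error = function(e) FALSE)                        # FALSE if the download fails
```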
What I'd love to do is extract all the PDF URLs into a dataframe, but it seems web-scraping an .aspx site presents some unique challenges I'm having issues with.
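For what it's worth, here is a rough sketch of the href-extraction step using base-R regex on a toy HTML snippet. The snippet is a hypothetical stand-in for the real page source; the actual Legistar page is ASP.NET, so links loaded via postbacks may require rvest with a session, RSelenium, or the Legistar web API instead:

```r
# Toy stand-in for the page source (e.g. what readLines(dept_url) might return).
html <- paste(
  '<a href="View.ashx?M=M&ID=1&GUID=abc">Minutes 2017</a>',
  '<a href="DepartmentDetail.aspx?ID=2">Department</a>',
  sep = "\n"
)

hrefs <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]  # pull out href="..."
hrefs <- gsub('href="|"', "", hrefs)                            # strip the wrapper
pdfs  <- hrefs[grepl("View.ashx", hrefs, fixed = TRUE)]         # keep only document links
urls  <- data.frame(url = paste0("https://dekalbcountyga.legistar.com/", pdfs))
```

With rvest, the equivalent node selection would be `html_attr(html_elements(page, "a"), "href")`, but only for links present in the served HTML.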
Thanks for your help.
Hi everyone,
I am new to R and trying to extract PDF URLs from this website, with the goal of downloading the PDFs from 2017 to date:
https://dekalbcountyga.legistar.com/DepartmentDetail.aspx?ID=29350&GUID=A51C5572-654E-4DD9-A867-093BF2943C47&R=481ebb46-ce9d-4d0f-a1e2-a3e7895be9c2
Here is my code:
I got NA NA as the result. Could someone kindly assist?
Thank you.
@jamisoncrawford @lecy