HertieDataScience / SyllabusAndLectures

Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
MIT License

Downloading a file from the internet - link on the page is to another page that redirects in a complicated way #43

mcallaghan opened this issue 8 years ago

mcallaghan commented 8 years ago

Hello all,

I'm having trouble downloading files from the UK neighbourhood statistics page - here for example.

If I copy the download link location and then use that link in R to download the zipped files, all I get back is a small webpage that checks browser capabilities and then presumably redirects. I've tried various download utilities, and tried to construct an alternate link by interpreting what that webpage is trying to do.


library(RCurl)
library(downloader)

temp <- tempfile()

# The download link copied from the page
link <- "http://www.neighbourhood.statistics.gov.uk/dissemination/DownloadData.zip?$ph=60_61_65&step=6&downloadLargeFile=true&fileIndex=1"

# Follow any redirects and inspect what actually comes back
getURI(link, .opts = curlOptions(followlocation = TRUE))

# An alternate link reconstructed from what the checker page appears to build
link2 <- "http://www.neighbourhood.statistics.gov.uk/dissemination/DownloadData.zip?$ph=60_61_65&step=6&downloadLargeFile=true&fileIndex=1&nsjs=false&nsck=false&nssvg=false"

download.file(link, temp, mode = "wb")
download.file(link2, temp, mode = "wb")

download(link, temp)

postForm(link)

When I look in my tmp directory, or see the output of the curl requests, I just see this webpage written by Neil Sillitoe at the ONS to check my browser settings.

I've even tried monitoring the HTTP traffic with httr, but can't construct a better link from that.

Obviously, if I just click the link I can unzip the file and work with it from there, but then it isn't all automated and that would make me very sad. I would also have to do that for every indicator for every year.

Has anyone else encountered a similar problem, or found a solution? An internet search just brings me to a question answered by someone calling the site "a wonderful example of how not to do web design".

christophergandrud commented 8 years ago

"a wonderful example of how not to do web design"

This is definitely true, and it relates directly to what I was talking about in the lecture: practitioners often don't have a good sense of how to make their data easily available.

If you really want to automate the collection of this data you will probably need to go down the httr path. Here is a (kind of) related example.

The URL you have been working with basically posts a query to whatever program they have repackaging and zipping the data. That program then sends the results back to you. The values in the query are everything after DownloadData.zip?. That said, I gave it a quick go and haven't yet managed to make the query work programmatically.
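For what it's worth, here is a sketch of what the httr approach might look like. This is not a tested solution: it assumes the browser-checker page sets session cookies that a repeated request can reuse (httr keeps cookies on a per-host handle), and the parameter names are simply the ones visible in the original URL.

```r
library(httr)

# The query string after "DownloadData.zip?" decomposes into name=value pairs
base_url <- "http://www.neighbourhood.statistics.gov.uk/dissemination/DownloadData.zip"
params <- list(
  `$ph`             = "60_61_65",
  step              = "6",
  downloadLargeFile = "true",
  fileIndex         = "1"
)

# First request likely returns the browser-checker page, but may set cookies;
# httr reuses the same handle (and its cookies) for later requests to this host
resp1 <- GET(base_url, query = params)

# Retry, hoping the stored cookies satisfy the checker and we get the zip back
resp2 <- GET(base_url, query = params)

# Only write the body to disk if it really is a zip file, not more HTML
if (http_type(resp2) == "application/zip") {
  writeBin(content(resp2, "raw"), "data.zip")
}
```

If the checker page relies on JavaScript rather than cookies, this won't be enough, and something like RSelenium driving a real browser may be the only fully automated route.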

If worst comes to worst, you can just download the data in the point-and-click manner, fully document the source, and store the files in your repo.

Hope that helps!