desmarais-lab / govWebsites


some websites can't be downloaded anymore #8

Closed markusneumann closed 5 years ago

markusneumann commented 6 years ago

As noted previously, there are 30 websites on our list of URLs that SHOULD be downloadable, but for some reason aren't. I've now explicitly tried to focus only on these, but wget just can't get them. Notably, this includes Philadelphia and, much more importantly, New Orleans. NOLA was still downloadable back in the fall and contained quite a lot of content, so the question is: do we still include our old version of the website in the paper?
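
For reference, a sketch of the kind of recursive wget call this involves, wrapped in R via system2; the URL, flags, and output path here are illustrative assumptions, not necessarily our exact settings:

```r
# Illustrative only: mirror one city site with wget, called from R.
# Flags and paths are assumptions, not the project's actual configuration.
url <- "https://www.nola.gov"
system2("wget", args = c(
  "--recursive",                      # follow links within the site
  "--level=5",                        # cap recursion depth
  "--no-parent",                      # stay below the starting URL
  "--convert-links",                  # rewrite links for local browsing
  "--adjust-extension",               # add .html where appropriate
  "--wait=1",                         # pause between requests
  "--directory-prefix=websites/nola", # local output folder
  url
))
```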

bdesmarais commented 6 years ago

30 seems like a lot to just ignore, but we are limited in terms of replicability if we use the old files. Can you get files from those sites using a scheme like this?

https://stackoverflow.com/questions/33790052/download-all-files-from-a-folder-on-a-website
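
The scheme there boils down to finding all the links on a page and then fetching each one. Here is a minimal sketch of that idea in R with rvest; the URL and folder are hypothetical, and the linked answer itself may use different tools:

```r
# Minimal "list the links, then fetch them" sketch; the URL is hypothetical.
library(rvest)

page_url <- "https://www.example.gov/documents/"
page     <- read_html(page_url)

# Collect all hrefs on the page and resolve them against the page URL
links <- html_attr(html_nodes(page, "a"), "href")
links <- url_absolute(links, page_url)
links <- links[!is.na(links) & grepl("^https?://", links)]

# Download each link into a local folder
dir.create("downloads", showWarnings = FALSE)
for (u in links) {
  try(download.file(u, file.path("downloads", basename(u)), mode = "wb"),
      silent = TRUE)
}
```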




markusneumann commented 6 years ago

To be clear: of those 30 websites, only a few were previously downloadable; most of them are new or couldn't be downloaded before either. The issue title may be a bit misleading.

Also, the number is actually down to 28 now.

Some websites being impossible to download isn't really new; we have always had this problem. The only thing that confused me here is that NOLA was previously downloadable. Losing 28 out of 314 websites isn't actually too bad; I previously thought it might end up being a lot more.

markusneumann commented 6 years ago

Also, as far as I can tell, the method you linked is just a way to find all the links on a given page. I was already using that for much of the web scraping (such as getting city URLs from Wikipedia). In theory, yes, that would be the central component of any web crawler, but judging by the source code of rcrawler, for example, it would probably take quite a bit more than that (and a lot more time) to build one.

Don't get me wrong, I would love it if we could do all the website downloading in R, as it would solve a number of annoying problems, but if we want to get this paper ready in May, I don't think building our own web crawler is an option.
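
For context, a toy sketch of what a crawler built on top of link extraction would look like: a breadth-first loop restricted to one domain. This is purely illustrative, not anything we run; it leaves out everything a real crawler needs (respecting robots.txt, politeness delays, retries, error handling, parallel downloads), which is exactly why building one properly would take time:

```r
# Toy breadth-first crawler over same-domain links. Illustrative only;
# omits robots.txt, throttling, retries, and most error handling.
library(rvest)
library(xml2)

crawl <- function(start_url, max_pages = 100) {
  domain  <- url_parse(start_url)$server
  queue   <- start_url
  visited <- character(0)

  while (length(queue) > 0 && length(visited) < max_pages) {
    current <- queue[1]
    queue   <- queue[-1]
    if (current %in% visited) next

    page <- tryCatch(read_html(current), error = function(e) NULL)
    if (is.null(page)) next
    visited <- c(visited, current)

    links <- html_attr(html_nodes(page, "a"), "href")
    links <- url_absolute(links, current)
    links <- links[!is.na(links) & grepl("^https?://", links)]
    links <- links[url_parse(links)$server == domain]  # stay on this site
    queue <- unique(c(queue, setdiff(links, visited)))
  }
  visited
}
```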

markusneumann commented 5 years ago

Nothing we can do about this.