desmarais-lab / govWebsites


progress report #3

Closed bdesmarais closed 5 years ago

bdesmarais commented 6 years ago

We should be able to add many more cities to our data---casting a national web---if we can analyze local campaign contribution data to classify the partisanship of mayors. This would broaden the generalizability of our results.

markusneumann commented 6 years ago

NY has 62 cities; of these, the mayor's partisanship can be determined through campaign finance data for 18 Democratic and 7 Republican cities.

OR does have the necessary data, but it can only be accessed by searching committee names for 'mayor' (otherwise the system covers only statewide offices).

WA has an amazing website; partisan contributions are hard to find there, but they do seem to exist.

Other states I checked without success: CA, AZ, WY, UT, ID, SD, MT, AL, KS, TN, ND, NE, CO, IL

markusneumann commented 6 years ago

Right now, we need to decide whether the open, search-based approach in Oregon is valid. The problem is that committees for mayoral candidates don't always follow the same naming patterns. A lot of the time it is 'X for mayor', which is sufficient for our purpose. Another very common name, however, is 'Friends of X', and there are other patterns as well that don't indicate whether someone is a candidate for mayor. Consequently, it will be very difficult to compile a complete list of mayoral candidates in Oregon.

In the meantime, I will scrape the websites of NY cities and the WA campaign finance website to see whether there are enough party-based contributions for the method to work in that state.

bdesmarais commented 6 years ago

In OR, we don't need a list of mayoral candidates---just those who won. Can you find a secondary listing of mayors (perhaps on the cities' Wikipedia pages)?

markusneumann commented 6 years ago

Finding a secondary listing of mayors worked fine: out of 241 cities, I could find the names of 177 mayors whose cities also had websites. Unfortunately, campaign finance data was available for only 32 of them; of these, 19 had non-empty data, and of those, 5 were Democrats, 14 were non-partisan, and 0 were Republicans.

bdesmarais commented 6 years ago

This site lists 22 Democratic mayors in Oregon: http://dpo.org/elected-officials. I couldn't find a similar listing for Republicans, but if we can find lists like this for Oregon or other states, we could use them to add officials.

markusneumann commented 6 years ago

I've looked around a bit, but I couldn't find anything similar. Comparable websites for other states do exist, but they only ever list state-level officials. Also, the website for the Oregon Democrats may not be 100% reliable, because two mayors are listed for Portland, one of whom is the previous officeholder (although, granted, that doesn't make the data unusable).

That said, I have found a way to increase our sample size: when I scraped the city URLs for Washington, I wrote a more thorough algorithm that tries to get the city website URL from Wikipedia in three different ways (if one fails, it tries the next one, and so on). I applied this to Indiana, Louisiana and New York, and it turned out to be unexpectedly effective.
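For illustration, here is a minimal sketch of that kind of fallback lookup, assuming the rvest package; the selectors and the function name are hypothetical and may not match every city article:

```r
library(rvest)

# Hypothetical sketch: try several places on a city's Wikipedia page and
# return the first official-website URL that can be found.
get_city_url <- function(wiki_url) {
  page <- tryCatch(read_html(wiki_url), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)

  # 1. Infobox row labelled "Website"
  url <- html_attr(html_element(
    page,
    xpath = "//table[contains(@class,'infobox')]//th[contains(.,'Website')]/following::td[1]//a"),
    "href")
  if (!is.na(url)) return(url)

  # 2. "Official website" entry in the External links section
  url <- html_attr(html_element(
    page,
    xpath = "//li[contains(.,'Official website')]//a[contains(@class,'external')]"),
    "href")
  if (!is.na(url)) return(url)

  # 3. Last resort: the first external link on the page
  html_attr(html_element(page, "a.external"), "href")
}
```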

Furthermore, I have now also scraped mayoral partisanship from Wikipedia for IN, LA and NY. For New York, this proved particularly effective (although it means we need to treat Wikipedia as a sufficiently trustworthy data source---which, after all the manual checking I did, I think it is).

Finally, rather than dismissing errors arising from this process as missing data, I went back and corrected them by hand.

As a result of all this, the number of URLs we have available for LA and NY has roughly doubled, and for IN, it has tripled.

I have combined this with the data for the 100 largest cities in the US (from Ballotpedia) as well as the GSA .gov URLs. This means we now have one dataset that contains all of this information (rather than, as before, doing everything separately for IN and LA). Here are summary statistics for the number of cities for which we have both the mayor's party and the city website URL (note that these are upper bounds, because not all URLs can actually be scraped by wget, be it because they rely on JavaScript, no longer work, etc.):

| State | Democratic | Republican | Total |
| --- | --- | --- | --- |
| Indiana | 49 | 59 | 108 |
| Louisiana | 36 | 21 | 57 |
| New York | 36 | 16 | 52 |
| Other | 56 | 28 | 84 |
| Washington | 11 | 2 | 13 |
| Total | 188 | 126 | 314 |

For comparison, the version of the paper we presented at SPSA had 48 cities in total.

I've also added this table to the paper, as well as another that describes the number of cities per state. Notably, we have 15 cities in California (from the big-cities data), which, together with WA, should solve our West Coast problem.

My next step will be to get the GPS coordinates for all cities from Wikipedia and then plot everything on a map. (I have also gathered the links to the 100 biggest cities' Wikipedia pages; we already had all the other ones.)
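As a rough sketch of that step, assuming the rvest package and Wikipedia's geo microformat (which stores coordinates as "lat; lon" in a span with class "geo"); the function name is hypothetical:

```r
library(rvest)

# Hedged sketch: pull decimal coordinates from a city's Wikipedia page.
get_city_coords <- function(wiki_url) {
  page <- read_html(wiki_url)
  geo <- html_text(html_element(page, "span.geo"))   # e.g. "40.71; -74.01"
  if (is.na(geo)) return(c(lat = NA_real_, lon = NA_real_))
  parts <- as.numeric(strsplit(geo, ";")[[1]])
  c(lat = parts[1], lon = parts[2])
}
```

Plotting could then be as simple as drawing a state map (e.g. with the maps package) and adding points at the resulting longitude/latitude pairs.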

bdesmarais commented 6 years ago

It is impressive how much you have been able to scale up this data over the last two months. We can proceed with this dataset for our first round of submission. Reviewers may comment on the fact that this is still sort of a convenience sample, but we could either re-weight to match census-based city attributes or use this data as training data for some other classifier. I think we should analyze all of the data with a single structural topic model, including state fixed effects.



markusneumann commented 6 years ago

Before we get to that, we need to download all the (remaining) websites first. On that matter, I am wondering whether it would make more sense to download everything again (i.e., even the IN and LA websites we already have), since it is somewhat inconsistent to have some websites from half a year ago and some from now.

And as noted above, the numbers shown in the table are upper bounds. It is fairly common for websites not to download because of JavaScript, so I expect the final number of websites in our dataset to be around 200-250.

bdesmarais commented 6 years ago

It's a good idea to have it all gathered around the same time, and with the same script. Will the computing resources currently available in the office and/or on ICS be workable for this?



markusneumann commented 6 years ago

Should be fine. I haven't used one of the two 3TB hard drives on the computer yet, so there is still plenty of space. I've started the download, with everything downloading in parallel. It's going at a rate of about 100mb/s, which seems to be some kind of limit (when I downloaded only 10 cities in parallel, it used only 10-20mb/s, and the 100 biggest cities peaked at about 80mb/s, and we now have roughly three times as many cities). Download speed may not be the greatest impediment anyway, since wget also has to crawl around the websites, find links, check whether it has already downloaded a specific file, etc. Still, I would expect it to take at least 4-5 days.
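For reference, a hedged sketch of what this kind of parallel wget run could look like when driven from R; the flag set, directory layout, core count, and the city_urls vector are assumptions, not the project's actual script:

```r
library(parallel)

# Mirror one city's website into its own directory via wget.
download_city <- function(url, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  system2("wget",
          c("--recursive", "--level=inf", "--no-parent",
            "--adjust-extension", "--convert-links",
            paste0("--directory-prefix=", out_dir), url),
          stdout = FALSE, stderr = FALSE)
}

# city_urls: named character vector of city website URLs (assumed to exist).
mclapply(names(city_urls), function(city) {
  download_city(city_urls[[city]], file.path("websites", city))
}, mc.cores = 8)
```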

markusneumann commented 6 years ago

The download process with wget still isn't done completely, but it has slowed down to the point where I think only a few files are left. Consequently I've started analyzing what we've got so far.

As expected, not all cities downloaded properly. Out of 314 cities, 10 failed entirely and another 74 downloaded fewer than 5 files, which I'm also taking as a sign of failure. I've now used Selenium to visit every one of the URLs in a browser and check which URL the website redirects to. Using these new URLs, I will then retry wget for the sites that didn't work.

Part of the problem seems to be that wget struggles when 'www' is in our record of a city's URL even though the actual URL doesn't contain it. When such an address is entered into a browser, the browser fixes it by itself---hence the solution above.
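A minimal sketch of the redirect check with RSelenium; the browser choice, the sleep interval, and the failed_urls vector are assumptions:

```r
library(RSelenium)

# Start a browser session; the browser resolves 'www' mismatches and
# redirects, and getCurrentUrl() reports the address it ends up at.
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

resolve_url <- function(url) {
  remDr$navigate(url)
  Sys.sleep(3)                      # give redirects time to complete
  remDr$getCurrentUrl()[[1]]
}

# failed_urls: character vector of the sites wget could not mirror (assumed).
resolved <- vapply(failed_urls, resolve_url, character(1))

remDr$close()
```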

So far we've got 1,285,449 files, but since the missing sites include some of the larger cities, that number will likely increase quite a bit more. On average, a city has about 5,662 files, with a standard deviation of 10,429---a heavily right-skewed distribution.

So, the plan for the next week or so:

  1. Re-do the download of the cities that didn't get downloaded.
  2. Compress everything into a zip file. Instead of one large one, I might make one for each city; I'm still testing that (this is related to this issue: https://github.com/desmarais-lab/govWebsites/issues/5).
  3. Fix file extensions & convert everything to text
  4. Preprocessing
  5. Structural topic model

For 3 & 4 I will check if I can find any way to optimize the code because speed really matters now. I haven't done much with the >1 million files so far, but everything is now taking minutes rather than seconds, so it will get MUCH worse when we actually get to the computationally expensive part.

For 5 this is a concern as well. Furthermore, I am skeptical whether state fixed effects will work because the stm package always has problems when the model gets too complex, and I wouldn't be surprised if fixed effects end up triggering that error.

bdesmarais commented 6 years ago

Hi Markus,

Thanks for the update. Can we talk this through with the lab tomorrow? If you have time, would you put together a few slides (nothing polished) that summarize where we are at now and what we need to do going forward?

For 3&4, how many cores would we need to run in parallel to get the timing down to what it was in previous iterations?

For 5, we should test the feasibility with a small sample from the processed dataset.

-Bruce



markusneumann commented 6 years ago

Downloading the data (about 1.6 million files in total) is done. So is compressing them into a tar.gz file (about 900GB).

The next step, however, isn't working so well. Since some files have the wrong extension, I need to read the first (non-empty) line of each file and then, based on that, decide what the correct file type is. Unfortunately, doing this for 1.6 million files takes forever.

I've tried readLines(x, n = 1) and readChar(x, 100, useBytes = TRUE), which, according to this post (https://www.r-bloggers.com/faster-files-in-r/), are the two fastest ways to read in text. It still takes hours, even when parallelized.
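For concreteness, a sketch of this first-pass type check; the directory name and the html/pdf heuristics are illustrative assumptions, not the full set of rules:

```r
library(pbapply)

files <- list.files("websites", recursive = TRUE, full.names = TRUE)

# Guess a file's type from its first ~100 bytes.
guess_type <- function(path) {
  first <- tryCatch(readChar(path, nchars = 100, useBytes = TRUE),
                    error = function(e) "")
  if (length(first) == 0 || is.na(first)) return("other")
  if (grepl("<!DOCTYPE html|<html", first, ignore.case = TRUE)) return("html")
  if (startsWith(first, "%PDF")) return("pdf")
  "other"
}

types <- pbsapply(files, guess_type)   # progress bar + time estimate
```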

What's even stranger is that if I only run it on the first 500,000 texts, it takes about 30 seconds. The next 500,000, however, are projected to take hours (the pbapply package adds a progress bar and time estimate). I've looked at the length of the texts, and the first 500,000 don't seem to be that much shorter. I've also tried shuffling the data, and then the first 500k take hours as well---so there seem to be some files in there somewhere that are causing problems.

Also, I'm not sure it is even a resource problem. While this is running, only about 5% of the RAM is used, and only about 5% of each CPU thread. So maybe this is limited by the hard drive's read speed, which would also mean that parallelization is pointless.

And of course all of this is only for reading the first line (or first 100 bytes) of a file. Later, we obviously have to read in the whole file, and at the current pace, that would take days.

markusneumann commented 6 years ago

Problem solved. Python got it done in 2 hours, whereas R couldn't do it in 12.

bdesmarais commented 6 years ago

Glad you found a solution!



markusneumann commented 6 years ago

The Python solution worked, and it was clearly faster than R, but it caused problems with the next step (getting everything back into R and actually converting the files).

So rather than manually opening every file and looking at the first line, we are now using the Unix library libmagic, which is designed to detect file types. There is an R interface to it (the wand package), so we don't have to use Python, and the whole thing is done a bit more professionally than my ad hoc solution.

Therefore, converting all the files to their correct format is done.
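As a rough illustration of the same idea (not the wand-based code itself), libmagic can also be queried from R by shelling out to the file(1) utility; the example path is hypothetical:

```r
# Illustrative only: ask libmagic (via the `file` CLI) for a file's MIME type;
# the project uses the wand package, which wraps the same library in R.
detect_mime <- function(path) {
  system2("file", c("--mime-type", "-b", shQuote(path)), stdout = TRUE)
}

detect_mime("websites/some_city/index")   # e.g. "text/html", even without an extension
```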

The next step, then, is to parse the files and convert them all to text. We are using Kenneth Benoit's readtext package for that purpose, which is already parallelized, so it should (in theory) be fast, and it worked just fine when we only had ~25,000 files. I ran it on the 1.6 million files, but after 2-3 days it still hadn't produced anything. So now I've chunked the input into batches of 10,000. The first 100,000 files went fine, but now it seems to have run into trouble. I'll provide another update once I find out what the problem is.

If it was working correctly, judging by the rate at which it was going so far, I'd estimate it would take about another day or two to finish.

The next step after that would then be to do the actual text preprocessing.

markusneumann commented 6 years ago

Okay, so the readtext package runs into trouble every couple thousand documents. The cause can be unreadable filenames, password-protected PDFs, or some other bizarre problem. One time I even got a segfault.

Since this was stopping the script every 30 minutes or so and required me to manually go in, find the offending document (not that simple, since we are processing the documents in chunks), and remove it, getting through all 1.6 million documents this way was going to take ages.

So now I've implemented some simple error handling. The downside is that before, the readtext call was vectorized, whereas now I am feeding in every document individually. To compensate, I have parallelized it with the doParallel package and foreach loops; for some reason, the way I normally do parallel computing in R (the parallel and pbapply packages) performed poorly here---considerably slower than no parallelization at all. Even so, the current implementation seems to be about as fast as what we had before would have been, had it worked correctly. Note that the 1-2 day estimate from before was probably a lower bound, because the first 100,000 documents (on which I based that figure) are, on average, considerably smaller than the rest. Processing time fortunately doesn't scale linearly with document size, but it is affected to some degree. Consequently, I wouldn't be surprised if the whole thing takes 3-4 days instead---and that is if it runs without any further problems.
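A minimal sketch of that error-handling pattern, assuming a vector all_files of document paths and an arbitrary core count:

```r
library(doParallel)
library(readtext)

registerDoParallel(cores = 8)

# Read each document individually so a single corrupt file only costs one
# iteration instead of killing the whole (previously vectorized) call.
texts <- foreach(f = all_files, .combine = rbind,
                 .packages = "readtext") %dopar% {
  tryCatch(readtext(f), error = function(e) NULL)   # skip unreadable files
}

stopImplicitCluster()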

markusneumann commented 6 years ago

Preprocessing is now almost done, with only 5 cities to go.

The only part I didn't do is lemmatization, because it is a major performance bottleneck. So far we've been using spaCy in Python, which is pretty quick but doesn't fit well with the current scheme of processing each city individually (for memory reasons). I've tried the new R wrapper for spaCy (by Benoit and colleagues), but it doesn't seem to be quite as fast and, unlike the Python version, only seems to use one core.
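For reference, the spacyr route looks roughly like this; the model name and the city_texts object are assumptions:

```r
library(spacyr)
spacy_initialize(model = "en")   # spaCy English model; name is an assumption

# city_texts: a named character vector of one city's documents (assumed)
parsed <- spacy_parse(city_texts, lemma = TRUE, pos = FALSE, entity = FALSE)

# Reassemble lemmatized documents from the token-level output
lemmas <- tapply(parsed$lemma, parsed$doc_id, paste, collapse = " ")

spacy_finalize()
```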

Also, R seems to have some difficulty releasing memory after it is done with something, so at the moment the script needs to be restarted 2-3 times when R runs out of memory.

Even so, preprocessing is pretty quick now compared to before. This is down to (a) rewriting everything in quanteda and (b) replacing the hash-table method we implemented last year: I realized that the regular table() function in R works just as well and is much faster (maybe I should talk this over with Frido, since it confuses me a little). This means we can probably experiment a bit with the threshold for duplicate-content removal.
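A hedged sketch of that counting step; the object names and the threshold of 10 are illustrative assumptions:

```r
# doc_lines: a list with one character vector of lines per page (assumed).
# Lines that recur across many pages are almost always navigation/boilerplate,
# so count every line across the corpus and drop the frequent ones.
line_counts <- table(unlist(doc_lines))
boilerplate <- names(line_counts)[line_counts > 10]   # threshold is a guess
doc_lines_clean <- lapply(doc_lines, function(x) x[!x %in% boilerplate])
```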

On that matter: I haven't done the simulation yet, because I'm not quite sure how - Bruce, what do you mean by 'classify as substantive'?

bdesmarais commented 6 years ago

Sounds good overall, Markus. A couple things...

  1. By "substantive" I mean that a line conveys information about the services and/or policies of the city government, as opposed to text that merely identifies the city. Essentially, we need to classify lines that we would like to keep in the data and lines that we would like to omit from the data.

  2. Regarding the spaCy wrapper: when functions that call other languages have memory-management issues, I find that running gc() after the call cleans up the memory. Can you use parallelApply() to hyper-thread the function?

-Bruce



markusneumann commented 5 years ago

Old; the remaining issues with spaCy we had at the time have long since been resolved by spacyr.