Data4Democracy / just-politics

Identifying vulnerable house and senate seats in the 2018 midterm elections

List of county-specific Facebook pages #8

Open gati opened 7 years ago

gati commented 7 years ago

This is an unglamorous but important task. With a list of county-level Facebook pages we can collect the comments from those pages, and then inspect the language in those comments using some techniques that are pretty good at revealing sentiment on key campaign issues.

Our hypothesis is that understanding this "hyper-local" sentiment will be important in predicting which congressional districts could make surprising switches from one party to another.

justforwhimsy commented 7 years ago

I can start by looking into the 36 counties in Oregon. I took a look at the local sheriff of one county and realized that Facebook has a recommendation for similar groups which pointed to other organizations in the county. Perhaps there is something in the Facebook API that can collect a bunch of related pages that would make it easier to collect info from all counties in the US. I'm looking into it, but I am definitely a beginner at it.

gati commented 7 years ago

Thanks @justforwhimsy for taking a look at this! That's a great starting point, and that's a good idea to look into the "related pages" that FB already generates. Also, if there are things every county has (like a name, a sheriff), and we could get a list of those things (like the list of every county sheriff in the country) we might be able to use the /search API endpoint (https://developers.facebook.com/docs/graph-api/using-graph-api#search).

justforwhimsy commented 7 years ago

I'm out of town this weekend, but here are some of my preliminary thoughts after manually combing through some Oregon county Facebook pages.

justforwhimsy commented 7 years ago

I made a CSV containing every county and its state in the U.S., taken from Wikipedia's article. Apologies for wrapping a CSV in a text file to upload it, but I wanted to make it available in case anyone wanted to use it this weekend. us_counties.csv.txt

gati commented 7 years ago

Thanks @justforwhimsy for looking into this! The list of counties will definitely be helpful, and I appreciate the investigation into existing FB pages in OR.

I think the answer to all your questions is "yes." :) I should point out that we already have some code that'll retrieve FB page posts from a list of pages, along with comments + interactions on those posts. So once we have our list, it'll be easy to flip that on. As for monitoring - I think that might be important, but the first step is to figure out what social media activity (if any) is a signal for voter support. So we'll get historical content for pages that aren't active anymore, and then if it turns out some feature from the language or social activity is informative, we can start monitoring.

Just let me know when you're ready to start collecting historical data and I can walk you through how the collect-social module (https://github.com/data4democracy/collect-social) works with FB.

Thank you again for digging into this!

Phinneas commented 7 years ago

I haven't had much of a chance this weekend to look deeply into this. Hopefully this week will give me more time to do so. Cheers

justforwhimsy commented 7 years ago

I made a collection of the FB pages in Oregon and collected a few relevant Twitter pages when possible. It's currently living in a Google Spreadsheet. A site that I found useful when searching for different county pages was Suburban Stats, which lists the population of each county in Oregon. It's much more difficult to find a relevant FB page for a county with only 1500 people, but if they have around 30k, there tends to be a lot of different pages to pull from.

I looked into the Facebook API to see if there was a way to pull the "People Also Like" page suggestions, but I couldn't find anything. I was planning to look into scraping the page instead.

I took a look at the collect-social module and noticed that it is currently inactive and development has moved to smtk. Should we use that instead to collect data?

justforwhimsy commented 7 years ago

Welp, it looks like Facebook doesn't want you to scrape their pages. Any other ideas?

Phinneas commented 7 years ago

Maybe we could just focus on what we can do with Twitter data at this point. I can ask some of my connections at Facebook if there is a workaround, but for now maybe just work on the Twitter stuff. How does that sound?

justforwhimsy commented 7 years ago

I wouldn't mind looking into what we can do on the Twitter side.

The only other thought I had for Facebook is to search for pages in a particular county and then check which results have an address in the state that the county is in.
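A rough sketch of that state check, assuming each search result is a dict with an optional `location` object that carries a two-letter `state` value. The function name `pages_in_state` and the sample records are made up for illustration, and real results may omit the location entirely or spell the state out in full.

```python
def pages_in_state(pages, state_abbr):
    """Keep only the pages whose location.state matches state_abbr."""
    wanted = state_abbr.strip().upper()
    kept = []
    for page in pages:
        # Assumption: location is optional, so fall back to an empty dict.
        location = page.get("location") or {}
        state = (location.get("state") or "").strip().upper()
        if state == wanted:
            kept.append(page)
    return kept


# Hypothetical results for a "clark county" search (not real data).
sample = [
    {"name": "Clark County Sheriff", "location": {"city": "Vancouver", "state": "WA"}},
    {"name": "Clark County Fair", "location": {"city": "Las Vegas", "state": "NV"}},
    {"name": "Clark County Fans"},  # no address at all
]

washington_pages = pages_in_state(sample, "wa")
```

Pages with no address are dropped here, which is the conservative choice; they could also be set aside for manual review.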

Something I would like to nail down before we collect more accounts from Facebook or Twitter is what the ideal number of accounts for each county would be and how we would want to store the data. Any thoughts? I was putting information in a Google spreadsheet, but if we're collecting multiple accounts in each county for both Twitter and Facebook, we should probably keep them in separate docs.

justforwhimsy commented 7 years ago

I did just poke around with the Facebook Graph API Explorer and it looks like we can use queries like search?q="clark county"&type=page to get pages related to certain counties.
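For anyone who wants to script that query, here is a minimal sketch that only builds the URL and does not call the API. The base URL, the `access_token` parameter value, and the helper name `build_page_search_url` are assumptions for illustration; the `q` and `type=page` parameters come from the query above.

```python
from urllib.parse import urlencode

# Assumption: unversioned Graph API base URL.
GRAPH_BASE = "https://graph.facebook.com"


def build_page_search_url(county, access_token):
    """Return a search URL for pages matching a quoted county name."""
    params = {
        "q": f'"{county}"',   # quoted phrase, as in the Explorer query
        "type": "page",
        "access_token": access_token,
    }
    return f"{GRAPH_BASE}/search?{urlencode(params)}"


# "YOUR_TOKEN" is a placeholder, not a real token.
url = build_page_search_url("clark county", "YOUR_TOKEN")
```

Looping this over every row of the county CSV would give one search URL per county, which keeps the URL building separate from whatever code does the actual requests and rate limiting.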

justforwhimsy commented 7 years ago

Something else that is relevant for both the Twitter and Facebook data is what exactly we mean by a county. I was reading some more about counties because I realized the number of rows in my CSV was smaller than the number of reported counties. It looks like several states have cities that do not fall within any county, so they're counted as counties in their own right.

For example, Virginia has 38 independent cities, while Rhode Island has 5 counties that don't have any governmental functions. Should we try to collect accounts for the independent cities too, @gati?

gati commented 7 years ago

Sorry I've been away for a bit. Thanks for keeping after this @justforwhimsy and @phinneas. Some thoughts based on what came up in the thread:

smtk is being rolled into a larger ETL project, but collect-social is still a good starting point for quickly grabbing some FB or Twitter data and throwing it into a database, so that's probably a good place to start for pages we already know about.

Re scraping - yes it's definitely difficult to scrape Facebook pages. If it looks like the only way to do this is manually searching for each county name, then maybe we can pay for Mechanical Turk. I'll look into this option.

About the county question, I think it's ok that we have both city and county pages. We can normalize back to the American Community Survey definition of "county" after the fact. :)

gati commented 7 years ago

Sorry again for the slow response. Work got the better of me last week! Thanks again for working on this. Many, many groups are going to get a ton of value out of this as people start to look seriously at 2018.

justforwhimsy commented 7 years ago

@gati and @Phinneas I wrote something up in Python for the Facebook pages. I wasn't confident in committing this to just-politics, but I would be happy to add it here if others want to help out. It can be found here: https://github.com/justforwhimsy/d4d_scripts

public_county_pages_fb_pages.sql contains several counties collected via the get_related_county_accounts.py script. I added information to the README with some limitations.

I think a similar approach could be taken with the Twitter accounts, but I found in researching the Twitter API that most accounts that list a location at all only give a state.

I'm volunteering until the 22nd and then my computer access will be limited until July 3rd, but I should be able to answer questions in the evening if there are any.

justforwhimsy commented 7 years ago

If anyone else is working on this, I found a csv containing the different cities/zip codes by counties. We could import that into a database and use that instead of Google's API which has a daily request limitation. Would there be a better tool to use for comparing the csv? It's a little over 42k rows. https://www.unitedstateszipcodes.org/zip-code-database/
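One low-effort option for a file of that size is SQLite from Python's standard library. The sketch below is only an illustration: the column names `zip`, `primary_city`, `state`, and `county` are assumptions about the CSV header, the table name `zip_county` and both helper functions are invented here, and the two sample rows stand in for the real file.

```python
import csv
import io
import sqlite3


def load_zip_table(conn, csv_file):
    """Load ZIP rows from an open CSV file into a lookup table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zip_county ("
        "zip TEXT PRIMARY KEY, city TEXT, state TEXT, county TEXT)"
    )
    reader = csv.DictReader(csv_file)
    # Assumption: these four column names exist in the CSV header.
    rows = [(r["zip"], r["primary_city"], r["state"], r["county"]) for r in reader]
    conn.executemany("INSERT OR REPLACE INTO zip_county VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def county_for_zip(conn, zip_code):
    """Return (county, state) for a ZIP code, or None if it is unknown."""
    return conn.execute(
        "SELECT county, state FROM zip_county WHERE zip = ?", (zip_code,)
    ).fetchone()


# Stand-in data; with the real file, pass open(path, newline="") instead.
sample_csv = io.StringIO(
    "zip,primary_city,state,county\n"
    "97201,Portland,OR,Multnomah County\n"
    "98660,Vancouver,WA,Clark County\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_zip_table(conn, sample_csv)
match = county_for_zip(conn, "98660")
```

ZIP codes are stored as text so that leading zeros survive, and a local lookup like this has no daily request limit.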

justforwhimsy commented 6 years ago

Okay, so almost all of the counties got collected and there are about 156K pages in the database now. The code and database are located in my repository. I updated the readme with some information on what each column is, but if someone has questions about it, I'm happy to answer.

gati commented 6 years ago

This is so amazing! Thank you @justforwhimsy for working on this. I think @alejandrox1 is working on collecting comments from these pages, after which I think we'll have a really amazing dataset that we can use to tease out what people care about across the US. This is so cool!