TromboneDavies / PolarOps


Get time series data to use for The Mythical Graph(tm) #33

Closed divilian closed 3 years ago

divilian commented 3 years ago

Using Reddit, NYT, Twitter, or whatever, let's gather some posts that are timestamped with month/year, and run the classifier on them and see if we can produce the mythical time series graph.

divilian commented 3 years ago

@akochans is doing some preliminary investigation into this.

divilian commented 3 years ago

Subtasks:

  1. Find out which Subreddits we've trained on are "most balanced" between polarized and unpolarized. (This task is somewhat related to #45)
  2. Actually install psaw. :)
  3. Write a bot that connects to Reddit and asks for all threads for all submissions from that Subreddit from the beginning of time until now, and dumps them in a .csv file with these columns: Subreddit, date, submission ID, Comment ID, the text of all comments in the thread smooshed together. See how many you get in 24 hours, so we can start to get an estimate of how much data we can get this summer.
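
A minimal sketch of the row format from subtask 3, using only the standard library. The function names, the month/year date format, and the choice to join all comment IDs into one space-separated cell are assumptions made for illustration; the actual bot may lay out its columns differently.

```python
import csv
from datetime import datetime, timezone

# Column order follows the subtask description; the exact header names are assumed.
COLUMNS = ["subreddit", "date", "submission_id", "comment_ids", "text"]


def thread_to_row(subreddit, created_utc, submission_id, comments):
    """Build one CSV row for a thread.

    `comments` is a list of (comment_id, body) tuples (hypothetical shape).
    """
    # Keep only month/year, since that is the granularity the graph needs.
    date = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
    comment_ids = " ".join(comment_id for comment_id, _ in comments)
    # "Smoosh" the comments together: collapse internal whitespace, join with spaces.
    text = " ".join(" ".join(body.split()) for _, body in comments)
    return [subreddit, date, submission_id, comment_ids, text]


def write_rows(handle, rows):
    """Write a header line followed by the given rows to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```

Collapsing whitespace before joining keeps each thread on one logical CSV record even when comments contain newlines.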

divilian commented 3 years ago

Looks like date search function on Reddit has actually been broken for 3 years. @rockladyeagles will investigate whether "before/after" could be a substitute.

jk, PSAW actually will allow us to do this, it was just temporarily broken.

akochans commented 3 years ago

As of 6/27:

All Polarized: BannedFromThe_Donald, TruthLeaks, Liberal

All Non-Polarized: AskPolitics, ModeratePolitics, NeutralPolitics, AbortionDebate (r/moderatepolitics and r/neutralpolitics appear twice in our training data due to discrepancies in capitalization; this is also true for r/conservative)

Only near-even split: Politics (9:10)

A little bit of a mix: Congress, Conservative, IllegalImmigration, NeverTrump, Republican, Centrist, Progun
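
One way such a breakdown could be computed is sketched below. It assumes the training data can be read as (subreddit, is_polarized) pairs, which is a hypothetical shape, and it lowercases subreddit names so that the capitalization duplicates collapse into one entry. The function names are illustrative, not taken from the repo.

```python
from collections import Counter


def polarized_share(labeled_posts):
    """Return {subreddit: fraction of posts labeled polarized}.

    `labeled_posts` is an iterable of (subreddit, is_polarized) pairs
    (an assumed format for the training data).
    """
    polarized = Counter()
    total = Counter()
    for subreddit, is_polarized in labeled_posts:
        key = subreddit.lower()  # merge "ModeratePolitics" and "moderatepolitics"
        total[key] += 1
        if is_polarized:
            polarized[key] += 1
    return {name: polarized[name] / total[name] for name in total}


def most_balanced(labeled_posts):
    """Sort subreddits from most balanced (share near 0.5) to least."""
    shares = polarized_share(labeled_posts)
    return sorted(shares, key=lambda name: abs(shares[name] - 0.5))
```

A share of 0.0 or 1.0 corresponds to the "all non-polarized" and "all polarized" groups; values near 0.5 correspond to the near-even split.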

Of the "near-even split or little bit of mix" subreddits, these are all 2010 or earlier:

divilian commented 3 years ago

@TromboneDavies has finished coding a basic bot.

First steps:

Second steps:

TromboneDavies commented 3 years ago

A list of various subreddits and their creation date:

BannedFromThe_Donald: 3/4/2016
TruthLeaks: 2/25/2017
Liberal: 3/2/2009
Ask_Politics: 10/26/2011
ModeratePolitics: 11/9/2010
NeutralPolitics: 2/14/2012
AbortionDebate: 3/22/2012
Politics: 8/6/2007
Congress: 3/6/2009
Conservative: 1/25/2008
IllegalImmigration: 10/24/2011
NeverTrump: 2/28/2016
Republican: 10/10/2008
Centrist: 4/24/2009
Progun: 12/18/2012

divilian commented 3 years ago

Changes to collector.py:

  1. If .csv file is not present, create an empty one.
  2. Run script forever.
  3. Instead of grabbing n each time, grab a small number of threads each time. And use the previously most-recent submission time as the "after" argument. (Either store this somewhere when you retrieve a new submission, or go back to the API in step 3 to get the date of the most recent submission ID, which is in the .csv file.)
  4. Immediately write (and flush()) those to the .csv file.
  5. Lather, rinse, repeat. (CTRL-C must be used to terminate.)
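
The loop described above could look roughly like the sketch below. It is not the actual collector.py: `fetch_batch` is a hypothetical callable standing in for whatever wraps the PSAW query, and it is assumed to return submissions newer than `after`, oldest first, as dicts with `id`, `created_utc`, and `text` keys. The three-column row layout and the `max_batches` escape hatch (useful for testing) are also assumptions.

```python
import csv
import os
import time


def last_timestamp(csv_path):
    """Recover the most recent submission time already stored in the CSV."""
    with open(csv_path, newline="") as handle:
        stamps = [int(row[1]) for row in csv.reader(handle) if row]
    return max(stamps, default=0)


def collect(csv_path, fetch_batch, batch_size=25, pause_seconds=1.0, max_batches=None):
    """Append small batches to the CSV until interrupted (CTRL-C).

    With max_batches=None the loop runs forever, as in the list above.
    """
    # Step 1: create an empty file if none exists yet.
    if not os.path.exists(csv_path):
        open(csv_path, "w", newline="").close()

    # Step 3: resume from the newest submission we already have.
    after = last_timestamp(csv_path)
    batches = 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        while max_batches is None or batches < max_batches:
            for post in fetch_batch(after, batch_size):
                writer.writerow([post["id"], post["created_utc"], post["text"]])
                after = max(after, post["created_utc"])
            # Step 4: flush right away so an interrupt loses nothing.
            handle.flush()
            batches += 1
            time.sleep(pause_seconds)
    return after
```

Reading the resume point back from the file, rather than keeping it only in memory, means the script can be stopped and restarted without re-downloading earlier threads. Submissions that share the exact same timestamp as the resume point would be skipped by a strict "after" filter, so a real collector may want to handle that case.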

divilian commented 3 years ago

Use 'bots as a proxy for polarization? On the theory that when a comment is rejected, that means it was "[moderated]" because it was polarized? Clever thought, but after discussion, maybe this isn't reliable.

divilian commented 3 years ago

Commit d114591 is almost there. Additional bells and whistles to add:

  1. When we're done reading all the data from a subreddit, detect that and quit.
  2. (When we hit a rate limit, pause, instead of banging our head against the API? @TromboneDavies says "doesn't matter" and he's probably right.)
  3. When we read from the subreddit and get data that is already in the CSV, take action (terminate?)
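
Items 1 and 3 could be handled with a small check on each batch, sketched below. The helper names and the assumption that the submission ID sits in the first CSV column are hypothetical; what to do on "duplicates" (terminate, skip, or something else) is left open, as in the list above.

```python
import csv


def load_seen_ids(csv_path, id_column=0):
    """Collect the submission IDs already present in the CSV."""
    with open(csv_path, newline="") as handle:
        return {row[id_column] for row in csv.reader(handle) if row}


def batch_status(batch_ids, seen_ids):
    """Classify a freshly fetched batch of submission IDs.

    Returns "done" when the API returned nothing (the subreddit is exhausted),
    "duplicates" when any ID is already in the CSV, and "new" otherwise.
    """
    if not batch_ids:
        return "done"
    if any(submission_id in seen_ids for submission_id in batch_ids):
        return "duplicates"
    return "new"
```

The collector loop can then quit on "done" and decide separately how to react to "duplicates".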

divilian commented 3 years ago

Closing this issue and spawning new ones, since we've now handled what the gist of this issue was. New issues: #52 #53 #54 #56

divilian commented 3 years ago

Sucking on 4th of July weekend: