Data4Democracy / far-right-analysis

Analysis related to the behavior of extreme far right online communities

Network mapping/analysis prototype #7

Open sjacks26 opened 7 years ago

sjacks26 commented 7 years ago

This is a proof-of-concept for network mapping/analysis of far-right blogs/websites. Basically, my (rough) thinking is we could grab all the links in a blog, parse them, and count the domains so we can generate some normalized "citation" score. I think we should collect both blogrolls and links mentioned in posts, but we should keep them separate: something appearing on a blogroll means something different than being mentioned in a post.

So here are 3 blogs that are huge:

Steps:

  1. Find all links in posts
  2. Parse all links to identify domain name
  3. Generate domain counts (normalized by total number of pages, or total number of out-links, or something else)
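A minimal sketch of the three steps above, using only the Python standard library. It assumes each post is already available as an HTML string (for example, read from a mirrored copy of the site). The names `LinkCollector`, `extract_domains`, and `citation_scores` are made up for illustration, and normalizing by total out-links is just one of the options mentioned in step 3.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Step 1: collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_domains(html, own_domain=None):
    """Step 2: reduce each link to its domain, skipping links without a host."""
    parser = LinkCollector()
    parser.feed(html)
    domains = []
    for href in parser.hrefs:
        host = urlparse(href).hostname  # None for relative links, mailto:, etc.
        if host is None:
            continue
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org and example.org as one domain
        if host != own_domain:  # drop the blog's links to itself
            domains.append(host)
    return domains


def citation_scores(pages, own_domain=None):
    """Step 3: count domains and normalize by the total number of out-links."""
    counts = Counter()
    for html in pages:
        counts.update(extract_domains(html, own_domain))
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()} if total else {}
```

Calling `citation_scores(pages, own_domain="someblog.example")` on a list of post HTML strings returns a dict mapping each linked domain to its share of all out-links. Blogroll links could go through the same functions in a separate call so the two kinds of "citation" stay apart.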

We can recursive-ize this by doing the same thing with all the domains linked. I have a feeling that the number of posts in level 0 might mean level 1 is astronomically large, but I don't know that.
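One way to keep the recursion from blowing up is to cap the depth and never revisit a domain. Here is a rough sketch of that idea as a breadth-first walk. `crawl_levels` and `get_outlinks` are hypothetical names: `get_outlinks` stands in for whatever function ends up returning the domains a given domain links to, and nothing here does any actual scraping.

```python
from collections import deque


def crawl_levels(seeds, get_outlinks, max_depth=1):
    """Return {domain: level} for every domain within max_depth hops of the seeds.

    Seeds are level 0. A domain at level max_depth is recorded but not expanded,
    so the walk stops there no matter how many links that domain has.
    """
    levels = {domain: 0 for domain in seeds}
    queue = deque(seeds)
    while queue:
        domain = queue.popleft()
        depth = levels[domain]
        if depth >= max_depth:
            continue  # depth cap reached: do not follow this domain's links
        for linked in get_outlinks(domain):
            if linked not in levels:  # skip domains we have already seen
                levels[linked] = depth + 1
                queue.append(linked)
    return levels
```

Running this with `max_depth=1` first and checking `len(levels)` would give a quick read on how large level 1 actually is before committing to anything deeper.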

If this works and we want to scale it up, I have a list of 160 domains, almost all of which are associated with the patriot/militia movement.

Thoughts?

sjacks26 commented 7 years ago

I started working on doing something like this a couple weeks ago and wrote some sloppy Python that did a couple things right but most things wrong. If someone wants to work on this and wants to see that, let me know.

ccarey commented 7 years ago

@sjacks26 Can you upload them to a directory here on GitHub? I'd be interested in taking a look at them.

sjacks26 commented 7 years ago

@ccarey check out https://github.com/Data4Democracy/far-right-analysis/blob/master/citation_analysis/find_hrefs_loop.py. I was trying to do this on sites that I had mirrored, so the script is missing the piece that scrapes from the websites directly.

ccarey commented 7 years ago

@sjacks26 Thanks! I'll try to look into this some this weekend. Will let you know if I have any questions, but just at first glance the code you already have looks like a solid jumping-off point.

bbrewington commented 7 years ago

I wrote some code to scrape all posts for a given blog domain, for the whole year. See pull request: https://github.com/Data4Democracy/far-right-analysis/pull/16