dragosrotaru / ppeforfree

Collective sensemaking for mutual aid groups manufacturing PPE during COVID.
https://ppeforfree.org
GNU General Public License v3.0

Scraping Facebook Groups General Information #8

Closed dragosrotaru closed 4 years ago

dragosrotaru commented 4 years ago

Note on FB Scraping, Data Privacy, Future Roadmap

See #5

Prerequisite: Seed Data

See #6

Requirements

Scraping Facebook Groups General Information

We need data on all the Facebook groups in the community.

The data available on public FB groups (not including content like posts, pics, events, etc.) is captured in the schema below.

Note: I compiled this by manually going through 2 FB group pages. Please go through a few more pages yourself to see whether some groups have more, less, or differing public data available, and we will update our schema.

We will not get any information about individuals other than their Facebook ID. This data is needed because we want to see how connected groups are (how many individuals they have in common) and we want to reach out to those individuals that are in a shit ton of groups! Very useful for coalition-building.
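
A minimal sketch of the connectivity idea above, assuming member lists have already been scraped into a mapping of group ID to member IDs. The function names and the sample data are made up for illustration; nothing here is part of the existing script.

```python
from collections import Counter
from itertools import combinations


def shared_member_counts(groups):
    """Map each pair of group ids to the number of member ids they share."""
    overlaps = {}
    for (id_a, members_a), (id_b, members_b) in combinations(sorted(groups.items()), 2):
        overlaps[(id_a, id_b)] = len(set(members_a) & set(members_b))
    return overlaps


def most_connected_members(groups, min_groups=2):
    """Return (member_id, group_count) pairs for members in at least min_groups groups."""
    counts = Counter()
    for members in groups.values():
        counts.update(set(members))  # set() so a duplicate within one group counts once
    return [(member, n) for member, n in counts.most_common() if n >= min_groups]


# Made-up sample data: group id -> list of member ids
sample_groups = {
    "group-a": ["u1", "u2", "u3"],
    "group-b": ["u2", "u3", "u4"],
    "group-c": ["u3", "u5"],
}

overlaps = shared_member_counts(sample_groups)
connectors = most_connected_members(sample_groups)
```

Here `overlaps` tells us which groups are most tightly linked, and `connectors` lists the people worth reaching out to first.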

Scraping Posts

I started a script in scripts/facebook-group-posts-scraper using this library: https://github.com/kevinzg/facebook-scraper

It works well! But! We NEED to collect the timestamp on all the posts. The library doesn't return it with 100% consistency, so you will have to troubleshoot. We will use this data to make a news aggregator and to keep an eye out for more data for coalition-building purposes.
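
One way to start troubleshooting is to separate the posts that came back with a timestamp from the ones that did not. This is only a sketch: it assumes each scraped post is a dict with a `time` key holding a `datetime`, which is how facebook-scraper's output appears to be shaped (please verify against the version you install). The function name and the sample posts are made up.

```python
from datetime import datetime, timezone


def split_by_timestamp(posts):
    """Separate posts with a usable datetime under "time" from those without."""
    with_time, missing_time = [], []
    for post in posts:
        if isinstance(post.get("time"), datetime):
            with_time.append(post)
        else:
            missing_time.append(post)
    return with_time, missing_time


# Made-up sample posts standing in for scraper output
sample_posts = [
    {"post_id": "1", "text": "Need face shields",
     "time": datetime(2020, 4, 1, 12, 0, tzinfo=timezone.utc)},
    {"post_id": "2", "text": "Filament available", "time": None},
    {"post_id": "3", "text": "Pickup on Friday"},
]

ok, needs_retry = split_by_timestamp(sample_posts)
```

Posts in `needs_retry` can be logged with their IDs so we can see which page layouts break the timestamp extraction.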

How your script will store and normalize the data

The database will be MongoDB.

Schema

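// Assumed placeholder aliases so the type below compiles; not specified in this issue.
type UUID = string;
type TimeStamp = Date;

type Group = {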
  id: UUID,
  name: string,
  foundedOn: TimeStamp,
  public: boolean,
  description: string,
  memberCount: number,
  adminCount: number,
  moderatorCount: number,
  memberCountIncreaseWeekly: number,
  postCountIncreaseMonthly: number,
  postCountIncreaseDaily: number,
  memberList: UUID[],
  adminList: UUID[],
  moderatorList: UUID[],
  pageList: UUID[],
  scrapedAt: TimeStamp,
  scrapeID: UUID,
}
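
A rough sketch of how a script might normalize one scraped group into the shape above before storing it. The field names come from the schema; the function name, the `None` defaults for missing fields, the list de-duplication, and the sample data are all assumptions, not decisions made in this issue.

```python
import uuid
from datetime import datetime, timezone

SCALAR_FIELDS = [
    "id", "name", "foundedOn", "public", "description",
    "memberCount", "adminCount", "moderatorCount",
    "memberCountIncreaseWeekly", "postCountIncreaseMonthly", "postCountIncreaseDaily",
]
LIST_FIELDS = ["memberList", "adminList", "moderatorList", "pageList"]


def build_group_document(raw, scrape_id, scraped_at):
    """Normalize one scraped group (a plain dict) into the Group schema shape."""
    # Missing scalar fields become None so every document has the same keys.
    doc = {field: raw.get(field) for field in SCALAR_FIELDS}
    # Lists are de-duplicated and sorted so repeated scrapes compare cleanly.
    for field in LIST_FIELDS:
        doc[field] = sorted(set(raw.get(field, [])))
    doc["scrapedAt"] = scraped_at
    doc["scrapeID"] = scrape_id
    return doc


# Made-up sample input standing in for one scraped group page
scrape_id = str(uuid.uuid4())
scraped_at = datetime.now(timezone.utc)
raw = {
    "id": "g1",
    "name": "Example PPE Makers",
    "public": True,
    "memberCount": 3,
    "memberList": ["u2", "u1", "u2"],
}
doc = build_group_document(raw, scrape_id, scraped_at)
```

With PyMongo, a document like `doc` could then be stored with `insert_one` on whichever collection we settle on (the collection name is still undecided here).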

Misc

Random lib I found: https://github.com/ParvJain/Facebook-Group-Scraper (please look through)

kurtvan commented 4 years ago

Getting started on this now 👍 Gonna attempt to leverage the facebook-scraper library as much as I can and add timestamp support.