Data4Democracy / indivisible

Aggregating call to action sites into a single application.

build training dataset #24

Open pghosh opened 7 years ago

pghosh commented 7 years ago

We will build training data for classifying actions in the following ways:

  1. Scrape websites and create a CSV with
     1. action text
     2. action tag
  2. Manually screened data from the emails

The dataset will be saved on data.world and used for tagging actions identified in emails.
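A minimal sketch of the CSV layout described above. The column names `action_text` and `action_tag` are assumptions (the issue only names the two fields), and the example rows are made up for illustration.

```python
import csv
import io

# Hypothetical column names; the issue only specifies "action text" and "action tag".
FIELDNAMES = ["action_text", "action_tag"]


def write_training_csv(rows, out):
    """Write (action_text, action_tag) pairs to a file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for text, tag in rows:
        # Collapse internal whitespace so each action stays on one CSV line.
        writer.writerow({"action_text": " ".join(text.split()), "action_tag": tag})


# Made-up rows for illustration only.
example_rows = [
    ("Join the town hall\n on Saturday", "town hall"),
    ("Call your senator about the bill", "phone call"),
]

buffer = io.StringIO()
write_training_csv(example_rows, buffer)
csv_text = buffer.getvalue()
```

Writing to a file-like object keeps the sketch independent of where the file ends up; an opened file (with `newline=""`) could be passed in place of the `StringIO` buffer.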

Websites to start with: https://resistancenearme.org/ and www.risestronger.org

Business value: This task serves as the first step toward auto-tagging action items. This is data collection. The goal is to create a labeled dataset that can be used to train classifiers to auto-tag actions.

To start with, we should use 'event type' from resistancenearme as the tag. The scraping task should map each event's text to its event type, one to one.

For email, we need to analyze the text to see if we can find a pattern that makes the text/action fall into a category. The idea is that if we can identify a pattern, we can write scripts to do the tagging. If not, we should spin up a task to manually go through emails and tag them. Some starting pointers:

  1. Check the email address; some organizations tend to organize certain kinds of tasks.
  2. Look at the verbs; they might actually include something like "rally".

These are just ideas; feel free to add what works and what does not work.
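The two pointers above (verbs in the text, sender address) could be tried with a small rule-based tagger like the sketch below. Everything in it is an assumption for illustration: the keyword list, the tag names, the `example-rallies.org` domain, and the choice to check keywords before the sender. Real rules would have to come from looking at the emails.

```python
import re

# Hypothetical keyword -> tag map; real tags would come from the labeled data.
VERB_TAGS = {
    "rally": "rally",
    "march": "rally",
    "call": "phone call",
    "phone": "phone call",
    "donate": "donation",
}

# Hypothetical sender domain -> tag map, for organizations that
# mostly organize one kind of action.
SENDER_TAGS = {
    "example-rallies.org": "rally",
}


def tag_action(text, sender=""):
    """Return a tag for an email action, or None if no rule matches."""
    # Rule 1: look for a known keyword in the text.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in VERB_TAGS:
            return VERB_TAGS[word]
    # Rule 2: fall back to the sender's domain.
    domain = sender.rsplit("@", 1)[-1].lower()
    return SENDER_TAGS.get(domain)
```

For example, `tag_action("Join the rally downtown")` returns `"rally"`, while a text that matches no rule returns `None` and would be left for manual tagging.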

brucerowan commented 7 years ago

I'm going to give this a try; anyone who wants to help is welcome.

brucerowan commented 7 years ago

I was able to create a .py scraper using Beautiful Soup to scrape the calls to action on risestronger.org. However, there are only 10 items. I will give resistancenearme.org a shot today.

crypdick commented 7 years ago

@brucerowan Any update?

brucerowan commented 7 years ago

@crypdick Hi, sorry for the delayed response. We've made some good progress. I would check out this repository: https://github.com/brucerowan/indivisible/tree/scrap_websites/ingest/web_scraper

brucerowan commented 7 years ago

@crypdick If you are good at object-oriented programming, that would actually help me out a lot. Basically, if you could work out how to apply the base_scraper class in the resistancenearme.py file, that would be a big help. Message me @bruce_r on Slack.
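For anyone picking this up, here is one possible shape for a base scraper with a site-specific subclass. It is not the actual `base_scraper` code from the linked repository: the class names, the `run`/`parse` split, the JSON payload, and the field names `description` and `event_type` are all hypothetical.

```python
import json
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Hypothetical base class: shared cleanup here, site-specific parsing in subclasses."""

    def run(self, raw):
        """Parse a raw payload and return cleaned (action_text, action_tag) rows."""
        rows = []
        for text, tag in self.parse(raw):
            text = " ".join(text.split())  # normalize whitespace
            if text:  # skip records with no usable text
                rows.append((text, tag.strip().lower()))
        return rows

    @abstractmethod
    def parse(self, raw):
        """Yield (action_text, action_tag) pairs from one site's raw payload."""


class ResistanceNearMeScraper(BaseScraper):
    """Assumes a JSON list of events; the field names below are guesses."""

    def parse(self, raw):
        for event in json.loads(raw):
            yield event.get("description", ""), event.get("event_type", "")


# Made-up payload for illustration only.
sample = json.dumps([
    {"description": "Town hall  with the mayor", "event_type": "Town Hall"},
    {"description": "", "event_type": "Rally"},
])
rows = ResistanceNearMeScraper().run(sample)
```

With this split, each new site only needs its own `parse` method, and the cleanup in `run` stays in one place. Fetching the page is left out on purpose so the sketch runs without network access.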