Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources
MIT License

Web scraping: Pull congressional record #27

Closed bstarling closed 7 years ago

bstarling commented 7 years ago

Looking for someone who can work with me to build a spider to pull the congressional record. This needs to be done by the end of the weekend, so it is a tight turnaround. Ideally this is someone with time to spare.

Requirements:

We're looking to parse all 2017 activity from here and return one JSON file per day per category, with the following fields:

date: (congressional date of record)
category: (daily digest, senate, house, extensions)
title:
url: (source URL)
text_blob:
hrefs: (links in the article, e.g. /congressional-record/volume-163/senate-section/page/S554)
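
To make the target output concrete, here is a minimal sketch of grouping scraped items by date and category and writing one JSON file per group. The function names, the `YYYY-MM-DD_category.json` file naming, the date format, and the sample values are all assumptions for illustration; only the six field names come from the requirements above.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_records(records):
    """Group scraped items by (date, category)."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["date"], rec["category"])].append(rec)
    return grouped


def write_daily_files(records, out_dir):
    """Write one JSON file per day per category and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    grouped = group_records(records)
    paths = []
    for date, category in sorted(grouped):
        # Hypothetical naming scheme, e.g. 2017-01-02_daily_digest.json
        slug = category.replace(" ", "_")
        path = out_dir / f"{date}_{slug}.json"
        path.write_text(
            json.dumps(grouped[(date, category)], indent=2),
            encoding="utf-8",
        )
        paths.append(path)
    return paths


def make_record(category, title):
    """Build a placeholder record with the six requested fields."""
    return {
        "date": "2017-01-02",  # placeholder date, not a real record
        "category": category,
        "title": title,
        "url": "https://example.invalid/record",  # placeholder URL
        "text_blob": "Full text of the entry.",
        "hrefs": ["/congressional-record/volume-163/senate-section/page/S554"],
    }


example = [
    make_record("senate", "First item"),
    make_record("senate", "Second item"),
    make_record("daily digest", "Digest item"),
]
```

Calling `write_daily_files(example, "out")` would produce two files here, one for `senate` (holding both senate items) and one for `daily digest`. A spider that emits one object per item can feed its output straight into the grouping step.
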
divyanair91 commented 7 years ago

on it!

johnmarcampbell commented 7 years ago

I actually wrote a spider to parse the congressional record a few weeks ago: https://github.com/johnmarcampbell/concord. It is a little more fine-grained, as it's currently set up to return a JSON object for each item in the record, not just one per category per day.

@divyanair91 It might be helpful to you to look at the spider I wrote.

bstarling commented 7 years ago

closed #38