datamade / war-chest

API for Chicago Aldermen's Campaign Funds
http://chicagoelections.datamade.us/war-chest/

Scrape committee reports #5

Closed by fgregg 10 years ago

fgregg commented 10 years ago

Once the candidates are scraped, we'll have the committee IDs for each candidate. We should then scrape the committee pages.

This should be pretty easy since the committee ID is in the URL. For example, the "525 Political Club" committee's ID is 15786:

http://www.elections.il.gov/campaigndisclosure/CommitteeDetail.aspx?id=15786

What we want to scrape are the reports, like http://www.elections.il.gov/campaigndisclosure/D2Quarterly.aspx?id=498921
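
For reference, here is a minimal sketch of how those two kinds of URLs could be built from the scraped IDs (the helper names are just for illustration, not part of the existing scraper):

# Hypothetical helpers for building the two kinds of pages we need to visit.
BASE = 'http://www.elections.il.gov/campaigndisclosure'

def committee_url(committee_id):
    # Committee detail page; the committee ID comes from the candidate scrape.
    return '%s/CommitteeDetail.aspx?id=%s' % (BASE, committee_id)

def report_url(report_id):
    # D-2 quarterly report detail page, linked from the committee page.
    return '%s/D2Quarterly.aspx?id=%s' % (BASE, report_id)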

On the reports page, do NOT follow the Itemized links or scrape that data. We already have much of that specific receipt and contribution data in a dump, which we should process later.

The key information we want to know is:

Start with Chicago aldermanic races.

While you are working on this, please think about how we could do smart updates once we have all this data.

evz commented 10 years ago

Aw, man, .NET form postbacks. :-1:

fgregg commented 10 years ago

I hear you. On the other hand, after we finish this scraper, I'd like to spin out the DotNetScraper as its own repo. The legistar scraper I've been working on at https://github.com/fgregg/pupa-legistar would subclass the DotNetScraper.

That would be a pretty valuable contribution and make our lives less painful.
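
A rough sketch of the kind of subclassing being described; the class and method names here are assumptions, not the actual pupa-legistar code:

# Sketch: a site-specific scraper that reuses the generic ASP.NET postback
# plumbing (lxmlize, sessionSecrets) from a spun-out DotNetScraper.
class LegistarScraper(DotNetScraper):
    base_url = 'https://chicago.legistar.com'  # placeholder

    def next_page(self, url, page):
        # Ask the server for the next page of results via a form postback.
        payload = self.sessionSecrets(page)
        payload['__EVENTTARGET'] = 'placeholder$next$page$control'
        return self.lxmlize(url, payload)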

evz commented 10 years ago

Now that I'm looking at it, I might be able to get around actually submitting the form on the reports pages. The pages have distinct URLs, and you can figure out how many pages there are by looking at the "Records 1 to 15 out of [whatever]" at the bottom.

Just so we're clear, I'm getting all reports that are available electronically for each committee? And just the candidates in your JSON packet that have 'Alderman/Chicago' listed as the office (for now)?
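
Here's a minimal sketch of that page-count idea, assuming 15 records per page and a footer that reads "Records 1 to 15 out of N" (the function name is made up):

import math
import re

def page_count(footer_text, per_page=15):
    # footer_text looks like "Records 1 to 15 out of 112"
    match = re.search(r'out of\s+([\d,]+)', footer_text)
    if match is None:
        return 1
    total = int(match.group(1).replace(',', ''))
    return int(math.ceil(total / float(per_page)))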

fgregg commented 10 years ago

Do an estimate. If it will take less than a day, grab all the data.

If it will take more, then just grab the data for aldermen that are in the JSON file. Once you have that data, start working on the API.

While you are working on the API, you can grab the rest of the data.

fgregg commented 10 years ago

You can get around the postback by putting the pageindex into the URL:

http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786&pageindex=0
http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786&pageindex=1
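
So walking a committee's pages could look something like this (a sketch only; DotNetScraper and lxmlize are from this project, and the page count would come from the "Records 1 to 15 out of ..." footer discussed above):

# Sketch: fetch every page of a committee's report list by varying
# pageindex in the URL instead of posting the .NET form back.
s = DotNetScraper()
base = 'http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786'

number_of_pages = 3  # hard-coded here only for illustration
pages = []
for page_index in range(number_of_pages):
    pages.append(s.lxmlize('%s&pageindex=%s' % (base, page_index)))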

evz commented 10 years ago

Yeah, got it. This should be pretty simple. I'm actually starting on the database for this, too, because it'll make the whole thing a bit simpler, and once I get to the API part, it'll pretty much just be a matter of figuring out the queries.

fgregg commented 10 years ago

Alternatively, with DotNetScraper the postback is not too painful; something like this will work:

s = DotNetScraper()
url = 'http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786'
page = s.lxmlize(url)
# Pull the ASP.NET session state out of the page, then post back asking for the next page.
payload = s.sessionSecrets(page)
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$Listnavigation$btnPageNext'
next_page = s.lxmlize(url, payload)

evz commented 10 years ago

Do we care about the reports that don't have detail pages? Basically the ones that don't link to anything? So on this page (http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=82&pageindex=0) that would be the "D-2 Non Participation Report" and all the A-1 reports.

fgregg commented 10 years ago

Let's keep a record of them.

evz commented 10 years ago

Cool. Hey, so on that same page, there are reports that list something like '1992 GE' as the reporting period. Any idea what the two-letter codes signify? Looks like there are 'GE', 'GP', 'CE', and 'CP'.

evz commented 10 years ago

Hey, some of these reports are PDFs: http://www.elections.il.gov/CampaignDisclosure/CDPdfViewer.aspx?FiledDocID=417814&DocType=Image

How should I handle that case?

evz commented 10 years ago

Just saving a reference to the file for the moment. I don't think there's really going to be a clean way to get data out of those things.
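
For what it's worth, "saving a reference" might look something like this when a report row only links to the PDF viewer (the record structure here is made up):

# Sketch: when a report only links to the PDF viewer, keep the reference
# instead of trying to parse the document itself.
def report_record(report_name, detail_href):
    if detail_href and 'CDPdfViewer.aspx' in detail_href:
        # Image-only filing; store the URL so we can come back to it later.
        return {'name': report_name, 'detail_url': None, 'pdf_url': detail_href}
    return {'name': report_name, 'detail_url': detail_href, 'pdf_url': None}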

evz commented 10 years ago

@fgregg I just pushed up 25bd2ea884cc89f18f2549e9fb24ca11362baec2, which has a version of the committee scraper in it. One thing that I have yet to account for is when there is a negative value in the amounts reported on the details page (or I'm guessing that's what it means when it's surrounded by parentheses). Anyway, the DB is in the repo so you can take a look at it and see if it seems sound.
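
A small sketch of how those parenthesized amounts could be normalized, assuming they follow the usual accounting convention of parentheses meaning a negative value:

from decimal import Decimal

def parse_amount(raw):
    # "$1,234.56" -> Decimal('1234.56'); "($1,234.56)" -> Decimal('-1234.56')
    cleaned = raw.strip().replace('$', '').replace(',', '')
    if cleaned.startswith('(') and cleaned.endswith(')'):
        return -Decimal(cleaned[1:-1])
    return Decimal(cleaned)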

evz commented 10 years ago

@fgregg So, last night I noticed that my committee report scraper was not getting info from some of the detail pages since they had a slightly different layout. While investigating how to fix it, I noticed that "Total Receipts" doesn't include "Total In-Kind" (for instance, this page: http://www.elections.il.gov/CampaignDisclosure/D2Quarterly.aspx?id=438815). Should I make a separate column for that so that we have that info? Currently, I am saving:

Funds at the start of the reporting period
Funds at the end of the reporting period
Total Receipts
Total Expenditures

Oh, the other thing that is included on those reports is "Total Debts and Obligations". Should we be saving that, too?

Anyway, it took about 3 hours to scrape the reports for the 734 aldermanic candidates that I had after including the ones from the file you dropped in the repo last night. Looks like there are about 23,000 candidates that we've scraped, so scraping all the reports would probably take a few days.
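
For discussion, here's a rough sketch of what one scraped report row could look like with the extra columns mentioned above (the column names and figures are placeholders, not the actual schema in the repo):

report_row = {
    'committee_id': 15786,
    'report_id': 498921,
    'funds_start': 10000.00,        # funds at the start of the reporting period
    'funds_end': 8500.00,           # funds at the end of the reporting period
    'total_receipts': 2000.00,
    'total_expenditures': 3500.00,
    'total_inkind': 0.00,           # proposed extra column for "Total In-Kind"
    'total_debts': 0.00,            # proposed extra column for "Total Debts and Obligations"
}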

fgregg commented 10 years ago

Let's scrape all the details, and sort out their meaning later. Thanks, @evz
