Closed fgregg closed 10 years ago
Aw, man, .NET form postbacks. :-1:
I hear you. On the other hand, after we finish this scraper, I'd like to spin out the DotNetScraper as its own repo. The legistar scraper I've been working on here, https://github.com/fgregg/pupa-legistar, would subclass the DotNetScraper.
Would be a pretty valuable contribution and make our lives less painful.
Now that I'm looking at it, I might be able to get around actually submitting the form on the reports pages. The pages have distinct URLs and you can figure out how many pages by looking at the "Records 1 to 15 out of [whatever]" at the bottom.
Just so we're clear, I'm getting all reports that are available electronically for each committee? And just the candidates in your JSON packet that have 'Alderman/Chicago' listed as the office (for now)?
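The page count mentioned above can be derived from the footer text. This is a minimal sketch, assuming the footer reads exactly like the quoted "Records 1 to 15 out of [whatever]" and that it is read from the first page; `page_count` is a hypothetical helper, not part of the scraper.

```python
import math
import re

# Assumed footer format, taken from the text quoted in this thread.
RECORDS_RE = re.compile(
    r"Records\s+(\d+)\s+to\s+(\d+)\s+out of\s+([\d,]+)", re.IGNORECASE
)


def page_count(footer_text):
    """Return the number of result pages implied by the first page's footer.

    Returns 0 when the footer text is not found.
    """
    match = RECORDS_RE.search(footer_text)
    if match is None:
        return 0
    first, last, total = (int(g.replace(",", "")) for g in match.groups())
    per_page = last - first + 1  # rows shown on the first page
    return math.ceil(total / per_page)


print(page_count("Records 1 to 15 out of 47"))
```

With 47 records at 15 per page this gives 4 pages, so the page indexes to request would run from 0 to 3.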
Do an estimate. If it will take less than a day, grab all the data.
If it will take more, then just grab the data for the aldermen that are in the JSON file. Once you have that data, then start working on the API.
While you are working on the api, you can grab the rest of the data.
773.888.2718 2231 N. Monticello Ave Chicago, IL 60647
You can get around the postback by putting the pageindex into the URL:
http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786&pageindex=0
http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786&pageindex=1
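Building those URLs for every page of a committee might look something like this. It is only a sketch: `committee_page_urls` is a hypothetical name, and the number of pages is assumed to be known already.

```python
from urllib.parse import urlencode

BASE = "http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx"


def committee_page_urls(committee_id, pages):
    """Return one URL per results page; pageindex is zero-based."""
    return [
        f"{BASE}?{urlencode({'id': committee_id, 'pageindex': index})}"
        for index in range(pages)
    ]


for url in committee_page_urls(15786, 2):
    print(url)
```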
Yeah, got it. This should be pretty simple. I'm actually starting on the database for this, too, cause it'll make the whole thing a bit simpler and once I get to the API part, it'll be pretty much just a matter of figuring out the queries.
Alternatively, with DotNetScraper the postback is not too painful. Something like this will work:
s = DotNetScraper()
url = 'http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=15786'
page = s.lxmlize(url)
payload = s.sessionSecrets(page)
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$Listnavigation$btnPageNext'
next_page = s.lxmlize(url, payload)
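For anyone unfamiliar with why the payload is needed: an ASP.NET postback has to echo back the page's hidden form fields (such as `__VIEWSTATE` and `__EVENTVALIDATION`) along with the `__EVENTTARGET`. The sketch below shows one way to collect those fields using only the standard library. It is an assumption about the general mechanism, not a description of what `sessionSecrets` actually does; `hidden_fields` and `HiddenInputParser` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


sample = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz">'
    '<input type="text" name="q" value="1"></form>'
)
payload = hidden_fields(sample)
payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$Listnavigation$btnPageNext"
print(sorted(payload))
```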
Do we care about the reports that don't have detail pages? Basically the ones that don't link to anything? So on this page (http://www.elections.il.gov/CampaignDisclosure/CommitteeDetail.aspx?id=82&pageindex=0) that would be the "D-2 Non Participation Report" and all the A-1 reports.
Let's keep a record of them.
Cool. Hey, so on that same page, there are reports that list something like '1992 GE' as the reporting period. Any idea what the two-letter codes signify? Looks like there are 'GE', 'GP', 'CE', and 'CP'.
Hey, some of these reports are PDFs: http://www.elections.il.gov/CampaignDisclosure/CDPdfViewer.aspx?FiledDocID=417814&DocType=Image
How should I handle that case?
Just saving a reference to the file, for the moment. Don't think there's really going to be a clean way to get data out of those things.
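Telling the three cases apart (a report with a detail page, a PDF, or no link at all) could be done from the link itself. This is a rough sketch based only on the two URL shapes that appear in this thread; `classify_report_link` and the labels it returns are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def classify_report_link(url):
    """Return (kind, document_id) for a report link.

    kind is "pdf", "detail", or "no_detail" (the report has no link).
    """
    if not url:
        return ("no_detail", None)
    parsed = urlparse(url)
    page = parsed.path.rsplit("/", 1)[-1].lower()
    query = parse_qs(parsed.query)
    if page == "cdpdfviewer.aspx":
        # PDF viewer links carry the document id in FiledDocID.
        return ("pdf", query.get("FiledDocID", [None])[0])
    # Assumed: every other report page identifies the report with "id".
    return ("detail", query.get("id", [None])[0])


print(classify_report_link(
    "http://www.elections.il.gov/CampaignDisclosure/CDPdfViewer.aspx"
    "?FiledDocID=417814&DocType=Image"
))
```

Storing the kind next to the URL keeps a record of the PDFs and the link-less reports without trying to parse them.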
@fgregg I just pushed up 25bd2ea884cc89f18f2549e9fb24ca11362baec2, which has a version of the committee scraper in it. One thing that I have yet to account for is when there is a negative value in the amounts reported on the details page (or I'm guessing that's what it means when it's surrounded by parentheses). Anyways, the DB is in the repo so you can take a look at it and see if it seems sound.
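If the parentheses do mean a negative amount (that is a guess in this thread, not something confirmed by the site), handling them could look roughly like this. `parse_amount` is a hypothetical helper, and the "$" and thousands separators are assumed formatting.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse a dollar amount, treating "(...)" as a negative value."""
    cleaned = text.strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    cleaned = cleaned.replace("$", "").replace(",", "").strip()
    value = Decimal(cleaned)
    return -value if negative else value


print(parse_amount("($1,234.56)"))  # -1234.56
print(parse_amount("$500.00"))      # 500.00
```

Using `Decimal` rather than `float` avoids rounding surprises when the amounts are summed later.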
@fgregg So, last night I noticed that my committee report scraper was not getting info from some of the detail pages since they had a slightly different layout. While investigating how to fix it, I noticed that the "Total Receipts" don't include "Total In-Kind" (for instance, this page: http://www.elections.il.gov/CampaignDisclosure/D2Quarterly.aspx?id=438815). Should I make a separate column for that so that we have that info? Currently, I am saving:
Funds at the start of the reporting period
Funds at the end of the reporting period
Total Receipts
Total Expenditures
Oh, the other thing that is included on those reports is "Total Debts and Obligations". Should we be saving that, too?
Anyways, it took about 3 hours to scrape the reports for the 734 aldermanic candidates that I had after including the ones from the file you dropped in the repo last night. Looks like there are about 23,000 candidates that we've scraped, so scraping all the reports would probably take a few days.
Let's scrape all the details, and sort out their meaning later. Thanks, @evz
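The "few days" figure follows from scaling the aldermanic run linearly. The sketch below assumes the average candidate takes as long to scrape as the average aldermanic candidate did, which may not hold; `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(sample_count, sample_hours, total_count):
    """Scale a measured scrape time linearly to a larger set."""
    return total_count * sample_hours / sample_count


hours = estimate_hours(sample_count=734, sample_hours=3, total_count=23000)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints "94 hours, about 3.9 days"
```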
Once the candidates are scraped, we'll have the committee ids for each candidate. We should then scrape the committee pages.
This should be pretty easy since the committee id is in the URL. For example, the committee ID for the "525 Political Club" is 15786:
http://www.elections.il.gov/campaigndisclosure/CommitteeDetail.aspx?id=15786
What we want to scrape are the reports, like http://www.elections.il.gov/campaigndisclosure/D2Quarterly.aspx?id=498921
On the reports page, do NOT follow the Itemized links or scrape that data. We already have much of that specific receipt and contribution data in a dump, which we should process later.
The key information we want to know are
Start with Chicago aldermanic races.
While you are working on this, please think about how we could do smart updates once we have all this data.