datamade / war-chest

API for Chicago Aldermen's Campaign Funds
http://chicagoelections.datamade.us/war-chest/
1 stars 0 forks source link

Some numbers not matching results from older query #34

Open evz opened 10 years ago

evz commented 10 years ago

@fgregg I spent some time yesterday reworking the queries that this app uses to get the values from the committee reports and wanted to make sure that I am doing this correctly based upon what you need for your calculations.

The numbers that don't seem to be matching are the receipts and expenditures from the current and last cycles. You wrote a rather complex query to get around the fact that all the reports don't necessarily have the same names. It looked like this:

SELECT Sum(receipts),
       Sum(expenditures)
FROM   report,
       (SELECT REPLACE(report.type, "(Amendment)", "") AS type,
               period_from,
               period_to,
               committee_id,
               Max(date_filed) AS date_filed
        FROM report
        WHERE  committee_id = 4410
               AND ( type LIKE "D-2 semiannual report%"
                      OR type LIKE "Quarterly%" )
        GROUP BY REPLACE(type, "(Amendment)", ""),
                  period_from,
                  period_to) AS M
WHERE REPLACE(report.type, "(Amendment)", "") = M.type
       AND report.period_from = M.period_from
       AND report.period_to = M.period_to
       AND report.date_filed = M.date_filed
       AND report.committee_id = M.committee_id
       AND report.period_from >= '2011-07-01' 

That particular query is for the latest report from the Friends of Edward M Burke (committee ID 4410) and produces the tuple (1641131.8, 822470.06) for the receipts and expenditures. However, if you go to the page for the latest report that we scraped here you'll see that the total receipts and expenditures from that report are different (367830.84 and 158114.57 respectively). And that is what I am getting from the reworked query that I made based upon a new field (called generic_type) that I added that basically pre-bakes the REPLACE parts in your query above. That query looks like this:

SELECT report.* FROM report
WHERE report.committee_id = 4410 
AND 
    (report.generic_type LIKE "D-2 semiannual report%" 
        OR report.generic_type LIKE "Quarterly%") 
ORDER BY report.date_filed DESC LIMIT 1

Which is simpler and returns the numbers from the report page (total receipts and total expenditures). Am I missing something here? Should I be getting total receipts and expenditures from all of the reports for a given reporting period? I have been unable to reproduce the numbers that your more complex query gives which makes me think there is something happening there that I either don't understand or is unexpected.

fgregg commented 10 years ago

There are two types of numbers we are looking for: 1. current balances, 2. totals over periods.

Yesterday, we discussed queries for the first type of number. For current balance, we try to find the most recent report of type Quarterly or Semiannual and look at only that report.

For the second type, we look at all the Quarterly or Semiannual reports from within an election cycle. For each report, we pull out some number, like expenditure, and we total up all these numbers to get an total expenditure for this period.

This can be tricky because sometimes there's an amended report that supersedes a previous report. We need to make sure that for every reporting period (the period covered by a report), we are only counting the numbers from one report, and that we are using the numbers from the most recently filed report that covers that reporting period.

Make sense?

On Tue, Apr 22, 2014 at 8:49 AM, Eric van Zanten notifications@github.comwrote:

@fgregg https://github.com/fgregg I spent some time yesterday reworking the queries that this app uses to get the values from the committee reports and wanted to make sure that I am doing this correctly based upon what you need for your calculations.

The numbers that don't seem to be matching are the receipts and expenditures from the current and last cycles. You wrote a rather complex query to get around the fact that all the reports don't necessarily have the same names. It looked like this:

SELECT Sum(receipts), Sum(expenditures)FROM report, (SELECT REPLACE(report.type, "(Amendment)", "") AS type, period_from, period_to, committee_id, Max(date_filed) AS date_filed FROM report WHERE committee_id = 4410 AND ( type LIKE "D-2 semiannual report%" OR type LIKE "Quarterly%" ) GROUP BY REPLACE(type, "(Amendment)", ""), period_from, period_to) AS MWHERE REPLACE(report.type, "(Amendment)", "") = M.type AND report.period_from = M.period_from AND report.period_to = M.period_to AND report.date_filed = M.date_filed AND report.committee_id = M.committee_id AND report.period_from >= '2011-07-01'

That particular query is for the latest report from the Friends of Edward M Burke (committee ID 4410) and produces the tuple (1641131.8, 822470.06)for the receipts and expenditures. However, if you go to the page for the latest report that we scraped herehttp://www.elections.state.il.us/CampaignDisclosure/D2Quarterly.aspx?id=521705you'll see that the total receipts and expenditures from that report are different (367830.84 and 158114.57 respectively). And that is what I am getting from the reworked query that I made based upon a new field (called generic_type) that I added that basically pre-bakes the REPLACE parts in your query above. That query looks like this:

SELECT report.* FROM reportWHERE report.committee_id = 4410 AND (report.generic_type LIKE "D-2 semiannual report%" OR report.generic_type LIKE "Quarterly%") ORDER BY report.date_filed DESC LIMIT 1

Which is simpler and returns the numbers from the report page (total receipts and total expenditures). Am I missing something here? Should I be getting total receipts and expenditures from all of the reports for a given reporting period? I have been unable to reproduce the numbers that your more complex query gives which makes me think there is something happening there that I either don't understand or is unexpected.

— Reply to this email directly or view it on GitHubhttps://github.com/datamade/war-chest/issues/34 .

773.888.2718 2231 N. Monticello Ave Chicago, IL 60647

fgregg commented 10 years ago

This query could be dramatically simplified if we added an "active.version" boolean flag to the table.

On Tue, Apr 22, 2014 at 10:32 AM, Forest Gregg fgregg@uchicago.edu wrote:

There are two types of numbers we are looking for: 1. current balances, 2. totals over periods.

Yesterday, we discussed queries for the first type of number. For current balance, we try to find the most recent report of type Quarterly or Semiannual and look at only that report.

For the second type, we look at all the Quarterly or Semiannual reports from within an election cycle. For each report, we pull out some number, like expenditure, and we total up all these numbers to get an total expenditure for this period.

This can be tricky because sometimes there's an amended report that supersedes a previous report. We need to make sure that for every reporting period (the period covered by a report), we are only counting the numbers from one report, and that we are using the numbers from the most recently filed report that covers that reporting period.

Make sense?

On Tue, Apr 22, 2014 at 8:49 AM, Eric van Zanten <notifications@github.com

wrote:

@fgregg https://github.com/fgregg I spent some time yesterday reworking the queries that this app uses to get the values from the committee reports and wanted to make sure that I am doing this correctly based upon what you need for your calculations.

The numbers that don't seem to be matching are the receipts and expenditures from the current and last cycles. You wrote a rather complex query to get around the fact that all the reports don't necessarily have the same names. It looked like this:

SELECT Sum(receipts), Sum(expenditures)FROM report, (SELECT REPLACE(report.type, "(Amendment)", "") AS type, period_from, period_to, committee_id, Max(date_filed) AS date_filed FROM report WHERE committee_id = 4410 AND ( type LIKE "D-2 semiannual report%" OR type LIKE "Quarterly%" ) GROUP BY REPLACE(type, "(Amendment)", ""), period_from, period_to) AS MWHERE REPLACE(report.type, "(Amendment)", "") = M.type AND report.period_from = M.period_from AND report.period_to = M.period_to AND report.date_filed = M.date_filed AND report.committee_id = M.committee_id AND report.period_from >= '2011-07-01'

That particular query is for the latest report from the Friends of Edward M Burke (committee ID 4410) and produces the tuple (1641131.8, 822470.06)for the receipts and expenditures. However, if you go to the page for the latest report that we scraped herehttp://www.elections.state.il.us/CampaignDisclosure/D2Quarterly.aspx?id=521705you'll see that the total receipts and expenditures from that report are different (367830.84 and 158114.57 respectively). And that is what I am getting from the reworked query that I made based upon a new field (called generic_type) that I added that basically pre-bakes the REPLACE parts in your query above. That query looks like this:

SELECT report.* FROM reportWHERE report.committee_id = 4410 AND (report.generic_type LIKE "D-2 semiannual report%" OR report.generic_type LIKE "Quarterly%") ORDER BY report.date_filed DESC LIMIT 1

Which is simpler and returns the numbers from the report page (total receipts and total expenditures). Am I missing something here? Should I be getting total receipts and expenditures from all of the reports for a given reporting period? I have been unable to reproduce the numbers that your more complex query gives which makes me think there is something happening there that I either don't understand or is unexpected.

— Reply to this email directly or view it on GitHubhttps://github.com/datamade/war-chest/issues/34 .

773.888.2718 2231 N. Monticello Ave Chicago, IL 60647

773.888.2718 2231 N. Monticello Ave Chicago, IL 60647