fecgov / FEC

A general discussion forum for FEC.gov. This is the best place to submit general feedback.
https://www.fec.gov
Other
226 stars 55 forks source link

Dirty data present in filings? #11991

Open NickCrews opened 2 years ago

NickCrews commented 2 years ago

Hi! I'm a data engineer with a possible bug. First though, hats off to you all, the whole FEC ecosystem has been surprisingly easy to interact with! So thanks for your work :)

I've been looking at the raw filings summaries that I downloaded from https://www.fec.gov/data/filings/?data_type=processed&min_receipt_date=01%2F01%2F2010&most_recent=true&form_type=F3&form_type=F3P. Hopefully that link should be reproducible, at least for this purpose.

If I do

import pandas as pd

df = pd.read_csv("data/filings.csv", dtype="string")
mask = df["most_recent_file_number"].duplicated(False) # Get ALL dupes, not just the first instance
dupes = df[mask].sort_values("most_recent_file_number")

Then I see the following table. Note how in the most_recent_file_number column, there are a few filings that appear twice. I wouldn't expect this to happen, I would expect there would be only one. Second, if you look at the is_amended column, I would expect that to always be False, since these are the most current versions of the reports. But, one is True. Third, if you look at amendment_chain vs previous_file_number, they don't always agree.

Am I misunderstanding the schema/meanings of this table? Or is this a data integrity problem? If it's a problem, I thought I'd point it out in case you wanted to add some QA to catch these sorts of problems.

office committee_type bank_depository_name bank_depository_street_1 bank_depository_street_2 bank_depository_state committee_id committee_name candidate_id candidate_name cycle sub_id coverage_start_date coverage_end_date receipt_date election_year form_type report_year report_type document_type document_type_full report_type_full beginning_image_number ending_image_number pages total_receipts total_individual_contributions net_donations total_disbursements total_independent_expenditures total_communication_cost cash_on_hand_beginning_period cash_on_hand_end_period debts_owed_by_committee debts_owed_to_committee house_personal_funds senate_personal_funds opposition_personal_funds treasurer_name file_number primary_general_indicator request_type amendment_indicator update_date pdf_url fec_url means_filed is_amended most_recent html_url state party amendment_chain previous_file_number most_recent_file_number amendment_version form_category bank_depository_city bank_depository_zip additional_bank_names
420 H H C00449181 VISCONTI FOR CONGRESS 2010 1101920100004732085 2010-04-01 00:00:00 2010-06-30 00:00:00 2010-10-18 00:00:00 F3 2010 Q2 JULY QUARTERLY 10030452399 10030452414 16 1800 1785 0 16002 0 -395682 P A 2010-10-19 22:09:53 https://docquery.fec.gov/pdf/399/10030452399/10030452399.pdf paper f t {-348889,-395682} -348888 -395682 1 REPORT
89722 H H C00449181 VISCONTI FOR CONGRESS 2010 1101920100004732085 2010-04-01 00:00:00 2010-06-30 00:00:00 2010-10-18 00:00:00 F3 2010 Q2 JULY QUARTERLY 10030452399 10030452414 16 1800 1785 0 16002 0 -395682 P A 2010-10-19 22:09:53 https://docquery.fec.gov/pdf/399/10030452399/10030452399.pdf paper f t {-348888,-395682} -348888 -395682 1 REPORT
22011 P P C00489633 JOHN DAVIS FOR PRESIDENT 2010 1012620110005197616 2010-07-01 00:00:00 2010-09-30 00:00:00 2011-01-25 00:00:00 F3P 2010 Q3 OCTOBER QUARTERLY 11030551299 11030551307 9 10000 7029 0 2970.00 15080 0 -467861 P A 2011-01-26 23:03:56 https://docquery.fec.gov/pdf/299/11030551299/11030551299.pdf paper f t {-399088,-432826,-467861} -432826 -467861 2 REPORT
93393 P P C00489633 JOHN DAVIS FOR PRESIDENT 2010 1012620110005197616 2010-07-01 00:00:00 2010-09-30 00:00:00 2011-01-25 00:00:00 F3P 2010 Q3 OCTOBER QUARTERLY 11030551299 11030551307 9 10000 7029 0 2970.00 15080 0 -467861 P A 2011-01-26 23:03:56 https://docquery.fec.gov/pdf/299/11030551299/11030551299.pdf paper f t {-399088,-467861} -432826 -467861 1 REPORT
11611 P P C00500082 2014 1102420140016248804 2014-07-01 00:00:00 2014-09-30 00:00:00 2014-10-23 00:00:00 F3P 2014 Q3 OCTOBER QUARTERLY 14031324025 14031324042 18 123 123 123 0.00 0 0 -8442913 G A 2014-10-24 21:16:02 https://docquery.fec.gov/pdf/025/14031324025/14031324025.pdf paper f t {-8397828,-8393823,-8442913} -8393823 -8442913 2 REPORT
97793 P P C00500082 2014 1102420140016248804 2014-07-01 00:00:00 2014-09-30 00:00:00 2014-10-23 00:00:00 F3P 2014 Q3 OCTOBER QUARTERLY 14031324025 14031324042 18 123 123 123 0.00 0 0 -8442913 G A 2014-10-24 21:16:02 https://docquery.fec.gov/pdf/025/14031324025/14031324025.pdf paper f t {-8397828,-8442913} -8393823 -8442913 1 REPORT
128561 P P C00685974 STEVE RICHEY FOR PRESIDENT 2020 2020 1110120190194676888 2019-07-01 00:00:00 2019-09-30 00:00:00 2019-10-30 00:00:00 F3P 2019 Q3 OCTOBER QUARTERLY 201911010300300967 201911010300300992 26 0 3838.18 18628.9 14790.71 20000 0 -9634054 A 2019-11-01 21:19:28 https://docquery.fec.gov/pdf/967/201911010300300967/201911010300300967.pdf paper f t {-9629174,-9634054} -9629174 -9634054 1 REPORT
138695 P P C00685974 STEVE RICHEY FOR PRESIDENT 2020 2020 1110120190194676888 2019-07-01 00:00:00 2019-09-30 00:00:00 2019-10-30 00:00:00 F3P 2019 Q3 OCTOBER QUARTERLY 201911010300300967 201911010300300992 26 0 3838.18 18628.9 14790.71 20000 0 -9634054 A 2019-11-01 21:19:28 https://docquery.fec.gov/pdf/967/201911010300300967/201911010300300967.pdf paper t t {-9629174,-9634054} -9629174 -9634054 1 REPORT
PaulClark2 commented 2 years ago

Looking at one of the filings, 1386952, in your table, I don't see anything wrong with the data. Filing 1386952 has not been amended. Most_recent_file_number, file_number, previous_file_number and amendment_chain will all contain the same report ID (file_number), 1386952.

most_recent_file_number file_number previous_file_number amendment_chain is_amended most_recent amendment_version amendment_indicator committee_id form_type report_year report_type
1386952 1386952 1386952 {1386952} f t 0 N C00737213 F3 2020 12P

Looking at a report that has been amended might help, too. Note, for this example, I changed your query parameters to include all versions of filings by removing the "Current Version" filter. The example below shows a report that was amended twice.

most_recent_file_number file_number previous_file_number amendment_chain is_amended most_recent amendment_version amendment_indicator committee_id form_type report_year report_type
1151343 1118027 1118027 {1118027} t f 0 N C00554709 F3 2016 12G
1151343 1131084 1118027 {1118027,1131084} t f 1 A C00554709 F3 2016 12G
1151343 1151343 1131084 {1118027,1131084,1151343} f t 2 A C00554709 F3 2016 12G

Most_recent_file_number is the same for all three entries because only one filing is the most recent.

New (original report) (amendment_version = 0): most_recent_file_number = file_number of amendment 2, file_number, previous_file_number and amendment_chain all contain the same report ID (file_number). The is_amended flag is true, and the most_recent flag is false.

Amendment 1 (amendment_version = 1): most_recent_file_number = file_number of amendment 2, file_number = file_number of this report (amendment 1), previous_file_number and amendment_chain = file_number of the original report. The is_amended flag is true, and the most_recent flag is false.

Amendment 2 (amendment_version = 1): most_recent_file_number and file_number = file_number of this report (amendment 2), previous_file_number = file_number of amendment 1, amendment_chain is an array containing the file_numbers of the original, amendment 1 and amendment 2. The is_amended flag is false, and the most_recent flag is true.