datamade / openness-project-nmid

Money Trail NM - New Mexico In Depth's Campaign Finance Explorer
https://moneytrailnm.com
3 stars 1 forks source link

updated campfin transaction data #172

Closed derekeder closed 3 years ago

derekeder commented 3 years ago

Another round of data from the New Mexico Secretary of State. This is likely the last import before they transition to a new system.

data notes

The data provided is the Cam_ContribExpenditure.csv data. The file sent was a zip file that contained 2 files:

Here are the steps I took to prep and import the data. I used csvkit, sql, LibreOffice and Sublime Text

  1. File needed to be converted to csv. For this I used in2csv

in2csv Cam_ContribExpenditure_1.xlsx > Cam_ContribExpenditure_1.csv in2csv Cam_ContribExpenditure_2.xlsx > Cam_ContribExpenditure_2.csv

  1. Next, the two files needed to be combined into one. I used csvstack

csvstack Cam_ContribExpenditure_1.csv Cam_ContribExpenditure_2.csv > Cam_ContribExpenditure.csv

After combining, I zipped up Cam_ContribExpenditure.csv and dropped it in the /data folder. However, the import failed.

  1. After some investigating, the DateAdded field was missing from Cam_ContribExpenditure_2.csv, so I opened up LibreOffice and added the column. The column could not be blank, so I copied the date data from DateContribution into this column.

I re-ran the import, but the import failed again. This is the error I got:

sqlalchemy.exc.DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(255)

[SQL:
            UPDATE camp_fin_transaction SET
              id=s."contribexpenditureid"::bigint, contact_id=s."contactid"::bigint, amount=s."amount"::money::numeric::double precision, received_date=s."datecontribution"::timestamp with time zone, date_added=s."dateadded"::timestamp with time zone, check_number=s."checknumber"::varchar, memo=s."memo"::text, description=s."description"::varchar, transaction_type_id=s."contribexpendituretypeid"::bigint, filing_id=s."reportid"::bigint, olddb_id=s."olddbcontribexpenditureid"::bigint, name_prefix=s."prefix"::varchar, first_name=s."firstname"::varchar, middle_name=s."middlename"::varchar, last_name=s."lastname"::varchar, suffix=s."suffix"::varchar, company_name=s."companyname"::varchar, address=s."address"::varchar, city=s."city"::varchar, state=s."state"::varchar, zipcode=s."zip"::varchar, county_id=s."countyid"::bigint, country=s."country"::varchar, contact_type_id=s."contacttypeid"::bigint, transaction_status_id=s."transactionstatusid"::bigint, from_file_id=s."fromfileid"::bigint, contact_type_other=s."contacttypeother"::varchar, occupation=s."occupation"::varchar, expenditure_for_certified_candidate=s."isexpenditureforcertifiedcandidate"::boolean
            FROM (
              SELECT raw."contribexpenditureid", raw."contactid", raw."amount", raw."datecontribution", raw."dateadded", raw."checknumber", raw."memo", raw."description", raw."contribexpendituretypeid", raw."reportid", raw."olddbcontribexpenditureid", raw."prefix", raw."firstname", raw."middlename", raw."lastname", raw."suffix", raw."companyname", raw."address", raw."city", raw."state", raw."zip", raw."countyid", raw."country", raw."contacttypeid", raw."transactionstatusid", raw."fromfileid", raw."contacttypeother", raw."occupation", raw."isexpenditureforcertifiedcandidate"
              FROM raw_transaction AS raw
              JOIN change_transaction AS change
                ON raw."contribexpenditureid" = change.id
            ) AS s
            WHERE camp_fin_transaction.id = s."contribexpenditureid"
        ]

It looks like one of the fields was more than 255 characters. This was likely do to a data formatting issue.

  1. To diagnose, I fired up psql since the data was already in the raw_transaction table. I looked for any rows that were supposed to load into a field with a 255 character limit but were longer than that:
SELECT raw.* 
FROM raw_transaction AS raw
JOIN change_transaction AS change
ON raw."contribexpenditureid" = change.id
WHERE
length(raw."firstname") > 255 OR
length(raw."middlename") > 255 OR
length(raw."lastname") > 255 OR
length(raw."companyname") > 255 OR
length(raw."address") > 255 OR
length(raw."occupation") > 255

Sure enough, there was one row with a glitch in the address field:

Screen Shot 2020-09-18 at 3 24 25 PM

I loaded up the Cam_ContribExpenditure.csv file in Sublime Text (combined it was too large for LibreOffice) and searched for the corrupted line and fixed it by hand.

  1. Re zipped Cam_ContribExpenditure.csv, dropped it in the /data folder, and re-ran the import:

python manage.py import_data

This time, success!