dedupeio / dedupe-examples

:id: Examples for using the dedupe library
MIT License
404 stars 216 forks source link

mysql_init_db.py and contributors data #42

Closed skabbit closed 6 years ago

skabbit commented 8 years ago

For now, some of data is not compatible with mysql_init_db.py script. Here is the traceback log of error:

downloading Illinois-campaign-contributions.txt.zip (~60mb) ...
extracting Illinois-campaign-contributions.txt.zip
importing raw data from csv...
creating donors table...
creating indexes on donors table
creating recipients table...
creating contributions table
Traceback (most recent call last):
  File "mysql_init_db.py", line 153, in <module>
    c.execute("INSERT INTO contributions "
  File "/Users/skabbit/venv/venv/lib/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/Users/skabbit/venv/venv/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1411, "Incorrect datetime value: ' ' for function str_to_date")

Is it MySQL version problem? Or is it a because of downloaded data has changed?

skabbit commented 8 years ago

I've found out, it is because of sql_mode, which is lead to error on ' ' dates. Add c.execute("SET sql_mode = ''") after line c = conn.cursor() to prevent this error.

fgregg commented 8 years ago

I think newer versions of mysql are stricter than the version that this script was originally developed with. Would prefer fixes to the queries instead of turning off strictness.

pdamodaran commented 7 years ago

I was able to get past the issue by modifying line 153 to c.execute("INSERT IGNORE INTO contributions ". INSERT IGNORE will not insert records with invalid values.

jbeales commented 6 years ago

Some of the data has extra columns in the addresses. For example:

3215737 Chico & Nunes, P.C.     333 West Wacker Drive, Suite 1650   333 West Wacker Drive   Suite 1800  Chicago IL  60606           08/23/2010  0.0000  250.0000    Individual Contribution                                                             07/01/2010  12/31/2010  Citizens for David E Miller 14923   
3091186 Chico & Nunes, P.C.     333 W. Wacker Drive Suite 1650  Chicago IL  60606           07/15/2010  0.0000  1000.0000   Individual Contribution                                                             07/01/2010  12/31/2010  Citizens for Maldonado  9533    

Columns are separated by tabs. The first entry has a tab between the 2nd "333 West Wacker Drive" and "Suite 1800", so "Suite 1800" gets bumped to the raw_data.city column of the DB, and every column after that is offset by one.

mysql> select * from raw_table WHERE reciept_id=3215737;
+------------+---------------------+------------+-----------------------------------+-----------------------+------------+---------+------+-------------+---------------+-------------+--------+--------------+-------------------------+------------+------------------+-------------------+------------------+------------------+-------------+--------------+------------+-------------+---------------+---------------+---------------------+-------------------+----------------+-----------------------------+
| reciept_id | last_name           | first_name | address_1                         | address_2             | city       | state   | zip  | report_type | date_recieved | loan_amount | amount | receipt_type | employer                | occupation | vendor_last_name | vendor_first_name | vendor_address_1 | vendor_address_2 | vendor_city | vendor_state | vendor_zip | description | election_type | election_year | report_period_begin | report_period_end | committee_name | committee_id                |
+------------+---------------------+------------+-----------------------------------+-----------------------+------------+---------+------+-------------+---------------+-------------+--------+--------------+-------------------------+------------+------------------+-------------------+------------------+------------------+-------------+--------------+------------+-------------+---------------+---------------+---------------------+-------------------+----------------+-----------------------------+
|    3215737 | Chico & Nunes, P.C. |            | 333 West Wacker Drive, Suite 1650 | 333 West Wacker Drive | Suite 1800 | Chicago | IL   | 60606       |               | 08/23/2010  | 0.0000 | 250.0000     | Individual Contribution |            |                  |                   |                  |                  |             |              |            |             |               |               |                     | 07/01/2010        | 12/31/2010     | Citizens for David E Miller |
+------------+---------------------+------------+-----------------------------------+-----------------------+------------+---------+------+-------------+---------------+-------------+--------+--------------+-------------------------+------------+------------------+-------------------+------------------+------------------+-------------+--------------+------------+-------------+---------------+---------------+---------------------+-------------------+----------------+-----------------------------+
1 row in set (1.75 sec)

There are 6 columns like this. Since this is a demo the best answer might be to remove those rows from Illinois-campaign-contributions.txt

Or, alternately, run a DELETE FROM raw_table WHERE LENGTH(date_recieved)=0 once the import is complete.

Edit: It turns out some dates are also set to 0.0000 due to a missing tab character in the address section so DELETE FROM raw_table WHERE LENGTH(date_recieved) < 10 works better. We are deleting very few records in total with this statement - something like 10.