NCVotes / voters-ingestor

Django app to create and configure Postgres DB, then fetch and process public voter registration and voter history files from N.C. Board of Elections
MIT License
10 stars 1 forks source link

failing on ncid that doesn't appear to be in db #4

Closed rtburg closed 6 years ago

rtburg commented 6 years ago

From @calebsmith

After I fixed the immediate issue with this I've been trying to get the process caught back up. I've processed through a few days of files without issue but I've hit upon a problem that I just can't seem to figure out. I've pasted the stacktrace below:

Traceback (most recent call last): File "/var/www/ncvoter/src/ncvoter/manage.py", line 22, in execute_from_command_line(sys.argv) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/core/management/init.py", line 363, in execute_from_command_line utility.execute() File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/core/management/init.py", line 355, in execute self.fetch_command(subcommand).run_from_argv(self.argv) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv self.execute(args, cmd_options) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute output = self.handle(args, options) File "/var/www/ncvoter/src/ncvoter/voter/management/commands/voter_process.py", line 275, in handle process_files(output=True, county_num=county_num, initial=initial) File "/var/www/ncvoter/src/ncvoter/voter/management/commands/voter_process.py", line 228, in process_files process_changes(output, file_tracker) File "/usr/lib/python3.6/contextlib.py", line 53, in inner return func(args, kwds) File "/var/www/ncvoter/src/ncvoter/voter/management/commands/voter_process.py", line 146, in process_changes NCVoter.objects.bulk_create(unwritten_ncvoters) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/models/manager.py", line 85, in manager_method return getattr(self.get_queryset(), name)(args, kwargs) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/models/query.py", line 442, in bulk_create ids = self._batched_insert(objs_without_pk, fields, batch_size) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/models/query.py", line 1077, in _batched_insert inserted_id = self._insert(item, fields=fields, using=self.db, return_id=True) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/models/query.py", line 1060, in _insert return query.get_compiler(using=using).execute_sql(return_id) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1099, in execute_sql cursor.execute(sql, params) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/backends/utils.py", line 65, in execute return self.cursor.execute(sql, params) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/utils.py", line 94, in exit six.reraise(dj_exc_type, dj_exc_value, traceback) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/utils/six.py", line 685, in reraise raise value.with_traceback(tb) File "/var/www/ncvoter/env/lib/python3.6/site-packages/django/db/backends/utils.py", line 65, in execute return self.cursor.execute(sql, params) django.db.utils.IntegrityError: duplicate key value violates unique constraint "voter_ncvoter_ncid_key" DETAIL: Key (ncid)=(CL62260) already exists.

As far as I can tell, this is indicating that it can't add a record to the voter_ncvoter table because a record with the ncid 'CL62260' already exists. My initial thought is that this seems like a bug in the code and that the data should be updating the existing row rather than trying to create a new one.

However, when I look at the database for a record with that ncid, I can't find it. I'm really uncertain how to proceed. I spent a few hours trying to research and dig a bit with no luck. I've only ever seen this error before when the record already exists with that value.

rtburg commented 6 years ago

Can confirm that no such ncid exists in table:SELECT * FROM voter_ncvoter WHERE ncid='CL62260';

rtburg commented 6 years ago

It looks like the ncid appears three times in voter_changetracker table -- twice from file_tracker_id 1105 and both times with op_code 'A'. Then one more time from file_tracker_id 1905, and that time with op_code 'M': SELECT * FROM voter_changetracker WHERE data ->> 'ncid'='CL62260';

It looks like there are about a dozen ncids that appear twice with op_code 'A' in the 1105 file_tracker. (This also occurs in other file_trackers): SELECT ncid, COUNT(ncid) AS count FROM voter_changetracker WHERE op_code='A' AND file_tracker_id='1905' GROUP BY ncid ORDER BY count DESC ;

I think I'm not clear about when a row gets inserted into voter_changetracker and when it is assiged an op_code of 'A' vs. 'M'

bill10 commented 6 years ago

just to catch up, this means that the raw data is messy and include duplicates of "CL62260"? So the code failed to update the records of "CL62260" because it does not know which one to update?

bill10 commented 6 years ago

I am trying to reproduce the error and see what is going on. @calebsmith or @rtburg, could you let me know where the old files are stored? Thank you!

calebsmith commented 6 years ago

Thanks for looking into this. I was stumped by it and don't have the necessary bandwidth to dig in unfortunately.

From Ryan's observations about having multiple records with the 'A' op_code, I think that is a bug. The changetracker is supposed to only ever have 1 addition for something, and modifications after that. That is the intended behavior at least, and obviously that went awry somehow if we have 2.

I'm not sure if the raw data has duplicates. That does seem worth looking into.

The downloads are inside of /var/www/ncvoter/src/ncvoter/downloads . You may need to sudo as the ncvoter user because that should be the owner of everything there. (And your user should have permissions to do so).

Hopefully this can help get you started. Let me know other questions as they arise and I'll try to do my best.

bill10 commented 6 years ago

thanks, caleb. I found the files and I am debugging now.

bill10 commented 6 years ago

It is weird, as there is only one "CL62260" in the file 1105 (downloads/ncvoter/2017-08-13T02:00:42:1502589642/ncvoter53.txt). I will let django re-process this file and see if it is a bug in the code or something went wrong when that file was processed.

bill10 commented 6 years ago

After some investigations, I think caleb’s idea is right. The file 1105 was probably processed in parallel which resulted in conflicts, and none of the new ncid’s in that file was inserted into the voter table. After I set this file’s processed status to false, removed the problematic ncid’s (attached below) from the change_tracker table, and re-processed file 1105, everything seems fine. Now those ncid’s are added to the voter table, as it should be. And the status of this file 1105 is correct.

But 1105 is not the only bad file, there are 742 files whose change_tracker_processed=true and updates_processed=false. To re-process those files, I need to know what revised ncid’s are in them and remove them from change_tracker table. It is probably also necessary to pause the cron job in case it conflicts with the processing again.

caleb, is this idea too dangerous or crazy?

But the good news is that if the parallel issue is the only problem, then we don’t really have to modify caleb’s original code as it is working properly — we just need to make sure that any new voter file won’t be processed by more than one parsers at the same time.

bill10 commented 6 years ago

Just to document what I did:

  1. Turn off cron job by comment out the line in /etc/cron.d/voter-process.

  2. Find files that were processed but not inserted into the database

    select id 
    from voter_filetracker
    where change_tracker_processed=true and updates_processed=false;

    742 records found.

  3. Delete records parsed from those files in the voter_changetracker table

    delete from voter_changetracker 
    where file_tracker_id in (select id 
    from voter_filetracker
    where change_tracker_processed=true and updates_processed=false);

    Removed 3232705 rows.

  4. Set those files to un-processed

    update voter_filetracker
    set change_tracker_processed=false
    where change_tracker_processed=true and updates_processed=false;

    Updated 742 records.

  5. Re-process the files

    cd /var/www/ncvoter
    sudo ./manage.sh voter_process
  6. Resume the cron job by uncommenting the line in /etc/cron.d/voter-process.

Of course, the database was backed up before anything was done. The process is still running and I will check back tomorrow.

rtburg commented 6 years ago

@bill10 Thanks for doing this. Any updates? How is the process going?

Also, what do we need to do to make sure that any new voter file won’t be processed by more than one parser at the same time?

bill10 commented 6 years ago

I just checked and it is still processing the file Processing NCVHis data from file downloads/ncvhis/2017-10-20T02:01:10:1508464870/ncvhis90.txt. At the current speed, it might another day or two.

There is not much we need to do to make sure new files won't be parsed in parallel other than not doing it. Basically, caleb's code will do the processing automatically once in a while. If anyone wants to do the same thing manually, just remember to pause caleb's code. I am writing that into the github documentation.

rtburg commented 6 years ago

@bill10 I recently downgraded the EC2 instance to m3.large. Was a t2.2xlarge when it originally ran back in July, so let me know if we need to temporarily increase firepower.

bill10 commented 6 years ago

Here are Ryan's checks. The issue seems to be solved. I am going to close it now, and resume scheduled parsing in the server.

SELECT COUNT(*)
FROM voter_filetracker
WHERE date_trunc('day', voter_filetracker.created) > '2017-08-06 00:00:00+00'
AND (voter_filetracker.change_tracker_processed = 'false'
OR voter_filetracker.updates_processed = 'false')
-- I get zero records.

-- I also can’t find any records with a blank “registr_dt” field:
SELECT *
FROM voter_ncvoter
WHERE registr_dt IS NULL;

-- There appear to be 7,888,053 records in ncvoter table.
-- And the same number of records have a date that is not null.

-- Our missing voter now exists in the database one and only one time:
SELECT *
FROM voter_ncvoter
WHERE ncid = 'CL62260';

-- This voter’s record was changed three times:
SELECT *
FROM voter_changetracker
WHERE ncid='CL62260';
--The voter was “ADDED” to the database on 8/13/2017
--The voter’s record was then MODIFIED on 8/20/2017
--And MODIFIED again on 9/10/2017