IndexError after upgrading to 1.4.6

asieira commented 6 years ago

I am using gglsbl-rest for queries at a large scale, and after upgrading to 1.4.6 I've started seeing the following errors in my logs:

[2017-09-30 14:23:40,537] ERROR in app: exception handling [https://vignette.wikia.nocookie.net/gameofthrones/images]
Traceback (most recent call last):
File "/root/gglsbl-flask/app.py", line 37, in app_lookup
sbl = SafeBrowsingList(api_key, active['name'], True)
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 33, in __init__
self.storage = SqliteStorage(db_path)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 60, in __init__
if not self.check_schema_version():
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 74, in check_schema_version
v = dbc.fetchall()[0][0]
IndexError: list index out of range

Still trying to get steps that reliably reproduce this issue, will post them here if I get them.

afilipovich commented 6 years ago

From what I understand, it may happen if there is a race condition between one process reading from DB while another process initializes it.

DB is initialized on a first run if .sqlite file does not exist or after gglsbl upgrade if its schema version does not match one expected by the library.

asieira commented 6 years ago

So gglsbl-rest code will only start the HTTP listener or switch to a new database if it has been fully downloaded first. Plus, downgrading back to gglsbl 1.4.5 immediately stopped all occurrences of this error on the logs while using the exact same gglsbl-rest code otherwise.

Did anything change between 1.4.5 and 1.4.6 that might have this effect?

afilipovich commented 6 years ago

I have changed schema in 1.4.6 to improve cache update performance (about 2x). Since new schema is not compatible with previous schema, I also added mechanism to make sure schema matches codebase.

Can you please try 1.4.6 one more time, but delete .sqlite DB file before the first run? It looks like it got corrupted somehow...

asieira commented 6 years ago

When I upgrade and deploy to AWS ECS, brand new containers running 1.4.6 are spun up, and gradually replace the containers using 1.4.5. So we can be pretty certain the scenario where a 1.4.5-created database was read by the 1.4.6 code did not happen.

afilipovich commented 6 years ago

I still cannot see any obvious reason for such condition. How frequently does this error happen, say per 1k queries?

Also I would suggest to initialize SafeBrowsingList() upon app start and re-use sbl object between queries, this would reduce per-query overhead.

asieira commented 6 years ago

You can see here the period of time when I switched, there was a very significant spike in 500 errors that was caused by this:

The volume of 500 errors dropped immediately after I rolled back to 1.4.5.

Thanks for the performance tip, I'll try to make the object persistent across calls.

afilipovich commented 6 years ago

I tried to replicate the issue by running gglsbl-rest docker container locally but everything worked as expected. The only way to trigger that error is to manually delete records from "metadata" table in SQLite cache.

I do not see how it can happen in normal course of events as "metadata" table gets populated immediately after creation: https://github.com/afilipovich/gglsbl/blob/master/gglsbl/storage.py#L97

The only explanation I can think of is that container disk was full with just enough space for single cache file, while second cache file could only be created but not populated with data.

asieira commented 6 years ago

By the way, I did implement your suggestion of reusing the SafeBrowsingList object when possible, and it did have a dramatically positive impact on performance. Thanks for bringing that up.

asieira commented 6 years ago

Did the code changes between 1.4.5 and 1.4.5 significantly increase the disk space usage of the local DB? Maybe that could be the root cause of the problems I observed.

afilipovich commented 6 years ago

Not really, a few Kb at most... I think the problem just became visible.

asieira commented 6 years ago

So I simulated the scenario where the disk space ended, and sure enough the sqlite driver raises an exception:

Traceback (most recent call last):
  File "update.py", line 59, in <module>
    update_hash_prefix_cache()
  File "/usr/local/lib/python2.7/site-packages/pid/decorator.py", line 14, in decorator
    return func(*func_args, **func_kwargs)
  File "update.py", line 46, in update_hash_prefix_cache
    sbl.update_hash_prefix_cache()
  File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 75, in update_hash_prefix_cache
    self.storage.populate_hash_prefix_list(response_threat_list, hash_prefix_list)
  File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 314, in populate_hash_prefix_list
    dbc.executemany(q, records)
sqlite3.OperationalError: database or disk is full

That didn't happen on the IndexError scenario. 🤔

asieira commented 6 years ago

So I deployed gglsbl-rest 1.2.0 on my production environment again.

Turns out the size increase for a fresh database increased from around 1.7 gigs to 2.1 gigs, about 30% increase.

Strangely, though, I'm not getting the IndexError exceptions immediately after the upgrade as I used to. 🤔

I'll close this for now, and will let you know if this reoccurs. Filing under "heisenbug". 😄

afilipovich commented 6 years ago

This is a real mystery, I am glad it fixed itself :)

asieira commented 6 years ago

Sure enough, it's back:

[2017-10-27 07:12:51,979] ERROR in app: exception handling [https://yt3.ggpht.com/-djh3mwbyimg/aaaaaaaaaai]
Traceback (most recent call last):
File "/root/gglsbl-rest/app.py", line 34, in _lookup
sbl = SafeBrowsingList(api_key, active['name'], True)
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 33, in __init__
self.storage = SqliteStorage(db_path)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 60, in __init__
if not self.check_schema_version():
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 74, in check_schema_version
v = dbc.fetchall()[0][0]
IndexError: list index out of range

afilipovich commented 6 years ago

Can you please share .sqlite cache file so I can inspect it? (might be difficult to extract it from Docker)

asieira commented 6 years ago

Right, I think I may have found the issue:

2017-07-03T14:15:58+0000 pid=5 update INFO downloading database to /root/gglsbl-flask/db/gsb_v4.a.db.tmp
Traceback (most recent call last):
File "update.py", line 59, in <module>
update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/pid/decorator.py", line 14, in decorator
return func(*func_args, **func_kwargs)
File "update.py", line 46, in update_hash_prefix_cache
sbl.update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 72, in update_hash_prefix_cache
self.storage.populate_hash_prefix_list(response_threat_list, hash_prefix_list)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 271, in populate_hash_prefix_list
dbc.executemany(q, records)
sqlite3.OperationalError: disk I/O error

This is different from not enough disk space... any idea what might cause this exception to be raised?

asieira commented 6 years ago

Found this reference here, might help: https://gist.github.com/17twenty/8831301

afilipovich commented 6 years ago

I found that update.py touches JOURNAL file: https://github.com/mlsecproject/gglsbl-rest/blob/master/update.py#L50

This is kinda unusual as JORNAL file is created only for the duration of transaction. If there is JOURNAL file when update.py switches cache DBs, it is likely that transaction is is progress and moving files around may corrupt DB.

asieira commented 6 years ago

Interesting. So the journal file will eventually be deleted by itself? If I find one should I wait until it is no longer there?

afilipovich commented 6 years ago

I'd do some tests to make sure it gets auto-deleted. However if sqlite connection is closed, no transaction should be running...

asieira commented 6 years ago

How about adding a close method to the SafeBrowsingList class that ensures any pending transactions are committed and that the sqlite connection is closed? Then I could call that before moving the file in update.py.

asieira commented 6 years ago

Again, rolling back to using 1.4.5 immediately solved everything:

afilipovich commented 6 years ago

It's possible that 1.4.5 is also broken, but does not raise an exception.

afilipovich commented 6 years ago

Can you please add del sbl here: https://github.com/mlsecproject/gglsbl-rest/blob/master/update.py#L48

It should close sqlite connection.

asieira commented 6 years ago

Good suggestion, will try it and get back to you.

asieira commented 6 years ago

Nope, still having issues:

asieira commented 6 years ago

BTW, I also added another 10 gigs of disk space to each of my ECS instances, so it's likely not disk space issues either since this is 5x the database size of extra space.

afilipovich commented 6 years ago

Uhm, okay, thanks for giving it a try... Could you please try to comment out this line and see if it solves these errors? https://github.com/afilipovich/gglsbl/blob/master/gglsbl/storage.py#L66

It will likely slow down cache update but should keep DB file consistent.

asieira commented 6 years ago

Did some reading here and was wondering if actually switching to sqlite WAL mode could allow gglsbl-rest to keep a single copy of the database and update it in-place... Should be possible, given how this would allow reads and writes to happen concurrently. What do you think?

asieira commented 6 years ago

Maybe dropping the entire "switching databases over time" logic could do away with the errors we're seeing, and make gglsbl-rest waaaaay simpler to boot.

asieira commented 6 years ago

Also, I just realized Alpine 3.4 (which gglsbl-rest is currently using) is based on sqlite 3.13 and many bugs have been corrected since that. I'll try Alpine 3.6 which has sqlite 3.20.1 and see if that helps at all.

asieira commented 6 years ago

Wow, this solved it. Apparently some fix between sqlite 3.13 and 3.20.1 took care of this problem.

I'll update the gglsbl-rest Dockerfile to use Alpine 3.6 and am finally closing this now.

afilipovich / gglsbl

IndexError after upgrading to 1.4.6 #28