Closed asieira closed 6 years ago
From what I understand, it may happen if there is a race condition between one process reading from DB while another process initializes it.
DB is initialized on a first run if .sqlite file does not exist or after gglsbl upgrade if its schema version does not match one expected by the library.
So gglsbl-rest code will only start the HTTP listener or switch to a new database if it has been fully downloaded first. Plus, downgrading back to gglsbl 1.4.5 immediately stopped all occurrences of this error on the logs while using the exact same gglsbl-rest code otherwise.
Did anything change between 1.4.5 and 1.4.6 that might have this effect?
I have changed schema in 1.4.6 to improve cache update performance (about 2x). Since new schema is not compatible with previous schema, I also added mechanism to make sure schema matches codebase.
Can you please try 1.4.6 one more time, but delete .sqlite DB file before the first run? It looks like it got corrupted somehow...
When I upgrade and deploy to AWS ECS, brand new containers running 1.4.6 are spun up, and gradually replace the containers using 1.4.5. So we can be pretty certain the scenario where a 1.4.5-created database was read by the 1.4.6 code did not happen.
I still cannot see any obvious reason for such condition. How frequently does this error happen, say per 1k queries?
Also I would suggest to initialize SafeBrowsingList() upon app start and re-use sbl object between queries, this would reduce per-query overhead.
You can see here the period of time when I switched, there was a very significant spike in 500 errors that was caused by this:
The volume of 500 errors dropped immediately after I rolled back to 1.4.5.
Thanks for the performance tip, I'll try to make the object persistent across calls.
I tried to replicate the issue by running gglsbl-rest docker container locally but everything worked as expected. The only way to trigger that error is to manually delete records from "metadata" table in SQLite cache.
I do not see how it can happen in normal course of events as "metadata" table gets populated immediately after creation: https://github.com/afilipovich/gglsbl/blob/master/gglsbl/storage.py#L97
The only explanation I can think of is that container disk was full with just enough space for single cache file, while second cache file could only be created but not populated with data.
By the way, I did implement your suggestion of reusing the SafeBrowsingList
object when possible, and it did have a dramatically positive impact on performance. Thanks for bringing that up.
Did the code changes between 1.4.5 and 1.4.5 significantly increase the disk space usage of the local DB? Maybe that could be the root cause of the problems I observed.
Not really, a few Kb at most... I think the problem just became visible.
So I simulated the scenario where the disk space ended, and sure enough the sqlite driver raises an exception:
Traceback (most recent call last):
File "update.py", line 59, in <module>
update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/pid/decorator.py", line 14, in decorator
return func(*func_args, **func_kwargs)
File "update.py", line 46, in update_hash_prefix_cache
sbl.update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 75, in update_hash_prefix_cache
self.storage.populate_hash_prefix_list(response_threat_list, hash_prefix_list)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 314, in populate_hash_prefix_list
dbc.executemany(q, records)
sqlite3.OperationalError: database or disk is full
That didn't happen on the IndexError scenario. 🤔
So I deployed gglsbl-rest 1.2.0 on my production environment again.
Turns out the size increase for a fresh database increased from around 1.7 gigs to 2.1 gigs, about 30% increase.
Strangely, though, I'm not getting the IndexError exceptions immediately after the upgrade as I used to. 🤔
I'll close this for now, and will let you know if this reoccurs. Filing under "heisenbug". 😄
This is a real mystery, I am glad it fixed itself :)
Sure enough, it's back:
[2017-10-27 07:12:51,979] ERROR in app: exception handling [https://yt3.ggpht.com/-djh3mwbyimg/aaaaaaaaaai]
Traceback (most recent call last):
File "/root/gglsbl-rest/app.py", line 34, in _lookup
sbl = SafeBrowsingList(api_key, active['name'], True)
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 33, in __init__
self.storage = SqliteStorage(db_path)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 60, in __init__
if not self.check_schema_version():
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 74, in check_schema_version
v = dbc.fetchall()[0][0]
IndexError: list index out of range
Can you please share .sqlite cache file so I can inspect it? (might be difficult to extract it from Docker)
Right, I think I may have found the issue:
2017-07-03T14:15:58+0000 pid=5 update INFO downloading database to /root/gglsbl-flask/db/gsb_v4.a.db.tmp
Traceback (most recent call last):
File "update.py", line 59, in <module>
update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/pid/decorator.py", line 14, in decorator
return func(*func_args, **func_kwargs)
File "update.py", line 46, in update_hash_prefix_cache
sbl.update_hash_prefix_cache()
File "/usr/local/lib/python2.7/site-packages/gglsbl/client.py", line 72, in update_hash_prefix_cache
self.storage.populate_hash_prefix_list(response_threat_list, hash_prefix_list)
File "/usr/local/lib/python2.7/site-packages/gglsbl/storage.py", line 271, in populate_hash_prefix_list
dbc.executemany(q, records)
sqlite3.OperationalError: disk I/O error
This is different from not enough disk space... any idea what might cause this exception to be raised?
Found this reference here, might help: https://gist.github.com/17twenty/8831301
I found that update.py touches JOURNAL file: https://github.com/mlsecproject/gglsbl-rest/blob/master/update.py#L50
This is kinda unusual as JORNAL file is created only for the duration of transaction. If there is JOURNAL file when update.py switches cache DBs, it is likely that transaction is is progress and moving files around may corrupt DB.
Interesting. So the journal file will eventually be deleted by itself? If I find one should I wait until it is no longer there?
I'd do some tests to make sure it gets auto-deleted. However if sqlite connection is closed, no transaction should be running...
How about adding a close
method to the SafeBrowsingList
class that ensures any pending transactions are committed and that the sqlite connection is closed? Then I could call that before moving the file in update.py.
Again, rolling back to using 1.4.5 immediately solved everything:
It's possible that 1.4.5 is also broken, but does not raise an exception.
Can you please add del sbl
here:
https://github.com/mlsecproject/gglsbl-rest/blob/master/update.py#L48
It should close sqlite connection.
Good suggestion, will try it and get back to you.
Nope, still having issues:
BTW, I also added another 10 gigs of disk space to each of my ECS instances, so it's likely not disk space issues either since this is 5x the database size of extra space.
Uhm, okay, thanks for giving it a try... Could you please try to comment out this line and see if it solves these errors? https://github.com/afilipovich/gglsbl/blob/master/gglsbl/storage.py#L66
It will likely slow down cache update but should keep DB file consistent.
Did some reading here and was wondering if actually switching to sqlite WAL mode could allow gglsbl-rest to keep a single copy of the database and update it in-place... Should be possible, given how this would allow reads and writes to happen concurrently. What do you think?
Maybe dropping the entire "switching databases over time" logic could do away with the errors we're seeing, and make gglsbl-rest waaaaay simpler to boot.
Also, I just realized Alpine 3.4 (which gglsbl-rest is currently using) is based on sqlite 3.13 and many bugs have been corrected since that. I'll try Alpine 3.6 which has sqlite 3.20.1 and see if that helps at all.
Wow, this solved it. Apparently some fix between sqlite 3.13 and 3.20.1 took care of this problem.
I'll update the gglsbl-rest Dockerfile to use Alpine 3.6 and am finally closing this now.
I am using gglsbl-rest for queries at a large scale, and after upgrading to 1.4.6 I've started seeing the following errors in my logs:
Still trying to get steps that reliably reproduce this issue, will post them here if I get them.