Closed vorpal-buildbot closed 5 years ago
An example:
500 error at /seasons/7/cards/Fool's Tome/
Failed to execute SELECT * FROM _cache_card AS c WHERE ((1 = 1) AND c.name IN ('Fool''s Tome')) with [] because of (1146, "Table 'cards._cache_card' doesn't exist")
Reported on decksite by logged_out
Request Data ```
Request Method: GET
Path: /seasons/7/cards/Fool's Tome/?
Cookies: {}
Endpoint: seasons.card
View Args: {'name': "Fool's Tome"}
Person: logged_out
Referrer: None
Request Data: {}
Host: pennydreadfulmagic.com
Accept-Encoding: gzip
Cf-Ipcountry: US
X-Forwarded-For: 46.229.168.145, 172.69.62.252
Cf-Ray: 4efff0bfffbcc198-IAD
X-Forwarded-Proto: https
Cf-Visitor: {"scheme":"https"}
Accept: text/html, application/rss+xml, application/atom+xml, text/xml, text/rss+xml
User-Agent: Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
Cf-Connecting-Ip: 46.229.168.145
Cdn-Loop: cloudflare
X-Forwarded-Host: pennydreadfulmagic.com
X-Forwarded-Server: pennydreadfulmagic.com
Connection: Keep-Alive
</details>
--------------------------------------------------------------------------------
<details><summary>
DatabaseException
Failed to execute `SELECT * FROM _cache_card AS c WHERE ((1 = 1) AND c.name IN ('Fool''s Tome'))` with `[]` because of `(1146, "Table 'cards._cache_card' doesn't exist")`
</summary>
Stack Trace:
```Python traceback
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2328, in __call__
return self.wsgi_app(environ, start_response)
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2314, in wsgi_app
response = self.handle_exception(e)
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2311, in wsgi_app
response = self.full_dispatch_request()
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1834, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1737, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/home/discord/.local/lib/python3.6/site-packages/flask/_compat.py", line 36, in reraise
raise value
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1832, in full_dispatch_request
rv = self.dispatch_request()
File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1818, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "./decksite/cache.py", line 79, in decorated_function
response = make_response(f(*args, **kwargs))
File "./decksite/main.py", line 158, in card
c = cs.load_card(oracle.valid_name(urllib.parse.unquote_plus(name)), season_id=get_season_id())
File "./decksite/data/card.py", line 67, in load_card
c = guarantee.exactly_one(oracle.load_cards([name]))
File "./magic/oracle.py", line 45, in load_cards
rs = db().select(sql)
File "./shared/database.py", line 41, in select
[_, rows] = self.execute_anything(sql, args)
File "./shared/database.py", line 60, in execute_anything
raise DatabaseException('Failed to execute `{sql}` with `{args}` because of `{e}`'.format(sql=sql, args=args, e=e))
Theoretically this happens atomically because of the RENAME TABLE found here that swaps in a new table in the same action as swapping out the old one.
def update_cache() -> None:
db().execute('DROP TABLE IF EXISTS _new_cache_card')
db().execute('SET group_concat_max_len=100000')
db().execute(create_table_def('_new_cache_card', card.base_query_properties(), base_query()))
db().execute('CREATE UNIQUE INDEX idx_u_card_id ON _new_cache_card (card_id)')
db().execute('CREATE UNIQUE INDEX idx_u_name ON _new_cache_card (name(142))')
db().execute('CREATE UNIQUE INDEX idx_u_names ON _new_cache_card (names(142))')
db().execute('DROP TABLE IF EXISTS _old_cache_card')
db().execute('CREATE TABLE IF NOT EXISTS _cache_card (_ INT)') # Prevent error in RENAME TABLE below if bootstrapping.
**** ATOMIC SWAMP **** => db().execute('RENAME TABLE _cache_card TO _old_cache_card, _new_cache_card TO _cache_card')
db().execute('DROP TABLE IF EXISTS _old_cache_card')
I'm sure this used to work. Gonna try and get the date this started. Is this a mariadb thing??
mariadb docs claim it is not guilty. https://mariadb.com/kb/en/library/rename-table/
The rename operation is done atomically, which means that no other session can access any of the tables while the rename is running. For example, if you have an existing table old_table, you can create another table new_table that has the same structure but is empty, and then replace the existing table with the empty one as follows (assuming that backup_table does not already exist):
CREATE TABLE new_table (...);
RENAME TABLE old_table TO backup_table, new_table TO old_table;
Could "no other session can access any of the tables while the rename is running" mean it will seem to not exist? That seems like a pretty dubious interpretation of that sentence. I'd expect a timeout waiting not a "that doesn't exist".
There are various historical dates on which this happened hundreds of times. They are all one-offs in that they are followed by days/weeks/months of no occurrences afterwards. Where this changes is in May: Jan 19, Mar 27, Apr 24, May 18, May 19, May 20, May 21 … right up to present. So we're looking for something that changed May 18. Notably Jun 1-Jun 11 was without incident. And it's not every day since Jun 11 but it is most of them.
Only May 18 PR is a matplotlib requires.io update.
May 17 has a similar flask update.
These do not seem incriminating.
All recent occurrences are juuust before 3.30am except a few at just after 4am. Pacific time.
That goes all the way back through all the June occurrences and the late May occurrences.
3:30AM pacific is 8.30pm on the same day AEST which is the server time.
crontab is:
### Price Grabber
01 * * * * cd /home/discord/prices/ && ./run.sh
# Website Maintenance
28 * * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh scraper all >scraper.log 2>&1
05 0 * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh maintenance all >maintenance.log 2>&1
45 * * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh maintenance hourly >maintenance-hourly.log 2>&1
# rotation checker
00 * * * * cd /home/discord/legality/Legality\ Generator/ && ./runLegalityCheck.sh
# Gatherling test site
00 */4 * * * cd /home/discord/public_html/gatherling/ && ./update.sh
# modo_bugs
54 * * * * cd /home/discord/bughooks/ && /bin/bash bootstrap.sh modo_bugs
that 28 mins past sticks out like a sore thumb. All the "just before 3:30AM" are at 3:28 or 3:29AM.
Aha!
db().execute('DROP TABLE IF EXISTS _cache_card')
is in update_database.
That's a case where we DON'T swap it out.
Theory:
The first scraper run after scryfall's daily update to their bulk data forces a db update which makes the table unavailable for up to a minute or two.
I am pretty sure this is right!
We used to not care about this too much because under mtgjson updates were a handful of times per year. But scryfall's bulk-data is updated most days but not all days which fits perfectly with what we see.
We need to make update_database smarter? Stop dropping the whole db on every updated? Having the whole site go down for a couple of minutes every day seems like a bad idea.
One possibility here is to remove the FK associations in _cache_card and leave it "up" while we rebuild the rest of the db then build a _new_cache_card and swap it out as we do in update_cache
or even just call update_cache
. Would that keep things humming or would we just see some other failure when we try to join _cache_card or similar?
So instead of dropping _cache_card first to allow the db rebuild we can do
SET FOREIGN_KEY_CHECKS=0
… do the necessary …
SET FOREIGN_KEY_CHECKS=1
update_cache()
I'm gonna try and test this on local and see if it functions.
So, definitely a big slowdown on some of my requests while the scrapers all
task was running (which DID kick off a db update because I forced it to by changing 'has scryfall updated' to 'True') but no errors. So I think in principle this is a fix for the above issue.
I wonder if a full update_cache
is unnecessarily expensive and if there's something "smaller" I can do to get a new _cache_card but I think I'll just proceed with this?
Tagging @tsutton who has been in this code (to make it so much faster) recently in case they see something I'm missing.
f4ca6658
Reported on Discord by bakert#2193