PennyDreadfulMTG / Penny-Dreadful-Tools

A suite of tools for the Penny Dreadful MTGO community
https://pennydreadfulmagic.com
MIT License

We get little bursts (daily, all at the same time) of 500 errors because "table _cache_card does not exist". This should not happen #6345

Closed: vorpal-buildbot closed this issue 5 years ago

vorpal-buildbot commented 5 years ago

Reported on Discord by bakert#2193

bakert commented 5 years ago

An example:

500 error at /seasons/7/cards/Fool's Tome/

Failed to execute SELECT * FROM _cache_card AS c WHERE ((1 = 1) AND c.name IN ('Fool''s Tome')) with [] because of (1146, "Table 'cards._cache_card' doesn't exist")

Reported on decksite by logged_out

Request Data:

```
Request Method: GET
Path: /seasons/7/cards/Fool's Tome/?
Cookies: {}
Endpoint: seasons.card
View Args: {'name': "Fool's Tome"}
Person: logged_out
Referrer: None
Request Data: {}
Host: pennydreadfulmagic.com
Accept-Encoding: gzip
Cf-Ipcountry: US
X-Forwarded-For: 46.229.168.145, 172.69.62.252
Cf-Ray: 4efff0bfffbcc198-IAD
X-Forwarded-Proto: https
Cf-Visitor: {"scheme":"https"}
Accept: text/html, application/rss+xml, application/atom+xml, text/xml, text/rss+xml
User-Agent: Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
Cf-Connecting-Ip: 46.229.168.145
Cdn-Loop: cloudflare
X-Forwarded-Host: pennydreadfulmagic.com
X-Forwarded-Server: pennydreadfulmagic.com
Connection: Keep-Alive
```

DatabaseException

Failed to execute `SELECT * FROM _cache_card AS c WHERE ((1 = 1) AND c.name IN ('Fool''s Tome'))` with `[]` because of `(1146, "Table 'cards._cache_card' doesn't exist")`
Stack Trace:

```Python traceback
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2328, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2314, in wsgi_app
    response = self.handle_exception(e)
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/discord/.local/lib/python3.6/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/discord/.local/lib/python3.6/site-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./decksite/cache.py", line 79, in decorated_function
    response = make_response(f(*args, **kwargs))
  File "./decksite/main.py", line 158, in card
    c = cs.load_card(oracle.valid_name(urllib.parse.unquote_plus(name)), season_id=get_season_id())
  File "./decksite/data/card.py", line 67, in load_card
    c = guarantee.exactly_one(oracle.load_cards([name]))
  File "./magic/oracle.py", line 45, in load_cards
    rs = db().select(sql)
  File "./shared/database.py", line 41, in select
    [_, rows] = self.execute_anything(sql, args)
  File "./shared/database.py", line 60, in execute_anything
    raise DatabaseException('Failed to execute `{sql}` with `{args}` because of `{e}`'.format(sql=sql, args=args, e=e))
```
bakert commented 5 years ago

Theoretically this happens atomically because of the RENAME TABLE in update_cache, which swaps in the new table in the same action as it swaps out the old one.

```python
def update_cache() -> None:
    db().execute('DROP TABLE IF EXISTS _new_cache_card')
    db().execute('SET group_concat_max_len=100000')
    db().execute(create_table_def('_new_cache_card', card.base_query_properties(), base_query()))
    db().execute('CREATE UNIQUE INDEX idx_u_card_id ON _new_cache_card (card_id)')
    db().execute('CREATE UNIQUE INDEX idx_u_name ON _new_cache_card (name(142))')
    db().execute('CREATE UNIQUE INDEX idx_u_names ON _new_cache_card (names(142))')
    db().execute('DROP TABLE IF EXISTS _old_cache_card')
    db().execute('CREATE TABLE IF NOT EXISTS _cache_card (_ INT)') # Prevent error in RENAME TABLE below if bootstrapping.
    db().execute('RENAME TABLE _cache_card TO _old_cache_card, _new_cache_card TO _cache_card') # **** THE ATOMIC SWAP ****
    db().execute('DROP TABLE IF EXISTS _old_cache_card')
```

I'm sure this used to work. Gonna try and get the date this started. Is this a mariadb thing??

bakert commented 5 years ago

mariadb docs claim it is not guilty. https://mariadb.com/kb/en/library/rename-table/

The rename operation is done atomically, which means that no other session can access any of the tables while the rename is running. For example, if you have an existing table old_table, you can create another table new_table that has the same structure but is empty, and then replace the existing table with the empty one as follows (assuming that backup_table does not already exist):

```sql
CREATE TABLE new_table (...);
RENAME TABLE old_table TO backup_table, new_table TO old_table;
```
bakert commented 5 years ago

Could "no other session can access any of the tables while the rename is running" mean it will seem to not exist? That seems like a pretty dubious interpretation of that sentence. I'd expect a wait or a timeout, not a "table doesn't exist" error.

bakert commented 5 years ago

There are various historical dates on which this happened hundreds of times. They are all one-offs in that they are followed by days/weeks/months with no occurrences afterwards. Where this changes is in May: Jan 19, Mar 27, Apr 24, May 18, May 19, May 20, May 21 … right up to the present. So we're looking for something that changed May 18. Notably Jun 1-Jun 11 was without incident. It hasn't been every day since Jun 11, but it has been most of them.

bakert commented 5 years ago

The only May 18 PR is a matplotlib requires.io update.

May 17 has a similar flask update.

Neither seems incriminating.

bakert commented 5 years ago

All recent occurrences are juuust before 3:30am, except a few at just after 4am. Pacific time.

bakert commented 5 years ago

That goes all the way back through all the June occurrences and the late May occurrences.

bakert commented 5 years ago

3:30AM Pacific is 8:30PM on the same day AEST, which is the server timezone.
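A quick sanity check of that conversion (a throwaway sketch, not project code; the date is illustrative and Brisbane is used as a stand-in for an AEST zone with no daylight saving):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# 3:28 AM Pacific on an illustrative June date like those in question.
pacific = datetime(2019, 6, 12, 3, 28, tzinfo=ZoneInfo('America/Los_Angeles'))
server = pacific.astimezone(ZoneInfo('Australia/Brisbane'))  # AEST, UTC+10
print(server)  # 2019-06-12 20:28:00+10:00, i.e. 8:28 PM the same day
```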

crontab is:

```
### Price Grabber
01 * * * * cd /home/discord/prices/ && ./run.sh
# Website Maintenance
28 * * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh scraper all >scraper.log 2>&1
05 0 * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh maintenance all >maintenance.log 2>&1
45 * * * * cd /home/discord/decksite/ && /bin/bash bootstrap.sh maintenance hourly >maintenance-hourly.log 2>&1
# rotation checker
00 * * * * cd /home/discord/legality/Legality\ Generator/ && ./runLegalityCheck.sh
# Gatherling test site
00 */4 * * * cd /home/discord/public_html/gatherling/ && ./update.sh
# modo_bugs
54 * * * * cd /home/discord/bughooks/ && /bin/bash bootstrap.sh modo_bugs
```

That 28 mins past sticks out like a sore thumb. All the "just before 3:30AM" occurrences are at 3:28 or 3:29AM.

bakert commented 5 years ago

Aha!

```python
db().execute('DROP TABLE IF EXISTS _cache_card')
```

is in update_database.

That's a case where we DON'T swap it out.

Theory:

The first scraper run after scryfall's daily update to their bulk data forces a db update, which makes the table unavailable for up to a minute or two.

I am pretty sure this is right!
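Roughly what a request hitting that window does (a sketch: the SQL is the one from the error at the top, and db()/select are the same helpers that appear in the quoted code and stack trace):

```python
# While update_database has dropped _cache_card and not yet recreated it,
# any card page load ends up running something like this:
sql = "SELECT * FROM _cache_card AS c WHERE ((1 = 1) AND c.name IN ('Fool''s Tome'))"
rows = db().select(sql)
# ...which raises DatabaseException with
# (1146, "Table 'cards._cache_card' doesn't exist") -- the 3:28/3:29am errors.
```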

bakert commented 5 years ago

We used to not care about this too much because under mtgjson updates came only a handful of times per year. But scryfall's bulk data is updated most days (not all), which fits perfectly with what we see.

We need to make update_database smarter? Stop dropping the whole db on every update? Having the whole site go down for a couple of minutes every day seems like a bad idea.

bakert commented 5 years ago

One possibility here is to remove the FK associations on _cache_card and leave it "up" while we rebuild the rest of the db, then build a _new_cache_card and swap it in as we do in update_cache, or even just call update_cache. Would that keep things humming, or would we just see some other failure when we try to join against _cache_card or similar?

bakert commented 5 years ago

So instead of dropping _cache_card first to allow the db rebuild we can do:

SET FOREIGN_KEY_CHECKS=0 … do the necessary … SET FOREIGN_KEY_CHECKS=1, then update_cache()

I'm gonna try and test this on local and see if it functions.
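A minimal sketch of that shape, using the same db() helper and update_cache() as the code quoted above (the "do the necessary" rebuild steps are elided because they live in update_database, which isn't quoted here; the function name is just for illustration):

```python
def update_database_without_dropping_cache() -> None:
    # Leave _cache_card in place while the rest of the db is rebuilt, so reads
    # keep working during the scryfall import.
    db().execute('SET FOREIGN_KEY_CHECKS=0')
    # … do the necessary: drop and rebuild the tables _cache_card has FK
    # associations with, without touching _cache_card itself …
    db().execute('SET FOREIGN_KEY_CHECKS=1')
    # Then swap in a fresh cache atomically, exactly as update_cache already does.
    update_cache()
```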

bakert commented 5 years ago

So: definitely a big slowdown on some of my requests while the scraper all task was running (which DID kick off a db update because I forced it by changing 'has scryfall updated' to 'True'), but no errors. So I think in principle this is a fix for the above issue.

I wonder if a full update_cache is unnecessarily expensive and if there's something "smaller" I can do to get a new _cache_card, but I think I'll just proceed with this?

bakert commented 5 years ago

Tagging @tsutton, who has been in this code recently (making it so much faster), in case they see something I'm missing.

bakert commented 5 years ago

f4ca6658