bcgov / TheOrgBook

A public repository of verifiable claims about organizations. A key component of the Verifiable Organization Network.
http://von.pathfinder.gov.bc.ca
Apache License 2.0
78 stars 66 forks source link

Processing multiple related credentials in a single transaction can cause deadlocks #812

Closed WadeBarnes closed 4 years ago

WadeBarnes commented 5 years ago

Occasionally there situations where a number of related credentials are posted to the OrgBook for processing. When the credentials are processed in a single transaction there is the potential for a deadlock condition since there is a high likelihood that the changes will affect the same records. An example of this is below.

This does not occur very often. In fact this is the first time I've seen this occur since we started the project. For now we, should ensure we are receiving notifications of posting errors from the agents and monitor the situation and document how often this may occur.

Workaround:

TOB Stack trace:

ERROR 2019-06-18 15:35:14,964 web_protocol 1 139661072897792 Error handling request
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 786 waits for ShareLock on transaction 9634347; blocked by process 787.
Process 787 waits for ShareLock on transaction 9634346; blocked by process 786.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (16370,72) in relation "topic"
SQL statement "SELECT 1 FROM ONLY "public"."topic" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/aiohttp/web_protocol.py", line 390, in start
    resp = await self._request_handler(request)
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/aiohttp/web_app.py", line 366, in _handle
    resp = await handler(request)
  File "/home/indy/api_indy/tob_anchor/views.py", line 208, in store_credential
    client, indy_holder_id(), params, processor, issuer_did)
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/vonx/web/view_helpers.py", line 296, in perform_store_credential
    stored_row, ret_row = proc.result()
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/vonx/web/view_helpers.py", line 271, in _store_credential
    await processor.process_credential_async(stored, origin_did, batch_info)
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/vonx/web/view_helpers.py", line 64, in process_credential_async
    return await asyncio.wrap_future(fut)
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/indy/api_indy/tob_anchor/boot.py", line 112, in run_django_proc
    return proc(*args)
  File "/home/indy/api_indy/tob_anchor/processor.py", line 63, in proc
    return credential_manager.process(cred, origin_did)
  File "/home/indy/api_indy/indy/credential.py", line 384, in process
    return self.populate_application_database(credential_type, credential)
  File "/home/indy/api_indy/indy/credential.py", line 779, in populate_application_database
    credential_type.save()
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/transaction.py", line 240, in __exit__
    connection.commit()
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/backends/base/base.py", line 262, in commit
    self._commit()
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/indy/.pyenv/versions/3.6.8/lib/python3.6/site-packages/django/db/backends/base/base.py", line 240, in _commit
    return self.connection.commit()
django.db.utils.OperationalError: deadlock detected
DETAIL:  Process 786 waits for ShareLock on transaction 9634347; blocked by process 787.
Process 787 waits for ShareLock on transaction 9634346; blocked by process 786.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (16370,72) in relation "topic"
SQL statement "SELECT 1 FROM ONLY "public"."topic" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

INFO 2019-06-18 15:35:14,966 helpers 1 139661072897792 172.51.20.1 [18/Jun/2019:15:35:13 +0000] "POST /api/v2/indy/store-credential HTTP/1.1" 500 360 "-" "Python/3.6 aiohttp/3.3.2"

BC Registries Agent error report:

[{'record_id': 5133415, 'system_type_cd': 'BC_REG', 'corp_num': '0895507', 'corp_state': 'ACT', 'prev_event': {'event_date': '2019-06-18T01:29:52+00:00', 'event_id': 15142985}, 'last_event': {'event_date': '2019-06-18T01:59:44+00:00', 'event_id': 15143003}, 'credential_type_cd': 'REL', 'credential_id': 'BC0895507', 'schema_name': 'relationship.registries.ca', 'schema_version': '1.0.42', 'credential_json': {'associated_registration_id': 'FM0622475', 'effective_date': '2019-06-18T01:32:50+00:00', 'registration_id': 'BC0895507', 'relationship': 'Owns', 'relationship_description': 'Does Business As', 'relationship_status': 'ACT', 'relationship_status_effective': '2019-06-18T01:32:50+00:00'}, 'credential_hash': '9d6945fb099205cbb6e965f5c695d40cd07d1280d4ef0b5e10fc1ea9585cd342', 'credential_reason': 'Filing:CHGSP', 'entry_date': datetime.datetime(2019, 6, 18, 2, 0, 53, 356325), 'end_date': None, 'process_date': datetime.datetime(2019, 6, 18, 15, 35, 15, 7543), 'process_success': 'N', 'process_msg': 'Bad response from post_json: (500) <html><head><title>500 Internal Server Error</title></head><body><h1>500 Internal Server Error</h1>Server got itself in trouble</body></html>'}, {'record_id': 5133414, 'system_type_cd': 'BC_REG', 'corp_num': '0895507', 'corp_state': 'ACT', 'prev_event': {'event_date': '2019-06-18T01:29:52+00:00', 'event_id': 15142985}, 'last_event': {'event_date': '2019-06-18T01:59:44+00:00', 'event_id': 15143003}, 'credential_type_cd': 'REL', 'credential_id': 'BC0895507', 'schema_name': 'relationship.registries.ca', 'schema_version': '1.0.42', 'credential_json': {'associated_registration_id': 'FM0622475', 'effective_date': '2019-04-30T22:19:43+00:00', 'expiry_date': '2019-06-18T01:32:50+00:00', 'registration_id': 'BC0895507', 'relationship': 'Owns', 'relationship_description': 'Does Business As', 'relationship_status': 'ACT', 'relationship_status_effective': '2019-04-30T22:19:43+00:00'}, 'credential_hash': '756759ec6fdd5d476ee9e8df6c66323ded9ff95803d7914769789c1041542f7e', 'credential_reason': 'Filing:CHGSP', 'entry_date': datetime.datetime(2019, 6, 18, 2, 0, 53, 353150), 'end_date': None, 'process_date': datetime.datetime(2019, 6, 18, 15, 35, 15, 3934), 'process_success': 'N', 'process_msg': 'Bad response from post_json: (500) <html><head><title>500 Internal Server Error</title></head><body><h1>500 Internal Server Error</h1>Server got itself in trouble</body></html>'}, {'record_id': 5133411, 'system_type_cd': 'BC_REG', 'corp_num': 'FM0622475', 'corp_state': 'ACT', 'prev_event': {'event_date': '2019-06-18T01:29:52+00:00', 'event_id': 15142985}, 'last_event': {'event_date': '2019-06-18T01:59:44+00:00', 'event_id': 15143003}, 'credential_type_cd': 'REL', 'credential_id': 'FM0622475', 'schema_name': 'relationship.registries.ca', 'schema_version': '1.0.42', 'credential_json': {'associated_registration_id': 'BC0895507', 'associated_registration_name': 'AUTOFORM PERFORMANCE INC.', 'effective_date': '2019-06-18T01:32:50+00:00', 'registration_id': 'FM0622475', 'relationship': 'IsOwned', 'relationship_description': 'Is Owned By', 'relationship_status': 'ACT', 'relationship_status_effective': '2019-06-18T01:32:50+00:00'}, 'credential_hash': '12efa5c86e223ace4ff39eb770f22a687893a73c92bb22e1adcd7e0ed4e75a84', 'credential_reason': 'Filing:CHGSP', 'entry_date': datetime.datetime(2019, 6, 18, 2, 0, 53, 280179), 'end_date': None, 'process_date': datetime.datetime(2019, 6, 18, 15, 35, 14, 999898), 'process_success': 'N', 'process_msg': 'Bad response from post_json: (500) <html><head><title>500 Internal Server Error</title></head><body><h1>500 Internal Server Error</h1>Server got itself in trouble</body></html>'}, {'record_id': 5133410, 'system_type_cd': 'BC_REG', 'corp_num': 'FM0622475', 'corp_state': 'ACT', 'prev_event': {'event_date': '2019-06-18T01:29:52+00:00', 'event_id': 15142985}, 'last_event': {'event_date': '2019-06-18T01:59:44+00:00', 'event_id': 15143003}, 'credential_type_cd': 'REL', 'credential_id': 'FM0622475', 'schema_name': 'relationship.registries.ca', 'schema_version': '1.0.42', 'credential_json': {'associated_registration_id': 'BC0895507', 'associated_registration_name': '0895507 B.C. LTD.', 'effective_date': '2019-04-30T22:19:43+00:00', 'expiry_date': '2019-06-18T01:32:50+00:00', 'registration_id': 'FM0622475', 'relationship': 'IsOwned', 'relationship_description': 'Is Owned By', 'relationship_status': 'ACT', 'relationship_status_effective': '2019-04-30T22:19:43+00:00'}, 'credential_hash': '95efbf435ac215ee673d6ba72c84173281726429c2a558a97d9b3e46e0df43aa', 'credential_reason': 'Filing:CHGSP', 'entry_date': datetime.datetime(2019, 6, 18, 2, 0, 53, 276591), 'end_date': None, 'process_date': datetime.datetime(2019, 6, 18, 15, 35, 14, 989886), 'process_success': 'N', 'process_msg': 'Bad response from post_json: (500) <html><head><title>500 Internal Server Error</title></head><body><h1>500 Internal Server Error</h1>Server got itself in trouble</body></html>'}]
WadeBarnes commented 5 years ago

@ianco, This appears to be happening fairly often. A few dozen records affected since last week.

ianco commented 5 years ago

@WadeBarnes Chatted with @nrempel about this, I think the move to the new agent will address this kind of problem.

With the current von-x code, the internal threading model combined with multiple processes deployed on openshift make this very difficult to debug. With the new agent design the separation of concerns will be a little bit cleaner and I think we can avoid this kind of issue.

So I'm not sure how much effort we want to expend trying to diagnose with the current codebase ...

WadeBarnes commented 4 years ago

Recent examples of records that cause the deadlock;

Examples (from credential_log):
 record_id | corp_num
-----------+-----------
   5195506 | FM0743811
   5195507 | FM0743811
   5195543 | 0964770
   5195544 | 0964770
   5195847 | FM0653232
   5195848 | FM0653232
   5195872 | FM0784088
   5195873 | FM0769586
   5195874 | FM0769586
   5195897 | FM0784089
   5195907 | FM0649949
   5195908 | FM0649949
   5195923 | FM0649943
   5195924 | FM0649943
   5195932 | FM0649946
   5195933 | FM0649946
   5195938 | C1218404
   5195939 | C1218404
   5195940 | C1218404
   5195941 | C1218404
   5195947 | FM0784092
   5196140 | FM0783805
   5196141 | FM0783805
   5196145 | FM0784112
   5196163 | 1136244
   5196164 | 1136244
WadeBarnes commented 4 years ago

sample_data.txt

The attached examples are from the schedule agent processing feeds. For this data to actually trigger the deadlock data for the corp may need to pre-exist in the TOB database.