bcgov / von

Verifiable Organizations Network
https://digital.gov.bc.ca/digital-trust

Pool Timeout error when pushing file to Tails Server #354

Closed esune closed 3 years ago

esune commented 4 years ago

When generating a new revocation registry and associated file, the agent uploads the tails file to the Tails Server after publishing the registry information to the ledger, so that it will be publicly available for verifiers to use.

During this flow, however, the Tails Server raised the following error at the time of the file upload. As a result, the tails file was not published to the Tails Server, and the revocation management workflow was interrupted before the second "rolling" revocation registry was created:

2020-09-01 16:51:02,036 indy_vdr.bindings WARNING Library not loaded from python package
2020-09-01 16:51:22,583 aiohttp.server ERROR Error handling request
Traceback (most recent call last):
  File "/home/indy/.local/lib/python3.7/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/home/indy/.local/lib/python3.7/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/home/indy/tails_server/web.py", line 72, in put_file
    genesis_txn_bytes, revocation_reg_id, storage_path
  File "/home/indy/tails_server/ledger.py", line 29, in get_rev_reg_def
    pool = await indy_vdr.open_pool(transactions_path=tmp_file.name)
  File "/home/indy/.local/lib/python3.7/site-packages/indy_vdr/pool.py", line 167, in open_pool
    await pool.refresh()
  File "/home/indy/.local/lib/python3.7/site-packages/indy_vdr/pool.py", line 52, in refresh
    await bindings.pool_refresh(self.handle)
indy_vdr.error.VdrError: Pool timeout

We need to understand why this happened and how to prevent it from happening again, because it is disruptive: issuers can end up in a state where only manual intervention would bring them back up and ready to issue/revoke credentials (this is yet to be tested).

esune commented 4 years ago

See this issue for further info: https://github.com/hyperledger/aries-cloudagent-python/issues/707

esune commented 4 years ago

The investigation of this issue turned toward making the agent processes more resilient, so that they can recover from situations like the one described above. The aca-py 0.5.5 release will include the logic supporting these improvements.
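For illustration only, one common way to make a step like the tails file upload more resilient is to retry it with exponential backoff. The following is a minimal sketch of that general technique, not the logic that shipped in aca-py 0.5.5; `retry_async` and its parameters are hypothetical names, and `operation` stands in for whatever coroutine performs the upload.

```python
import asyncio
import random


async def retry_async(operation, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Await operation() up to `attempts` times, backing off exponentially.

    Hypothetical helper: `operation` is any zero-argument coroutine function,
    for example one that uploads a tails file.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

A caller would wrap the upload step, for example `await retry_async(upload, attempts=5)`, and still needs to handle the exception raised once all attempts are exhausted.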

esune commented 4 years ago

@andrewwhitehead this ended up being an investigation/fix task on aca-py to make it more resilient. Do we need to log something to look into the tails server code as well?

If I remember correctly, you mentioned this was possibly caused by having to fetch the delta of transactions between what is reported in the genesis file and what is currently on the ledger.

andrewwhitehead commented 4 years ago

@esune There are changes that could be made to optimize the tails server; maybe it would make a good help-wanted issue. The biggest improvement would be caching the transactions that are retrieved after a pool refresh, keyed by the genesis file that prompted the connection. This only makes aca-py more resilient in that it reduces the time required to sync up with a specific ledger, though (after the first connection).
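As a rough sketch of the caching idea, assuming the refreshed pool transactions can be obtained as bytes for a given genesis file: the cache below is keyed by a SHA-256 hash of the genesis file contents, so only the first request for a given ledger pays for the full sync. `PoolTransactionCache` and `fetch_refreshed` are hypothetical names, not part of the tails server or indy_vdr.

```python
import asyncio
import hashlib


class PoolTransactionCache:
    """Cache refreshed pool transactions, keyed by a hash of the genesis file."""

    def __init__(self, fetch_refreshed):
        # fetch_refreshed: hypothetical async callable taking the genesis bytes
        # and returning the up-to-date pool transactions after a refresh.
        self._fetch = fetch_refreshed
        self._cache = {}
        self._locks = {}

    async def get(self, genesis_bytes):
        key = hashlib.sha256(genesis_bytes).hexdigest()
        if key in self._cache:
            return self._cache[key]
        # One lock per genesis file, so concurrent requests for the same
        # ledger trigger a single refresh instead of one each.
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key not in self._cache:
                self._cache[key] = await self._fetch(genesis_bytes)
        return self._cache[key]
```

This in-memory version never expires entries; a real implementation would likely need to persist the cache and decide when cached transactions are stale.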

esune commented 3 years ago

Posted issue in indy-tails-server repo: https://github.com/bcgov/indy-tails-server/issues/12