WebOfTrust / keria

KERI Agent in the cloud
https://keria.readthedocs.io/en/latest/
Apache License 2.0
14 stars 26 forks source link

Intermittent `lmdb.DiskError No space left on device` #182

Open nkongsuwan opened 5 months ago

nkongsuwan commented 5 months ago

I found lmdb.DiskError that seems to appear intermittently while I was working with multi-signature AIDs. I could not figure out a script to reproduce this error reliably. Here is the stack trace.

Note: My .keri directory is only 29M, and my laptop still has plenty of disk space left (82 GB to be precise). This problem (temporarily?) goes away after I restart my laptop.

The Agency is loaded and waiting for requests...
  Agent: ECwUDnrtD9jG1uUlPg3F3RrtpxVz5P6n6GrMTCEmwiVp   Controller: EDTlVxcK_Db1eF2IvMj6OxO8V7hvuoC4ZPRl2lxZoNI_
  Agent: EMgJCqrFxQF6WX8bTahRYB40POnah1uDBezp3-o92q2r   Controller: EKXJAJYZphiawYR_4SzgBjOdOpOwXs2zWgF680gwYkpv
2024-02-04 09:29:55 [FALCON] [ERROR] POST /boot => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/agenting.py", line 825, in on_post
    agent = self.agency.create(caid=caid)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/agenting.py", line 200, in create
    agentHab = agentHby.makeHab(f"agent-{caid}", ns="agent", transferable=True, delpre=caid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/app/habbing.py", line 432, in makeHab
    hab.make(**kwa)
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/app/habbing.py", line 2267, in make
    verfers, digers = self.mgr.incept(icount=icount,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/app/keeping.py", line 1033, in incept
    raise ValueError("Already incepted pre={}.".format(pre.decode("utf-8")))
ValueError: Already incepted pre=DL-TqR3J37pZtQmUJfxfwB6FzlGOl6Y-4uq-xfyU3lrW.

2024-02-04 09:29:55 [FALCON] [ERROR] GET /agent/EAvUaZGWb3e4V41_9Nuo4CEUI8AKRRcrp9qnQTJD_soR => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/aiding.py", line 88, in on_get
    agent = self.agency.get(caid)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/agenting.py", line 238, in get
    ks = keeping.Keeper(name=caid,
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/app/keeping.py", line 253, in __init__
    super(Keeper, self).__init__(headDirPath=headDirPath, perm=perm,
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/db/dbing.py", line 349, in __init__
    super(LMDBer, self).__init__(**kwa)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/hio/base/filing.py", line 150, in __init__
    self.reopen(clear=clear, reuse=reuse, clean=clean, **kwa)
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/app/keeping.py", line 260, in reopen
    opened = super(Keeper, self).reopen(**kwa)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keripy/src/keri/db/dbing.py", line 383, in reopen
    self.env = lmdb.open(self.path, max_dbs=self.MaxNamedDBs, map_size=104857600,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lmdb.DiskError: /Users/nkongsuwan/.keri/ks/EAvUaZGWb3e4V41_9Nuo4CEUI8AKRRcrp9qnQTJD_soR: No space left on device
nkongsuwan commented 5 months ago

Another variant of the error. lmdb.BadRslotError: mdb_txn_renew: MDB_BAD_RSLOT: Invalid reuse of reader locktable slot. I think I get these errors when I repeatedly clear .keri directory and rerun keria and witnesses.

2024-02-06 09:03:00 [FALCON] [ERROR] POST /boot => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/agenting.py", line 825, in on_post
    agent = self.agency.create(caid=caid)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/agenting.py", line 200, in create
    agentHab = agentHby.makeHab(f"agent-{caid}", ns="agent", transferable=True, delpre=caid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/habbing.py", line 432, in makeHab
    hab.make(**kwa)
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/habbing.py", line 2267, in make
    verfers, digers = self.mgr.incept(icount=icount,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/keeping.py", line 1033, in incept
    raise ValueError("Already incepted pre={}.".format(pre.decode("utf-8")))
ValueError: Already incepted pre=DL-TqR3J37pZtQmUJfxfwB6FzlGOl6Y-4uq-xfyU3lrW.

  Agent: EOR3-BfyZ_2K5nlX_0StykwsMe49Ucgi_C8wJG9fSFOO   Controller: EA53LajRFsNMuIeB3opuTDyZkaERWa97VHygycbtEjSJ
2024-02-06 09:03:00 [FALCON] [ERROR] PUT /agent/EA53LajRFsNMuIeB3opuTDyZkaERWa97VHygycbtEjSJ?type=ixn => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/aiding.py", line 136, in on_put
    ixn = self.interact(req, rep, agent, caid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/app/aiding.py", line 218, in interact
    agent.agentHab.kvy.processEvent(serder=ixn, sigers=sigers)
  File "/opt/homebrew/lib/python3.11/site-packages/keri/core/eventing.py", line 3257, in processEvent
    eserder = self.fetchEstEvent(pre, sn)  # latest est event wrt sn
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/core/eventing.py", line 4189, in fetchEstEvent
    dig = bytes(self.db.getKeLast(key=snKey(pre, sn)))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/db/basing.py", line 2285, in getKeLast
    return self.getIoValLast(self.kels, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/db/dbing.py", line 1473, in getIoValLast
    with self.env.begin(db=db, write=False, buffers=True) as txn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lmdb.BadRslotError: mdb_txn_renew: MDB_BAD_RSLOT: Invalid reuse of reader locktable slot

2024-02-06 09:03:01 [FALCON] [ERROR] POST /identifiers => Traceback (most recent call last):
  File "falcon/app.py", line 375, in falcon.app.App.__call__
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/core/authing.py", line 208, in process_response
    headers = self.authn.sign(agent, Hict(rep.headers), req.method, req.path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nkongsuwan/Workspace/WebOfTrust/keria/src/keria/core/authing.py", line 130, in sign
    header, qsig = ending.siginput("signify", method, path, headers, fields=fields, hab=agent.agentHab,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/end/ending.py", line 340, in siginput
    sigers = hab.sign(ser=ser,
             ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/habbing.py", line 1335, in sign
    return self.mgr.sign(ser=ser,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/keeping.py", line 1346, in sign
    if self.aeid and not self.decrypter:
       ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/app/keeping.py", line 839, in aeid
    return self.ks.gbls.get('aeid')
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/db/subing.py", line 258, in get
    val = self.db.getVal(db=self.sdb, key=self._tokey(keys))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/keri/db/dbing.py", line 456, in getVal
    with self.env.begin(db=db, write=False, buffers=True) as txn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lmdb.BadRslotError: mdb_txn_renew: MDB_BAD_RSLOT: Invalid reuse of reader locktable slot
psteniusubi commented 4 months ago

Hi @nkongsuwan. I have not seen any lmdb issues in my test setup in a long time. When I did, I was usually able to trace things back to my own mistakes and there being mulitple processes running accessing the same lmdb files.

iFergal commented 3 months ago

I am seeing a different but similar error on an M3 Macbook (no disk space issues). This happens quite a bit but doesn't happen on Ubuntu instances I have in the cloud.

2024-03-22 08:20:01 [FALCON] [ERROR] POST /boot => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/src/keria/app/agenting.py", line 826, in on_post
    agent = self.agency.create(caid=caid)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/src/keria/app/agenting.py", line 177, in create
    ks = keeping.Keeper(name=caid,
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/venv/lib/python3.12/site-packages/keri/app/keeping.py", line 253, in __init__
    super(Keeper, self).__init__(headDirPath=headDirPath, perm=perm,
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/venv/lib/python3.12/site-packages/keri/db/dbing.py", line 350, in __init__
    super(LMDBer, self).__init__(**kwa)
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/venv/lib/python3.12/site-packages/hio/base/filing.py", line 150, in __init__
    self.reopen(clear=clear, reuse=reuse, clean=clean, **kwa)
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/venv/lib/python3.12/site-packages/keri/app/keeping.py", line 260, in reopen
    opened = super(Keeper, self).reopen(**kwa)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fergaloconnor/dev/keri/cardano-foundation/keria/venv/lib/python3.12/site-packages/keri/db/dbing.py", line 384, in reopen
    self.env = lmdb.open(self.path, max_dbs=self.MaxNamedDBs, map_size=104857600,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lmdb.Error: /Users/fergaloconnor/.keri/ks/EIMgR8Lc8jH9Ieubc0gOCLSYh8N66R_4Bwy0lmZ0XI8f: Too many open files

Possibly an issue with too many file descriptions, maybe from not being cleaned up/closed in keripy.

It's pretty random on when it happens - e.g. creating AID, OOBI exchange, some IPEX - is mostly what I've been doing lately.

SmithSamuelM commented 3 months ago

You might need to change the maxfiles limit for your macos.

https://wilsonmar.github.io/maximum-limits/

The default sof limit on macos is only 256. Not sure how the soft limit is adaptively adjusted but some suggest explicitly setting both the soft and hard limits The default hard limit is unlimited but when you set the limits you have to set the hard limit to some number not unlimited which maximum is 1048576 i believe

iFergal commented 3 months ago

Thanks! Trying with 65536 as the soft limit and will see if it occurs again.

2byrds commented 3 weeks ago

Frustrating error! Thank you @SmithSamuelM for posting the fix. @iFergal or @nkongsuwan I don't how it will go in the long-term but it seems to work for me in the short-term. How about you?

iFergal commented 3 weeks ago

@2byrds This resolved the Too many open files issue - I haven't seen it since.

I recently saw the No space left on device issue when creating new Signify clients in a for loop on my local Macbook (I was doing some testing on the high CPU usage issue - happening after only a few hundred clients, more testing to do there). (and I had lots more space on my disk)

It otherwise hasn't happened me since I usually delete ~/.keri frequently locally - on any Ubuntu cloud VMs I have I've never seen it.