Closed warner closed 3 years ago
I'm still investigating what happened leading up to this point.
@JimLarson points out this might just be an LMDB overrun: http://www.lmdb.tech/doc/group__mdb.html#gaa2506ec8dab3d969b0e609cd82e619e5
The LMDB database size (basically `du -s data.mdb`, which isn't fooled by the sparseness of the database) grew to 2.15GB by the time it died, which might be the size we pre-configured the DB to be.
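To illustrate the sparse-file point, here is a minimal Python sketch that reports both the apparent size of a file (what `ls -l` shows) and the space actually allocated on disk (what `du` measures). It assumes a POSIX system where `os.stat` exposes `st_blocks` in 512-byte units; the helper names `file_sizes` and `make_sparse_file` are made up for this example and are not part of LMDB or of our code.

```python
import os
import tempfile


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for the file at `path`.

    apparent_bytes is the logical length of the file.
    allocated_bytes is the disk space actually backing it, which is
    what `du` reports. Assumes POSIX, where st_blocks counts
    512-byte units.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file of logical length `size` without writing any data.

    On filesystems that support sparse files, little or no disk space
    is allocated until pages are actually written.
    """
    with open(path, "wb") as f:
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for an LMDB data file; not a real database.
        path = os.path.join(tmp, "data.mdb")
        make_sparse_file(path, 2 * 1024 ** 3)
        apparent, allocated = file_sizes(path)
        print("apparent bytes: ", apparent)
        print("allocated bytes:", allocated)
```

On a filesystem with sparse-file support the allocated figure stays far below the apparent one, so the allocated figure is the better indicator of how full the database really is.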
@FUDCo does that ring a bell? I remember we talked about how the LMDB size was pre-selected, but we could probably grow it later.
We set the allocation to 2GB as a compromise for development because on Windows (or, rather, WSL) it actually allocates a file of the full size and so eats disk space. However, we can set the size arbitrarily large and we can increase the size between cranks.
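As a rough illustration of growing the map between cranks, the sketch below picks the next map size from the current usage. If the 2GB allocation was really 2 GiB, that is 2,147,483,648 bytes, or about 2.15 GB, which would match the size reported earlier in this thread. Everything here is an assumption made for the example: the name `next_map_size`, the 25% headroom, and the 1 GiB growth step are hypothetical, and the sketch does not call LMDB at all. In the LMDB C API the chosen value would be applied with `mdb_env_set_mapsize`, at a point where no transactions are active.

```python
GIB = 1024 ** 3


def next_map_size(used_bytes, map_size, headroom=0.25, step=GIB):
    """Return the map size to configure before the next crank.

    Grows `map_size` in whole `step` increments until at least a
    `headroom` fraction of the map is free. Returns `map_size`
    unchanged if there is already enough room.

    Hypothetical policy for illustration only.
    """
    if used_bytes < 0 or map_size <= 0 or step <= 0:
        raise ValueError("sizes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    new_size = map_size
    while used_bytes > new_size * (1 - headroom):
        new_size += step
    return new_size


if __name__ == "__main__":
    # Half-full 2 GiB map: nothing to do.
    print(next_map_size(1 * GIB, 2 * GIB) // GIB)
    # Map is completely full: grow by one step.
    print(next_map_size(2 * GIB, 2 * GIB) // GIB)
```

Checking this between cranks, rather than waiting for a write to fail, would keep the map from filling up in the middle of a crank.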
I think this is below the layers of abstraction that I deal with, so I'm going to remove myself, but please let me know if I can help at all.
Is there more work to do on this? Or would you like to close it, @FUDCo ?
I promoted the diagnosis to the title.
One of my issue maintenance habits is:
Sometimes you just see the same symptoms but you don't have a diagnosis yet. Keep it separate until you have a diagnosis. If the diagnosis is the same, close as dup. (Well, GitHub doesn't institutionalize "resolution: fixed" vs "resolution: duplicate" like Trac does... but it does have labels.)
Should be dealt with via moving transcripts out
The phase3 testnet halted this morning. In my monitoring node, the last thing I observe is a consensus failure:
This was preceded by the Zoe vat being terminated because of an allocation meter overrun:
etc.