erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

Is there any detail about the problems of lmdb-go? #7452

Closed lurais closed 1 year ago

lurais commented 1 year ago

Hello, I am very interested in the problems you described before, and I want to know how I can reproduce them, because I want to confirm whether these problems still exist in other versions of LMDB.

AskAlexSharov commented 1 year ago

I have no context; I don't know which problem you mean. You can read a bit about known LMDB problems here: https://github.com/erthink/libmdbx/blob/bb8f43181783686879219846d64379a04c1430e3/mdbx.h#L1197

lurais commented 1 year ago

Here is the detail: https://github.com/ledgerwatch/erigon/wiki/LMDB-freelist-illustrated-guide. Is this solved in the Go bindings but not in the C code? I am not sure how to reproduce it. Is there any detail about how to trigger it? I want to make sure that other versions of LMDB do not have this problem.

AskAlexSharov commented 1 year ago

It needs to be solved in C code (and our lmdb-go fork probably does carry the patch). But it's not 100% solved: many corner cases can still cause the freelist to grow or slow down. So it's not really solved in LMDB (in MDBX it's a bit better, but it still depends on DB size, amount of deletes, use case, etc.). A better mitigation is to increase the page size: fewer pages means less freelist maintenance cost.
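
The "fewer pages, less freelist cost" point is simple arithmetic. A minimal sketch (illustrative numbers, not measurements; `pages` is a hypothetical helper, not an LMDB API):

```go
package main

import "fmt"

// Freelist bookkeeping cost grows with the number of pages the DB
// manages. For a fixed DB size, quadrupling the page size cuts the
// page count (and thus the freelist's workload) by 4x.
func pages(dbSize, pageSize int64) int64 {
	return dbSize / pageSize
}

func main() {
	const dbSize = int64(1) << 40 // 1 TiB database
	fmt.Println(pages(dbSize, 4096))  // 4 KiB pages -> 268435456 pages
	fmt.Println(pages(dbSize, 16384)) // 16 KiB pages -> 67108864 pages
}
```

The trade-off is that larger pages waste more space for small records and amplify writes, so this helps most when records are large or the DB is much bigger than RAM.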

lurais commented 1 year ago

So why don't you finish it and push LMDB upstream to solve it in C code? Or can this problem not be completely solved in all cases? I wonder how I can reproduce it. Is it what is discussed here: https://www.mail-archive.com/openldap-bugs@openldap.org/msg03806.html?

AskAlexSharov commented 1 year ago
  1. We did patch the C code, but for our use case.
  2. It's not really solvable for all use cases, especially without breaking the freelist format.
  3. The main problems are: 3.1. a recursive dependency on itself - to update the freelist you need to update the freelist; 3.2. the freelist itself is big and may be evicted from the page cache; 3.3. page fragmentation is hard to deal with.
  4. Yes, https://www.mail-archive.com/openldap-bugs@openldap.org/msg03806.html is related.

lurais commented 1 year ago
  > 1. We did patch the C code, but for our use case.
  > 2. It's not really solvable for all use cases, especially without breaking the freelist format.
  > 3. The main problems are: 3.1. a recursive dependency on itself - to update the freelist you need to update the freelist; 3.2. the freelist itself is big and may be evicted from the page cache; 3.3. page fragmentation is hard to deal with.
  > 4. Yes, https://www.mail-archive.com/openldap-bugs@openldap.org/msg03806.html is related.

Hello, what is the status of the C code patch now? Has it been merged upstream? And does LMDB intend to solve this for all use cases, or do they consider it a non-issue because it appears in very few use cases?

AskAlexSharov commented 1 year ago

We didn't create a PR to upstream, because it's not easy. You can find the patch among the commits of Nov 10, 2020: https://github.com/ledgerwatch/lmdb-go/commits/master/lmdb/mdb.c

It's not really solvable for all use cases, especially without breaking the freelist format. Our use case is DB >> RAM, which is rare, and we sometimes use crypto-hashed keys, which is also rare (it updates many randomly distributed pages). The issue can also be partially mitigated by avoiding OverflowPages creation: by using values < 2Kb.
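
One way to apply the "values < 2Kb" advice is to split large values into chunks that stay inline on a B-tree leaf. A minimal sketch, assuming a 2 KB inline threshold for 4 KB pages (`chunkValue` and the key-suffix scheme are hypothetical, not part of lmdb-go; the real overflow threshold depends on page size and node headers):

```go
package main

import "fmt"

// maxInline is the assumed largest value size that avoids an
// overflow-page chain with 4 KiB pages, per the "values < 2Kb" advice.
const maxInline = 2048

// chunkValue splits a large value into sub-threshold pieces stored
// under index-suffixed keys, e.g. "acct/0", "acct/1", "acct/2".
func chunkValue(key string, val []byte) map[string][]byte {
	out := make(map[string][]byte)
	for i, n := 0, 0; i < len(val); i, n = i+maxInline, n+1 {
		end := i + maxInline
		if end > len(val) {
			end = len(val)
		}
		out[fmt.Sprintf("%s/%d", key, n)] = val[i:end]
	}
	return out
}

func main() {
	chunks := chunkValue("acct", make([]byte, 5000))
	fmt.Println(len(chunks)) // 5000 bytes -> 3 chunks
}
```

Reads then do a short range scan over the suffixed keys instead of one lookup; whether that trade is worth it depends on how often the large values are read.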

lurais commented 1 year ago

You mean that the DB size is much bigger than RAM, so the page cache cannot hold all the data, and data access needs to read or write the hard disk more often?

AskAlexSharov commented 1 year ago

there are many, many different problems:

lurais commented 1 year ago

Thanks for your attention. But in my tests, when I use LMDB, it costs much more time to read some keys once the size of the DB is bigger than the physical memory size. Have you met this problem? For example, when the physical memory size is 2G, reading some keys costs much more time once the data stored in LMDB exceeds 2G.

AskAlexSharov commented 1 year ago

LMDB is an mmap'ed file. You can read any article about "what is a page fault in mmap", e.g.: https://biriukov.dev/docs/page-cache/5-more-about-mmap-file-access/#what-is-a-page-fault

lurais commented 1 year ago

> LMDB is an mmap'ed file. You can read any article about "what is a page fault in mmap", e.g.: https://biriukov.dev/docs/page-cache/5-more-about-mmap-file-access/#what-is-a-page-fault

As I know, MDBX is a fork of LMDB, so how do you avoid this problem?

AskAlexSharov commented 1 year ago

no silver bullet

lurais commented 1 year ago

> no silver bullet

As I see it, if there is not too much data stored in LMDB, the query performance is better than LevelDB's. So do you just use little storage space, or do you think it doesn't matter?

AskAlexSharov commented 1 year ago

If in your use case Data << RAM, then probably everything will be "fast enough"; most of the edge-case problems will not happen in this use case. The main problem is that you don't say much about your use case (is it OLAP, OLTP, highly parallel writes, time series, ...?).

FYI: LevelDB doesn't support ACID transactions. Geth switched from LevelDB to PebbleDB, which also doesn't support transactions.

Let me quote mdbx maintainer:

> libmdbx is [B-tree](https://en.wikipedia.org/wiki/B-tree) based, with [ACID](https://en.wikipedia.org/wiki/ACID) via [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)+[COW](https://en.wikipedia.org/wiki/Copy-on-write)+[Shadow paging](https://en.wikipedia.org/wiki/Shadow_paging), without [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging), but mostly [non-blocking](https://en.wikipedia.org/wiki/Non-blocking_algorithm) for readers, etc.
> RocksDB is [LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree) based, with a set of features including compression...
> So, both are storage engines, but explaining all the differences is only a little easier than explaining how the universe works.

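
The MVCC+COW idea in that quote can be sketched in a few lines of plain Go (a toy, single-threaded stand-in; real MDBX does this with B-tree pages and shadow paging, and publishes versions atomically):

```go
package main

import "fmt"

// store is a toy copy-on-write key-value store: readers grab the
// current version pointer as a snapshot, while a writer builds a new
// version aside and publishes it, leaving old snapshots untouched.
type store struct {
	current map[string]string
}

// snapshot returns a consistent read view: it is never mutated.
func (s *store) snapshot() map[string]string {
	return s.current
}

// commit applies an update copy-on-write style: copy, mutate the
// copy, then publish. Readers holding old snapshots are unaffected.
func (s *store) commit(update func(map[string]string)) {
	next := make(map[string]string, len(s.current))
	for k, v := range s.current {
		next[k] = v // never mutate a published version
	}
	update(next)
	s.current = next // publish the new version
}

func main() {
	s := &store{current: map[string]string{"balance/A": "10"}}
	snap := s.snapshot() // reader's view before the write
	s.commit(func(m map[string]string) { m["balance/A"] = "9" })
	fmt.Println(snap["balance/A"], s.snapshot()["balance/A"]) // 10 9
}
```

This is why MDBX readers are mostly non-blocking: they never see a version being modified, only fully published ones.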
lurais commented 1 year ago

In fact, if we don't know the type of use case, can we determine it by event tracking, e.g. recording all read and write frequencies? As far as I know, the ETH use case should be relatively well defined, right? In addition, is the main reason to consider using MDBX its support for ACID? So, compared to geth, where is the necessity of ACID for erigon?

AskAlexSharov commented 1 year ago

"the eth use case should be relatively well defined, right?" - nope. Saving 1 new block every 5 seconds is not a problem at all, but then you need to execute it, store the MerkleTrie, store an inverted index for tx hashes, save the history of state changes and an inverted index over it, etc. Erigon also has dedicated indices for eth_getLogs and trace_* methods. And it needs to serve ~100 different RPC methods.

Humanity uses transactions to save itself from "database is in a broken state" issues: it's about data integrity. Erigon has > 50 tables in its DB. It also has > 5 databases. Just read some articles/books about database theory: transactions, isolation levels, durability, etc.

lurais commented 1 year ago

So, compared with geth, is it more important for erigon to guarantee ACID? As far as I know, geth does not support ACID, but there are not many cases of database corruption, and if corruption occurs the data can also be recovered through fast synchronization. So is it worth spending these costs to support ACID?

AskAlexSharov commented 1 year ago

in erigon we decided - yes.

lurais commented 1 year ago

> in erigon we decided - yes.

Is it because the data organization is different from geth's?

AskAlexSharov commented 1 year ago

> in erigon we decided - yes.
>
> Is it because the data organization is different from geth's?

Because I don't want to spend my time debugging all these classical "transfer 1 ETH from account A to account B; after the deduction from account A the app crashed; as a result 1 ETH is lost - nobody has it, nobody knows something is wrong, and the app just continues working without any warnings/errors" issues.

> If data corruption occurs, the data can also be recovered through fast synchronization

It's false:

This question is very old and has tons of articles on the internet - please google it. In this comment I described only the Atomicity/Consistency topics. But there is also Durability. And there is also Isolation: RPC users will see partial-commit results (invalid data) that are gone on the next RPC call (the first call will show the lost 1 ETH, the next will show 1 ETH on account B), and they will report all these unreproducible bugs to you. Enjoy.

You don't need transactions when your app has an insert-only, single-table workload. But an ETH client needs to handle re-orgs, and for 1 new block it writes: blocks, receipts, state, various indices/mappings, codes of smart contracts, the Merkle trie, etc.
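
The "transfer 1 ETH" hazard from the comments above can be sketched in plain Go (the in-memory `db`, the `crash` flag, and both transfer helpers are illustrative stand-ins for real storage and a real process crash, not erigon code):

```go
package main

import (
	"errors"
	"fmt"
)

// db is a toy key-value store of account balances.
type db struct {
	balances map[string]int
}

var errCrash = errors.New("simulated crash")

// transferNoTxn writes each balance separately. A crash between the
// two writes persists the deduction but loses the credit: 1 ETH is
// owned by nobody and no error marks the state as broken.
func (d *db) transferNoTxn(from, to string, amt int, crash bool) error {
	d.balances[from] -= amt
	if crash {
		return errCrash // deduction already visible, credit lost
	}
	d.balances[to] += amt
	return nil
}

// transferTxn stages both writes and applies them all-or-nothing,
// the guarantee an ACID transaction provides.
func (d *db) transferTxn(from, to string, amt int, crash bool) error {
	staged := map[string]int{
		from: d.balances[from] - amt,
		to:   d.balances[to] + amt,
	}
	if crash {
		return errCrash // nothing committed, state intact
	}
	for k, v := range staged {
		d.balances[k] = v
	}
	return nil
}

func main() {
	d := &db{balances: map[string]int{"A": 10, "B": 0}}
	_ = d.transferNoTxn("A", "B", 1, true)
	fmt.Println(d.balances["A"], d.balances["B"]) // 9 0: 1 ETH vanished

	d = &db{balances: map[string]int{"A": 10, "B": 0}}
	_ = d.transferTxn("A", "B", 1, true)
	fmt.Println(d.balances["A"], d.balances["B"]) // 10 0: untouched
}
```

A real database gets the same all-or-nothing behavior durably, across many tables, via commit protocols like LMDB's copy-on-write page tree; the staging map here only mimics the visible outcome.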