CoreyKaylor / Lightning.NET

.NET library for LMDB key-value store

Database growing #32

Closed runaum closed 9 years ago

runaum commented 10 years ago

I think it would be a good idea to return a value from the transaction's Put method (and maybe from some others too) so that the caller can check for MDB_MAP_FULL and resize the database when needed. Or do it directly in the wrapper (as an option, of course). What do you think?

I'm sorry for my primitive English.
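A hypothetical shape of such an API (TryPut and PutOutcome are illustrative names, not part of Lightning.NET; the sketch assumes the current LightningTransaction.Put/LightningException surface and simply converts a map-full failure into a return value):

using LightningDB;

public enum PutOutcome { Success, MapFull }

public static class TransactionExtensions
{
    // Illustrative only: surface MDB_MAP_FULL as a return value so the caller can
    // decide to resize the environment and retry, instead of handling an exception.
    public static PutOutcome TryPut(this LightningTransaction tx, LightningDatabase db,
                                    byte[] key, byte[] value)
    {
        try
        {
            tx.Put(db, key, value);
            return PutOutcome.Success;
        }
        catch (LightningException ex) when (ex.Message.Contains("MDB_MAP_FULL"))
        {
            return PutOutcome.MapFull;
        }
    }
}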

ronnieoverby commented 9 years ago

+1

ilyalukyanov commented 9 years ago

Hi guys,

I'll have a look at this in a couple of days. It requires some investigation.

ilyalukyanov commented 9 years ago

I've done some research and here is what I've found. We can check for the MDB_MAP_FULL status of mdb_put & mdb_cursor_put, but there is a limitation: the environment map size can only be changed when there are no open write transactions.

This means that we should either abort all active transactions and then do the resizing, or block any new transactions until all active transactions abort/commit, do the resizing, then unblock.

Both options look awkward and inappropriate in real-world applications. Any thoughts?

ilyalukyanov commented 9 years ago

The only solution I see for this now is similar to what Postfix does (http://fossies.org/linux/misc/postfix-2.11.3.tar.gz/postfix-2.11.3/src/util/slmdb.c).

The idea is to implement a "simplified" subset of the functionality, i.e. single-transaction CRUD operations, which could be useful in itself; however, this approach still requires some locking.

buybackoff commented 9 years ago

I was writing a reply to your email and a simple suggestion about the EnsureCapacity() pattern from List<T> turned into almost pseudo-code; here it is for reference:

We could use the standard design pattern of List<> in .NET: before every write, EnsureCapacity() is called, and if there is no more room, the internal array is grown by a factor of (1 + grow_factor).

This check could be done inside every write operation.

We need to read the C code to make sure that the MMF size doesn't break the transaction logic in any way, but in theory it isn't needed at all for modifying/reading the B-tree.

The documentation for mdb_env_set_mapsize says: "It may be called at later times if no transactions are active in this process. Note that the library does not check for this condition, the caller must ensure it explicitly."

Given LMDB's internals (a B+-tree of pages), we could safely grow the MMF size with open transactions, since this doesn't affect existing pages at all.

Further in the documentation: "The new size takes effect immediately for the current process but will not be persisted to any others until a write transaction has been committed by the current process. Also, only mapsize increases are persisted into the environment." -> A named lock would prevent other processes from increasing the size.

"If the mapsize is increased by another process, and data has grown beyond the range of the current mapsize, mdb_txn_begin() will return MDB_MAP_RESIZED. This function may be called with a size of zero to adopt the new size." We could add a try...catch to the transaction constructor and call mdb_env_set_mapsize(0).

The EnsureCapacity() function could be placed inside a try...catch around the write operation.

In theory all of this is simple and should be done in C; it's strange that this functionality doesn't exist as a feature, since in most cases people want exactly this pattern. Every place where an "out of space" error is returned should call EnsureCapacity(), and this functionality should be added as a flag to open_env().

I may be deeply mistaken in the logic above, so an outside view is needed.
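A rough C# sketch of that EnsureCapacity() idea (CapacityHelper, GrowFactor and the use of the UsedSize/MapSize properties are illustrative assumptions layered on the Lightning.NET surface discussed later in this thread, not existing library code):

using LightningDB;

internal static class CapacityHelper
{
    // Grow factor borrowed from the List<T> analogy: grow by (1 + GrowFactor) when full.
    private const double GrowFactor = 0.5;

    // Sketch: call this before a write, while no write transaction is active in this
    // process, as mdb_env_set_mapsize requires.
    public static void EnsureCapacity(LightningEnvironment env, long bytesAboutToWrite)
    {
        long used = env.UsedSize;   // assumed: pages in use * page size
        long free = env.MapSize - used;
        if (free < bytesAboutToWrite)
            env.MapSize = (long)(env.MapSize * (1 + GrowFactor));   // wraps mdb_env_set_mapsize
    }
}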

ilyalukyanov commented 9 years ago

As stated in the docs for mdb_env_set_mapsize:

Some possible errors are:

  • EINVAL - an invalid parameter was specified, or the environment has an active write transaction.

This matches my tests exactly. When I tried to call mdb_env_set_mapsize while a read-only transaction was active, everything was fine, but with a write transaction I always received EINVAL. Hence, we cannot increase the map size while there are active write transactions.

If we tried to implement an auto-growth behavior like the one used in the List implementation, we would have to do the following:

buybackoff commented 9 years ago

Could we store the state (position) of each opened cursor and behind the scenes close all transactions, resize, reopen transactions and set cursors back to their positions?

ilyalukyanov commented 9 years ago

@hyc Could you please advise on whether it's possible to increase map size automatically as overflow happens or not? If not, what is the best practice for handling MDB_MAP_FULL in multi-thread applications?

buybackoff commented 9 years ago

We could calculate the used space from MDB_stat: page size * (sum of all pages). We could check on startup whether the free size is less than X% and resize so that some Y% is available. For embedded DBs this could work well in most cases, since client apps are closed and opened regularly. For server apps the issue is mostly irrelevant, I believe (just set the size to the maximum possible).
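A sketch of that startup-time check, with X and Y as arbitrary placeholders (env, MapSize and UsedSize follow property names discussed later in this thread; this illustrates the heuristic only, it is not library code):

// Hypothetical startup check, run at open time before any write transaction exists:
// if less than X% of the map is free, grow it so that roughly Y% is free again.
static void EnsureStartupHeadroom(LightningEnvironment env)
{
    const double X = 0.10, Y = 0.25;   // arbitrary thresholds
    long mapSize = env.MapSize;
    long used = env.UsedSize;          // page size * pages in use
    if (mapSize - used < (long)(mapSize * X))
        env.MapSize = (long)(used / (1 - Y));   // new size leaving about Y% headroom
}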

ilyalukyanov commented 9 years ago

@buybackoff this wouldn't work either. Please consider the following pseudo-code example:

var tran = env.BeginTransaction();

var value = counter++;
DoSomething1();

var oldVal = tran.Get("key");
if (oldVal == null || SomeCondition(oldVal))
    tran.Put("key", value);

var value2 = DoSomething2();
tran.Put("key2", value2);

var value3 = DoSomething3();
DoSomething4();

tran.Put("key3", value3);

// ...

tran.Commit();

Imagine that the last Put caused the overflow.

Normally, an application cannot continue from the point that caused the overflow. It can only start the operation from the beginning.

Regarding your second comment, we cannot calculate the used space with MDB_stat, as we can obtain it only for a single database, and generally we don't know how many databases we have or what they are.

ilyalukyanov commented 9 years ago

If we had a way to know the currently used space, we could check and resize every time a write transaction is opened. However, locking is still an issue, and this doesn't cover long-lived transactions.

ilyalukyanov commented 9 years ago

The only solution I can think of right now is to introduce some limitations:

This would allow us to transparently resize and retry, but still doesn't help with cursors.

buybackoff commented 9 years ago

There is a function: int mdb_env_stat ( MDB_env * env, MDB_stat * stat ) that returns stats about the whole env, not a single DB. http://symas.com/mdb/doc/group__mdb.html#gaf881dca452050efbd434cd16e4bae255

ilyalukyanov commented 9 years ago

Oh, I somehow missed this one) Thanks)

buybackoff commented 9 years ago

We could implement the X%/Y% logic from above before opening a write transaction. Once in a while new transactions will have to wait for existing ones to finish - much better than closing/opening an app, if setting the size to the absolute possible maximum is not an option.

ilyalukyanov commented 9 years ago

I'm still in doubt about this solution, but I'll add to the discussion what I've just discovered. The problem is that mdb_env_stat provides stats only for committed entries. Hence, the only relevant moment to get these stats is when an environment is opened.

hyc commented 9 years ago

If we had a safe solution for multi-threaded operation, it would have already been implemented inside LMDB itself.

The biggest problem is that when you grow the map it is very likely that the entire mmap will move to a new location in the address space. That's just a function of the OS itself, modulo whatever other dynamically loaded objects and mmaps are active in the process at the time, and totally unpredictable. Naturally this means that if there were any other active threads operating in the environment at that time, they would all crash.

The only workable approach is to enforce that no other threads are active in the environment when changing the size.

re: "If we had a way to know currently used space, we could check and resize every time a write transaction is opened." - you need mdb_env_info(), as well as mdb_env_stat().

buybackoff commented 9 years ago

@ilyalukyanov On startup, the map size property of the environment is set as

if (LightningConfig.Environment.DefaultMapSize != LightningConfig.Environment.LibDefaultMapSize)
    this.MapSize = LightningConfig.Environment.DefaultMapSize;
else
    _mapSize = LightningConfig.Environment.LibDefaultMapSize;

While in config: DefaultMapSize = LibDefaultMapSize; where the former is a constant 2^20

So unless I am missing something, the property is currently not set to the actual size of an existing environment when the environment is created? It looks like we are missing a call to mdb_env_info() in the environment constructor.
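A minimal sketch of what that missing call might look like, mirroring the GetStat() helper quoted below (MDBEnvInfo and the mdb_env_info binding are assumptions; the thread only shows the mdb_env_stat wrapper):

// Assumed P/Invoke mirror of GetStat() for mdb_env_info.
private MDBEnvInfo GetEnvInfo()
{
    var info = new MDBEnvInfo();
    NativeMethods.Execute(lib => lib.mdb_env_info(_handle, out info));
    return info;
}

// In the environment constructor, after mdb_env_open succeeds, the real map size of
// an existing environment could be adopted instead of assuming the library default:
// _mapSize = GetEnvInfo().me_mapsize.ToInt64();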

buybackoff commented 9 years ago

One more thing:

mdb_env_stat returns stats only for the main DB, not a combined stat for the environment (the sum of pages for all DBs - main + named), and I do not yet understand how to get the list of all named DBs.

@hyc is this expected? The docs for mdb_env_stat say:

Return statistics about the LMDB environment.

But, as tests show, this is for the main DB only.

In the Lightning.NET environment, the UsedSize property uses this call and will miss pages from named DBs:

private MDBStat GetStat()
{
    //EnsureOpened();
    var stat = new MDBStat();
    NativeMethods.Execute(lib => lib.mdb_env_stat(_handle, out stat));
    return stat;
}

I tried this test with named DBs and 1000 entries and added _env.Flush(true) to ensure that data and metadata are written, but `_env.UsedSize` returns a single-page value of 4096. https://github.com/ilyalukyanov/Lightning.NET/blob/master/src/LightningDB.Tests/EnvironmentTests.cs#L147

hyc commented 9 years ago

Yes, this is expected. That's why the env_info call was added.

buybackoff commented 9 years ago

@hyc but how do I get the total used pages from MDB_envinfo?

In the docs it is defined as:

Data Fields

void * me_mapaddr
size_t me_mapsize
size_t me_last_pgno
size_t me_last_txnid
unsigned int me_maxreaders
unsigned int me_numreaders

I guess from the name that me_last_pgno includes free pages (ones that were previously used) as well?

hyc commented 9 years ago

Yes, use me_last_pgno. You'll have to subtract the number of free pages. See the mdb_stat.c source code.
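A rough C# sketch of that arithmetic (GetEnvInfo, PageSize and me_last_pgno appear in snippets elsewhere in this thread; GetFreePageCount is hypothetical - mdb_stat.c derives the count by walking the FREE_DBI entries):

static long GetUsedBytes(LightningEnvironment environment)
{
    // me_last_pgno is the highest page number ever used (zero-based), so the number of
    // touched pages is me_last_pgno + 1; pages sitting on the freelist must be subtracted.
    var info = environment.GetEnvInfo();               // wraps mdb_env_info
    long touchedPages = info.me_last_pgno.ToInt64() + 1;
    long freePages = GetFreePageCount(environment);    // hypothetical: sum the FREE_DBI entries
    return (touchedPages - freePages) * environment.PageSize;
}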

buybackoff commented 9 years ago

@ilyalukyanov What I did:

First, a named auto-reset event as an environment member:

/// <summary>
/// AutoReset named wait handle to ensure a single writer across processes
/// </summary>
internal EventWaitHandle WrtTxnGate;

.... in ctor:
// created initially signaled (open)
WrtTxnGate = new EventWaitHandle(true, EventResetMode.AutoReset,
    (Directory + "_wrtTxnGate").CalculateMD5Hash()); // hash to avoid length and special-character problems

Then, in transaction manager:

public async Task<Transaction> CreateAsync(TransactionBeginFlags beginFlags)
{
    EnsureEnvironmentOpened();
    IntPtr handle = default(IntPtr);
    if (!beginFlags.HasFlag(TransactionBeginFlags.ReadOnly))
    {
        await _environment.WrtTxnGate.WaitOneAsync();   // WRITERS WAIT HERE
        // Writers wait here; past the gate there is only one writer,
        // even across processes.

        // TODO ensure capacity here - it is safe to call set map size now.
        // Start with an info call on every txn, optimize later.
        var dirtySize = _environment.GetEnvInfo().me_last_pgno.ToInt64() * _environment.PageSize;
        var mapSize = _environment.MapSize;
        if (mapSize - dirtySize < Config.Environment.MinimumCleanSize)
        {
            // On average the clean size stays between MinimumCleanSize and 2x of that,
            // sometimes below 1x, but only for a very short time.
            _environment.MapSize = mapSize + Config.Environment.MinimumCleanSize * 2;
        }
    }
    NativeMethods.Execute(lib => lib.mdb_txn_begin(_environment._envHandle,
        IntPtr.Zero, beginFlags, out handle)); //_parentHandle

    var tran = new Transaction(_environment, handle, beginFlags); //_parentTransaction,
    _transactions.Add(tran);
    return tran;
}

In Transaction.cs:

... Set ARE on Commit and Abort, e.g.
NativeMethods.Execute(lib => lib.mdb_txn_commit(_handle));
if (!this.IsReadOnly)
{
    this.Environment.WrtTxnGate.Set(); // ALLOW NEXT WRITER
}

In Config.Environment I set Config.Environment.MinimumCleanSize to the default DB size.

In my fork I force use of NoLock and NoTls in Environment constructor (BTW, you are missing several open flags).

openFlags = openFlags 
                | EnvironmentOpenFlags.NoTls 
                | EnvironmentOpenFlags.NoLock;

LMDB has a single writer, so I just do all the locking in .NET with the cross-process wait handle. Simple tests work and the size increases smoothly. Additional locking on top of the built-in one should not be a big problem for IO-bound operations, I guess.

Currently I store only large items (several KB) and haven't tested with many small transactions. To avoid querying stats on every txn, we could use a clean-space estimate that is reset to the true size after every resize and then atomically decremented by the size of new inserts multiplied by some factor, e.g. 1.2 (I tested this approach initially and it works in principle).

Currently nested transactions are not accounted for (I just removed that functionality in my fork), but I believe for nested ones we simply should not wait on the ARE and should turn off the resizing logic.

I could prepare a PR for upstream if you are OK with this?
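For reference, EventWaitHandle has no built-in WaitOneAsync; the snippet above presumably relies on an extension like the following minimal sketch (an assumption about the fork, not code from it), built on ThreadPool.RegisterWaitForSingleObject:

using System.Threading;
using System.Threading.Tasks;

internal static class WaitHandleExtensions
{
    // Await a WaitHandle without blocking a thread-pool thread.
    public static Task WaitOneAsync(this WaitHandle handle)
    {
        var tcs = new TaskCompletionSource<bool>();
        var registration = ThreadPool.RegisterWaitForSingleObject(
            handle,
            (state, timedOut) => ((TaskCompletionSource<bool>)state).TrySetResult(!timedOut),
            tcs,
            Timeout.Infinite,
            executeOnlyOnce: true);
        // Clean up the registration once the wait completes.
        tcs.Task.ContinueWith(_ => registration.Unregister(null), TaskScheduler.Default);
        return tcs.Task;
    }
}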

ilyalukyanov commented 9 years ago

@hyc thank you for information

@buybackoff 1) Regarding _mapSize = LightningConfig.Environment.LibDefaultMapSize; - we don't need a call there, as we already know that by default the environment's size is equal to LightningConfig.Environment.LibDefaultMapSize.

2) Regarding your solution for auto-growing, IMHO we sacrifice much more than we gain. Nested transactions and multithreading are much more useful than built-in auto-growing.

Thanks for pointing that out; it looks like the size-computing logic is incorrect and some env flags are missing. Will fix.

buybackoff commented 9 years ago

Regarding 2, nested txns don't work only in the pasted snippets: we need to check whether a txn is nested before waiting on the ARE and skip the wait if it is. Multi-threading is still there; we only do the locking ourselves before hitting the native lib. I will test with <int,int> to see whether the auto-reset event adds noticeable overhead for small values.

ilyalukyanov commented 9 years ago

Overflow can happen inside a nested transaction, so some handling should be done there as well - for instance, when we have a long-lived parent transaction and open many nested ones. It's even more complicated because we cannot resize by closing only the nested transaction; we need to close the root transaction and all its children.

I think this does take away multi-threading capabilities. One of the greatest features of lmdb is how concurrent access is organized. These changes make all writes exclusive - one thread at a time. I don't think this is worth it.

On the other hand, for some applications this would be enough. For those apps I'd rather create an add-on/wrapper/extension assembly that adds this functionality on top of the existing infrastructure (via extension methods, proxies, overriding etc.). When a developer needs it, they just add it.

I still have doubts about this solution. Ideally the check and resizing (at least by estimates) should be done with every Put. Otherwise this solution doesn't cover the case where we open a transaction and simply put megabytes of data inside. Even if we did it with every Put, it wouldn't be a clean solution, as it doesn't work when we put a single BLOB that is bigger than the environment.

We cannot gracefully deal with this natural limitation of lmdb because it's a low-level problem. In many cases we know the maximum space required; if it's unlimited, lmdb probably isn't an appropriate tool.

KarolKolenda commented 9 years ago

First of all, many thanks for fixing the concurrency issues - that was amazingly fast. Regarding this request, I totally agree with your suggestion: a new API. The current API is brilliant, low-level and follows the original API to a tee - and I believe it should stay that way.

I personally require a "growable" table, but I just rolled out a basic method where all DB operations must be provided as a repeatable callback. It is the client code's responsibility to make sure that the callback replays all DB operations in exactly the same order and without side effects. The whole callback is executed within a single transaction (which is automatically committed unless the client callback throws an exception), and if an error occurs because of lack of space, the wrapper enlarges the DB and executes the client callback again:

lmdb.Write((tx)=>{
    tx.Put("key", "value");
    tx.Delete("key");
});

Still no idea how to handle nested transactions, but I don't think that is a big deal in 90% of use cases. Just my two cents...
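A minimal sketch of such a retry wrapper (GrowableLmdb, Write, IsMapFull and the growth amount are illustrative assumptions, not KarolKolenda's actual code; it assumes the LightningEnvironment/LightningTransaction/LightningException types and that MapSize may be set while no write transaction is open, per the discussion above):

using System;
using LightningDB;

public static class GrowableLmdb
{
    // Runs the callback in a single transaction; on a map-full failure it grows the
    // map and replays the callback from scratch, so the callback must be repeatable
    // and side-effect free.
    public static void Write(LightningEnvironment env, long growBy, Action<LightningTransaction> work)
    {
        while (true)
        {
            try
            {
                using (var tx = env.BeginTransaction())
                {
                    work(tx);
                    tx.Commit();
                    return;
                }
            }
            catch (LightningException ex) when (IsMapFull(ex))
            {
                // The failed transaction has been disposed, so no write transaction
                // is open here and resizing is allowed.
                env.MapSize = env.MapSize + growBy;
            }
        }
    }

    private static bool IsMapFull(LightningException ex)
    {
        // Hypothetical check; LMDB reports this condition as MDB_MAP_FULL.
        return ex.Message.Contains("MDB_MAP_FULL");
    }
}

Usage would mirror the lmdb.Write example above, with the environment and growth amount passed in explicitly.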

ilyalukyanov commented 9 years ago

@KarolKolenda Yeah, this looks reasonable. This is my favorite solution so far.

I mentioned here a complete implementation of this approach, which is in C - https://github.com/ilyalukyanov/Lightning.NET/issues/32#issuecomment-68165492

What they do is support a reduced subset of lmdb's API and do all the handling inside their wrapper. If this is ever implemented in Lightning.NET, I think I'd go exactly the same way. It would probably be a separate NuGet package.

KarolKolenda commented 9 years ago

@ilyalukyanov

I mentioned here a complete implementation of this approach, which is in C

I've just had time to look at this simplified API and don't like it much. The biggest bummer for me is the complete lack of transactions: either every "put" or "get" is executed in a separate transaction, or we have to use a special "bulk" mode which shouldn't really be used except for large, database-locking batch uploads.

I don't mind a simplified API, but "hiding" transactions completely not only reduces functionality, it also has a heavy impact on performance.

Then again, everyone has different needs, and to be honest I won't even dare to suggest that my usage scenario is the most common one. Maybe a simple Dictionary API without transactions really is good enough for 90% of cases - but I am still weeping over the performance lost to one "put" per transaction.

CoreyKaylor commented 9 years ago

I think if there were a seamless way to auto-grow, lmdb would do it itself. They have advised against it in other threads, so I'm not sure why we would want this behavior in LightningDB either.

hyc commented 9 years ago

Ultimately it's a waste of time. Define a map as big as your free storage space and forget about it. Doing the work required to safely lock/grow/unlock across multiple processes will slow down all the normal cases as well, and doesn't actually save any disk space or memory. People hung up on maxsize are, to be blunt, clueless.

ilyalukyanov commented 9 years ago

I think it's time to close this one. Thanks to everyone who contributed to this conversation!

buybackoff commented 8 years ago

LMDB now supports file growing on Windows. See ITS#8324 and the py-lmdb discussion about this: https://github.com/dw/py-lmdb/issues/85#issuecomment-163934475

I have tested this from C# and do not see any major slowdown. To insert 10M <int,int> pairs, v0.9.14 takes a minimum of 3400 msec and the latest master a minimum of 3750 msec. This is not scientific, just the best result out of 10 runs; sometimes both times rise to 5000+ msec. On average the slowdown is visible but tolerable - from 2.9 Mops to 2.6 Mops (the absolute numbers are still awesome). With Append and NoSync I could get 3.45 Mops on the same test with the master build. Interestingly, with WriteMap the performance of master drops 3x to 10000 msec, or just 1 Mops, while for v0.9.14 WriteMap improves performance to 2350 msec, or 4.25 Mops.

The main gotcha is that MapSize must remain big; only the file itself grows as needed. I now set it to 50x my free space and it just works.
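A minimal sketch of that setup with a Lightning.NET-style API (the exact constructor/property names are assumptions based on this thread, and the 512 GB figure is arbitrary):

using LightningDB;

// Open the environment with a map size far larger than the data will ever need;
// with the file-growing behavior described above, the data file on disk still
// grows on demand rather than being preallocated to the full map size.
var env = new LightningEnvironment("path/to/db")
{
    MapSize = 512L * 1024 * 1024 * 1024   // 512 GB of reserved address space
};
env.Open();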