ERL-1360: Parallelise cold starting Mnesia #4367

Closed (by OTP-Maintainer, 1 year ago)

OTP-Maintainer commented 3 years ago

Original reporter: vans163
Affected version: OTP-23.1
Component: mnesia
Migrated from: https://bugs.erlang.org/browse/ERL-1360


Cold booting a large database on a 32-core/64-thread Ryzen. Only 1 core is used and the speed is about 100 MB of data loaded into memory every 2 seconds; the disk storage is an NVMe drive.

(attached screenshot: image-2020-09-24-10-05-49-385.png)

I think it should be possible to parallelise this somehow? 
OTP-Maintainer commented 3 years ago

dgud said:

You can try the configuration parameter "-mnesia no_table_loaders NUMBER"

Is this a single-node or a multi-node system? In a multi-node system it might have to wait for the other nodes;
it is really hard to say what is happening from a picture.

Though before tables are loaded, the log is processed, which must be done in order (sequentially).
You can try to reduce that time by keeping the log file small; there are config parameters for that too.
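
For reference, a minimal sketch in Elixir (matching the snippets further down) of setting these application parameters before mnesia is started; the values shown are placeholders, not recommendations:
{code}
    # A sketch only: run before :mnesia is started; the values are placeholders.
    :application.set_env(:mnesia, :no_table_loaders, 8)
    # The dump log thresholds control how much the log may grow between dumps
    # (defaults: 1000 writes / 180000 ms).
    :application.set_env(:mnesia, :dump_log_write_threshold, 1_000)
    :application.set_env(:mnesia, :dump_log_time_threshold, 60_000)
    :application.ensure_all_started(:mnesia)
{code}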
OTP-Maintainer commented 3 years ago

vans163 said:

Single node system.

"it is really hard to say what is happening from a picture." I can grab some extra parameters if theres a hint of what to measure.

no_table_loaders I think won't work because it's just 1 giant table.

Basically this.
{code:java}
    :mnesia.create_schema([:erlang.node()])
    :application.ensure_all_started(:mnesia)
    Db.create_table(Order)
    :mnesia.wait_for_tables(:mnesia.system_info(:local_tables), :infinity)
{code}
The Order table has 119601439 objects.
{code:java}
    :mnesia.create_table(Order,
      disc_copies: [node()],
      type: :ordered_set,
      attributes: [:uuid, :data]
    )
{code}
I don't think it's the log, because this table gets a lot of writes. The amount of time it takes to cold boot grows linearly with the size of the table (not with the 3-minute / 1000-write default log dump threshold). I think it's just the time it takes to read each object from disk, convert it to an Erlang term, and allocate memory in the process.
OTP-Maintainer commented 3 years ago

dgud said:

If it is only one table, I don't think I can do much.

{quote}
  I don't think its the log because this table gets a lot of writes.
{quote}
Which is why you would get the log files. 
But you can see that easily if you check the file sizes in the mnesia directory before you start.
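
For example, a minimal sketch in Elixir of listing those file sizes (assuming the directory mnesia reports is the one in use):
{code}
    # Sketch: list the files in the mnesia directory with their sizes before the load,
    # e.g. the .DCD (table dump), .DCL (table log) and LATEST.LOG (transaction log) files.
    dir = :mnesia.system_info(:directory) |> List.to_string()

    dir
    |> File.ls!()
    |> Enum.map(fn f -> {f, File.stat!(Path.join(dir, f)).size} end)
    |> Enum.sort_by(fn {_f, size} -> -size end)
{code}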
OTP-Maintainer commented 3 years ago

mikael pettersson said:

Large disc_copies tables are very difficult to optimize, because mnesia pretty much _has_ to scan them sequentially during start-up.  You could parallelize the binary_to_term and ets insertions during the DCD loading, but that will cost you in higher synchronization overhead.  And the DCL, too, needs to be scanned sequentially.

If you can, you may want to partition that table.

You also pay in "dump" times which can occur at inopportune moments.

Our solution was to migrate our large disc_copies tables to leveldb.  It isn't trivial (depending on how they're used), but for us it was a life-saver.  YMMV.
OTP-Maintainer commented 3 years ago

mikael pettersson said:

Another thing that can help is to pre-allocate RAM for the VM using supercarriers.
OTP-Maintainer commented 3 years ago

vans163 said:

The DCD is 10GB and the DCL is ~1GB at most. So the table itself is quite small.
{quote}
because mnesia pretty much has to scan them sequentially during start-up.
{quote}
I looked at the DCD format; it is pretty much a file header plus file metadata such as the node name. The strange thing is that the table object count is not present; I would expect that. What follows is a non-delimited sequence of Erlang terms, which does indeed give the impression that it has to be parsed in sequence, since we have no idea of the size of each term.

So some things come to mind for changing the DCD format: if each Erlang term were preceded by its size, say as a varint, it would be simple to break the file up into chunks. Just read sequential chunks of the file at a time; with each term delimited, binary_part each term out and push it to another pid for the binary_to_term conversion + insertion. Since all mnesia tables are hashtable-based, insertion order does not matter, and the OTP 22 optimisations for insertions on ordered_set seem to indicate that a 64-thread box can achieve 20M inserts per second.
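
A rough sketch of that idea in Elixir (illustration only: this is a hypothetical size-prefixed term stream, not the actual DCD format; a fixed 32-bit length prefix stands in for a varint, and write_terms/load_parallel are made-up names):
{code}
    # Hypothetical size-prefixed term stream (NOT the real DCD format): every term
    # is written with a fixed 32-bit byte-size prefix instead of a varint.
    write_terms = fn path, terms ->
      data =
        for t <- terms, into: <<>> do
          bin = :erlang.term_to_binary(t)
          <<byte_size(bin)::32, bin::binary>>
        end

      File.write!(path, data)
    end

    # Loader: split the delimited stream into term binaries, then decode and insert
    # them in parallel; insertion order does not matter for a set/ordered_set table.
    load_parallel = fn path, tab ->
      File.read!(path)
      |> Stream.unfold(fn
        <<size::32, bin::binary-size(size), rest::binary>> -> {bin, rest}
        <<>> -> nil
      end)
      |> Task.async_stream(fn bin -> :ets.insert(tab, :erlang.binary_to_term(bin)) end,
        max_concurrency: System.schedulers_online()
      )
      |> Stream.run()
    end

    # Example target table (plain ets here, not mnesia):
    # tab = :ets.new(:order_demo, [:ordered_set, :public, write_concurrency: true])
{code}
With the sizes known up front, the same stream could instead be cut into a handful of large chunks and each chunk handed to one worker, which avoids per-term message-passing overhead.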

Another idea that comes to mind without changing the DCD format: assume each mnesia record MUST be a tuple whose first element is the table name. We can then split the DCD, from where the first term starts, into equal chunks, one per online scheduler. In each pid, scan forwards until an Erlang term parses successfully and is a tuple whose first element is the table name, then scan backwards to make sure it is not wrapped inside another term. Once each boundary is determined, synchronise that with the worker pids and let each one read in parallel up to the next boundary.
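
A rough sketch of that boundary scan (illustration only: it assumes the records sit in one flat binary of consecutive term_to_binary-encoded tuples, ignores the real DCD framing, and skips the backwards check described above; find_boundary/chunk_starts are made-up names). It leans on binary_to_term/2 with the :used option, which tolerates trailing bytes and reports how many were consumed:
{code}
    # Sketch: find the first offset at or after `pos` in `bin` where a term decodes
    # successfully and looks like a record of table `tag` (a tuple tagged with the
    # table name). The backwards check against false positives is omitted here.
    find_boundary = fn bin, pos, tag ->
      Enum.find(pos..(byte_size(bin) - 1), fn i ->
        try do
          rest = binary_part(bin, i, byte_size(bin) - i)
          {term, _used} = :erlang.binary_to_term(rest, [:used])
          is_tuple(term) and tuple_size(term) > 0 and elem(term, 0) == tag
        rescue
          ArgumentError -> false
        end
      end)
    end

    # One starting boundary per online scheduler; each worker would then read from
    # its boundary up to the next one.
    chunk_starts = fn bin, tag ->
      n = System.schedulers_online()
      chunk = div(byte_size(bin), n)
      for i <- 0..(n - 1), do: find_boundary.(bin, i * chunk, tag)
    end
{code}
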
{quote}
Another thing that can help is to pre-allocate RAM for the VM using supercarriers.
{quote}
I haven't had much success with this on Linux (with hugepages); it seems to be only on FreeBSD that these supercarriers can actually be bound to hugepages/superpages.
dszoboszlay commented 3 years ago

@vans163 I tried once to make the Mnesia loader parallel, but the speed-up didn't seem significant enough. Nevertheless, you can try it with your DB; I rebased the code on master and pushed it to parallel-mnesia_log.

The DCD is easy to process with parallel workers; the big issue is with the DCL, because it contains operations that need to be applied sequentially.

By the way, my branch is not PR-ready; it is missing tests etc. Actually, I only checked that it still compiles after forward-porting it from OTP-18, so there are no guarantees that it would work at all. Sorry!

dgud commented 1 year ago

Will not do anything about this for now.

vans163 commented 1 year ago

> @vans163 I tried once to make the Mnesia loader parallel, but the speed-up didn't seem significant enough. Nevertheless, you can try it with your DB; I rebased the code on master and pushed it to parallel-mnesia_log.
>
> The DCD is easy to process with parallel workers; the big issue is with the DCL, because it contains operations that need to be applied sequentially.
>
> By the way, my branch is not PR-ready; it is missing tests etc. Actually, I only checked that it still compiles after forward-porting it from OTP-18, so there are no guarantees that it would work at all. Sorry!

We just dropped mnesia a while back and rolled our own rocksdb+ets hybrid implementation. Fixed the problem lol.