FirebirdSQL / firebird

Firebird server, client and tools
https://www.firebirdsql.org/
1.19k stars 205 forks source link

Is FB5 expected to be two times faster than FB3 at database restore? #7389

Closed gsbelarus closed 1 year ago

gsbelarus commented 1 year ago

I'm a bit of disappointed with FB 5 performance. Here are timings:

FB3 -- 11:44:00 FB5 -- 11:39:00

Size of files after restore:

FB3 -- 295,560,544,256 bytes FB5 -- 304,557,326,336 bytes

Both were set up as:

  1. Super server
  2. TempCacheLimit = 8192M
  3. FileSystemCacheSize = 70

FB5 parallel processing was set as:

  1. MaxParallelWorkers = 4
  2. ParallelWorkers = 2

The system is Windows 2019 server with 128 GB RAM, 2 TB NVME drive, 12 cores Intel i7 processor.

So, performance is the same in both cases. No acceleration as was promised.

hvlad commented 1 year ago

How do you run gbak, what command line ? With 12 cores CPU and fast SSD I'd recommend to set ParallelWorkers to 8-12 and adjust MaxParallelWorkers accordingly. Don't change FileSystemCacheSize, it makes no sense. Leave it set to 0.

gsbelarus commented 1 year ago

The command line:

gbak -r local_bk_file local_fdb_file -user SYSDBA -pas ... -se localhost/port:service_mgr

How about increasing TempCacheLimit to 16384M? For the first half of the process of indices creation I observed no temp sort files at all, then they started to appear, but with size 0.

Also, I observed decrease in the speed of read/write operations. The process starts with numbers like 25 MB/s for reading BK file and 35 MB/s for writing FDB file and then gradually slows down to 5 MB/s for reading and 8-10 MB/s for writing.

I will try again with parallel workers count modified and post here new data.

hvlad commented 1 year ago

You should read the docs about -PARALLEL switch of gbak.

gsbelarus commented 1 year ago

Uh, no... Who reads docs ))

By the way, why don't activate PARALLEL mode by default? What are cons?

hvlad commented 1 year ago

How about increasing TempCacheLimit to 16384M? For the first half of the process of indices creation I observed no temp sort files at all, then they started to appear, but with size 0.

I can't recomend any value for TempCacheLimit as I have no idea of how big sort files could be. "Zero" size of temp sort files you observe is because lazy updating of FS metadata by OS.

Also, I observed decrease in the speed of read/write operations. The process starts with numbers like 25 MB/s for reading BK file and 35 MB/s for writing FDB file and then gradually slows down to 5 MB/s for reading and 8-10 MB/s for writing.

About reads - looks like buffer (and FS) cache works. Less writing at the end of process could be because of indices requires much less space than tables while CPU "cost per page" is higher.

hvlad commented 1 year ago

By the way, why don't activate PARALLEL mode by default? What are cons?

New feature should be well tested by end users and enough feedback provided before we can make it default mode. Also, it is not easy to guess good default value for number of parallel workers in general case. With help of you and other customers we could change this before final release, of course ;)

gsbelarus commented 1 year ago

I can't recomend any value for TempCacheLimit as I have no idea of how big sort files could be. "Zero" size of temp sort files you observe is because lazy updating of FS metadata by OS.

Is there any correlation between temp sort file size and index size I can see in the database statistics? Large indices have around 1 GB each in the database.

And regarding temp cache's increasing granularity. Does 1M means that for 1GB index there would be 1000 of buffer increases?

dyemanov commented 1 year ago

Is there any correlation between temp sort file size and index size I can see in the database statistics? Large indices have around 1 GB each in the database.

There's a correlation, although it's not 1:1, compression should be also taken into account. Sort temp file is usually larger than the index.

And regarding temp cache's increasing granularity. Does 1M means that for 1GB index there would be 1000 of buffer increases?

Not necessarily. 1MB is the lowest allocation chunk but if the sorting module will request e.g. 100MB of continuous space, it will be extended in one step.

gsbelarus commented 1 year ago

Here we go!

6:35:00

With --parallel 8 switch and TempCacheLimit = 16384M. No temp sort files were created during the restore process.

It is 56% against FB3. So, the declaration of two times increase in speed is quite correct.

gsbelarus commented 1 year ago

Then I tried to check how the restored database would be archived with FB5 and got error:

gbak -b "....FDB" "....bk" -user sysdba -pas ... -se localhost/3056:service_mgr -parallel 8
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors

In the firebird.log file there is a record:

SERVER  Wed Nov 16 19:59:42 2022
    INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056
hvlad commented 1 year ago

Just two times faster with 8 workes ? Seems not much for me ;) What page cache value was set for DB ? Does you monitor CPU and IO load during restore ?

Then I tried to check how the restored database would be archived with FB5 and got error:

Looks like server crash :( Is it reproducible ? If yes, could you produce crash dump for me ?

gsbelarus commented 1 year ago

Just two times faster with 8 workes ? Seems not much for me ;) What page cache value was set for DB ? Does you monitor CPU and IO load during restore ?

Overall CPU load was 70-80% through the whole process.

Write speed was mostly in 25-35 MB/s range.

Then I tried to check how the restored database would be archived with FB5 and got error:

Looks like server crash :( Is it reproducible ? If yes, could you produce crash dump for me ?

I'll repeat the error and create a dump file.

gsbelarus commented 1 year ago

Set -par 16 switch. The process runs even faster, but again with error:

How to get dump? I have set flag in the registry:

[HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps] 
'DumpType'=dword:00000002

But there is no dump file on the disk created?

gsbelarus commented 1 year ago

Page size and other db file properties:

Database header page information:
        Flags                   0
        Generation              12751
        System Change Number    0
        Page size               8192
        ODS version             13.1
        Oldest transaction      12727
        Oldest active           12728
        Oldest snapshot         12728
        Next transaction        12761
        Sequence number         0
        Next attachment ID      51
        Implementation          HW=AMD/Intel/x64 little-endian OS=Windows CC=MSVC
        Shadow count            0
        Page buffers            2000
        Next header page        0
        Database dialect        3
        Creation date           Nov 16, 2022 11:57:55
        Attributes              force write

    Variable header data:
        Database GUID:  {FCF9A099-D30D-427E-A80A-B76151EB086C}
        Sweep interval:         0
        *END*
hvlad commented 1 year ago

How to get dump?

First, you must ensure that Firebird process was really crashed. To produce crash dumps on Windows you may follow this article:

https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps

or use SysInternals utility ProcDump, it have own instructions.

hvlad commented 1 year ago

Page buffers 2000

It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.

gsbelarus commented 1 year ago

Page buffers 2000

It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.

That is because Classic Server being used in production.

gsbelarus commented 1 year ago

If it helps, now the next two records appear in log file after gbak has ended with error:

SERVER  Wed Nov 16 21:29:51 2022
    INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056

SERVER  Wed Nov 16 21:29:51 2022
    XNET error: XNET server initialization failed. Probably another instance of server is already running.
    operating system directive CreateMutex failed
    Nie mozna utworzyc pliku, ktory juz istnieje. 
gsbelarus commented 1 year ago

My settings for registering crash dumps:

image

gbak has ended with error again, but no dump has been created.

I even has installed debug version of latest FB 5.

gsbelarus commented 1 year ago

are there any ideas how to ping that error?

asfernandes commented 1 year ago

Try set MaxStatementCacheSize = 0 config if you are having crash.

gsbelarus commented 1 year ago

Page buffers 2000

It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.

I set -buffers 4_000_000 and -par 16 and got 6:43:00. Even worse than with a small buffer. Maybe four millions too much... I will try once more with 100 000.

gsbelarus commented 1 year ago

Try set MaxStatementCacheSize = 0 config if you are having crash.

It didn't help. Still getting:

gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors

at the end of backup process.

AlexPeshkoff commented 1 year ago

On 11/17/22 11:49, Andrej Kirejeŭ wrote:

Still getting:

|gbak: ERROR:Error reading data from the connection. gbak:Exiting before completion due to errors |

at the end of backup process.

What does gfix say about newly restored DB?

gsbelarus commented 1 year ago

What does gfix say about newly restored DB?

I will check in 6 hours. Just started the restore process all over again with smaller buffer. Want to check if 6:30:00 record is breakable, or it is a limit for the system now.

gsbelarus commented 1 year ago

So, I did four runs. Seems, that timing almost independent on workers count and buffer size in the parallel mode.

parallel workers buffers time temp db folder
1 2000 11:39:00 NVMe NVMe
8 2000 6:35:00 NVMe NVMe
10 100000 6:55:00 NVMe NVMe
16 4000000 6:43:00 NVMe NVMe
10 2000 11:00:00 NVMe HDD
gsbelarus commented 1 year ago

What does gfix say about newly restored DB?

gfix tells everything is ok.

gsbelarus commented 1 year ago

If it helps, I ran gbak without -se and -par switches, and it has worked without error messages. So, it seems that -se switch is to blame. Somehow, gbak looses the connection to the server at the very end of the restore process.

AlexPeshkoff commented 1 year ago

On 11/18/22 12:53, Andrej Kirejeŭ wrote:

If it helps, I ran gbak without |-se| and |-par| switches, and it has worked without error messages. So, it seems that |-se| switch is to blame. Somehow, gbak looses the connection to the server at the very end of the restore process.

Please provide both command lines: for failure & success.

gsbelarus commented 1 year ago

failure:

gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr -par 10

success:

gbak -b localhost/3056:...fdb ...bk -user SYSDBA -pas ...
AlexPeshkoff commented 1 year ago

On 11/18/22 15:36, Andrej Kirejeŭ wrote:

failure:

|gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr -par 10 |

success:

|gbak -b localhost/3056:...fdb ...bk -user SYSDBA -pas ... |

Please try

gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr

i.e. w/o -PAR switch - I much doubt problem to be in -SE.

gsbelarus commented 1 year ago

Without -par there is still error there.

 0:25:03.02 C:\Program Files\FB5>gbak -b "....FDB" "....bk" -user sysdba -pas ... -se localhost/3056:service_mgr
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
gsbelarus commented 1 year ago

So, the conclusion is that the presence of -se switch with or without -par switch leads to the error message at the end of the backup process. Without the -se switch, the process ends successfully.

AlexPeshkoff commented 1 year ago

On 11/20/22 20:06, Andrej Kirejeŭ wrote:

So, the conclusion is that the presence of -se switch with or without -par switch leads to the error message at the end of the backup process. Without the -se switch, the process ends successfully.

Can you provide stack trace for a crash?

gsbelarus commented 1 year ago

If you tell me how to do it. Because, I doubt there is a crash at all. At least, no crash dump is being created. Seems that server closes socket earlier than gbak expect it to be closed.

hvlad commented 1 year ago

After two days with no electricity, I'm trying to reproduce it, please be patient.

BTW, could you reproduce it with smaller database, i.e. not 300GB ? Also, could you describe exact steps to reproduce, including environment details such as number of FB instances running and its config's ?

gsbelarus commented 1 year ago

two instances. FB3 on 3054 and FB5 on 3056. both firebird.conf files set to maximal compatibility with older version. i.e. LegacyAuth, LegacyUserManager etc.

I will try with smaller db. If it does matter this db has two warnings, two objects which uses ABS function from the standard UDF which isn't provided now by default.

hvlad commented 1 year ago

Here is my results. First I restored largest backup file I have in hand - its size is 60GB, restored DB size is 86GB. Win 10, CPU i7 12700 (8 perf cores + 4 efficient cores), RAM 32 GB. Backup and temp on SATA SSD, DB on NVME SSD. Must note that I set affinity to performance cores only to get best results, it is important.

firebird.conf changes was

TempCacheLimit = 1G
CpuAffinityMask = 65535
MaxParallelWorkers = 64
workers time, sec
1 3 819
4 1 706
8 1 310
16 1 251
32 1 322

Than I run backup of restored DB

workers time, sec
1 1 220
4 434
8 267
16 220
32 222
gsbelarus commented 1 year ago

looks huge. I will try again with CPU affinity mask set. By the way. I have 12 cores / 20 logical processors. What mask would be best? 2^12 - 1 or 2^ 20 - 1. I think, I should use only physical cores, not logical processors. But are they come first inthe affinity bit mask?

gsbelarus commented 1 year ago

Can you elaborate? You have set CpuAffinityMask to 65535. This is first 15 bits set. How come that you use only performant cores if you have 12 cores in total? Or do you mean first 8 cores * 2 = 16 with hyperthreading turned on?

hvlad commented 1 year ago

If it helps, now the next two records appear in log file after gbak has ended with error:

SERVER    Wed Nov 16 21:29:51 2022
  INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056

SERVER    Wed Nov 16 21:29:51 2022
  XNET error: XNET server initialization failed. Probably another instance of server is already running.
  operating system directive CreateMutex failed
  Nie mozna utworzyc pliku, ktory juz istnieje. 

The second message could be put into log only when Firebird process is started. It is expected when you run two Firebird instances with the same IPCName - second instance can't create required named kernel objects as it was already created by the first one. So, if this message was put really after gbak finishes, it means Firebird process was crashed and restarted (by the service manager or guardian). And, if WER is tuned correctly, there must be memory dump somewhere. And there must be same message at the time Firebird was started initially, i.e. earlier.

So far it looks very confusing and I still have no all details necessary to explain the issue.

hvlad commented 1 year ago

Can you elaborate? You have set CpuAffinityMask to 65535. This is first 15 bits set. How come that you use only performant cores if you have 12 cores in total? Or do you mean first 8 cores * 2 = 16 with hyperthreading turned on?

65535 is 16 bits, not 15 :) And, yes, it is 8 perf cores + 8 HT cores.

dyemanov commented 1 year ago

Any update on this issue? I believe the parallel performance is no longer a question, but I'm still worried about possible crashes mentioned.

gsbelarus commented 1 year ago

I'll recheck with the latest version.

gsbelarus commented 1 year ago

Any update on this issue? I believe the parallel performance is no longer a question, but I'm still worried about possible crashes mentioned.

Updated server to the latest build. Left security5.fdb from previous try. Now, I'm getting an error at the very start of restore process:

LEPEL   Fri Jan 27 23:43:06 2023
    XNET error: XNET server initialization failed. Probably another instance of server is already running.
    operating system directive CreateMutex failed
    Nie mozna utworzyc pliku, ktory juz istnieje. 

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)

LEPEL   Fri Jan 27 23:44:58 2023
    Authentication error
    cannot attach to password database
    Error in isc_attach_database() API call when working with legacy security database
    internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
gsbelarus commented 1 year ago

The error remains with the latest FB build.

gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors

I have tried with -zip key and without it. The result is the same.

firebird log contains the following lines:

LEPEL   Sat Jan 28 23:58:12 2023
    INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056

LEPEL   Sat Jan 28 23:58:12 2023
    XNET error: XNET server initialization failed. Probably another instance of server is already running.
    operating system directive CreateMutex failed
    Nie mozna utworzyc pliku, ktory juz istnieje. 

There is no another instance of FB 5 server. There are FB25 and FB3 but they are under different port numbers and use different service names.

hvlad commented 1 year ago

About XNET error: please, read again my comment above and change IPCName (or stop another FB instances)

gsbelarus commented 1 year ago

I have stopped all FB instances besides FB5 and still get this error.

gsbelarus commented 1 year ago

Well, I have even stopped FB 5 and try just bare gbak. Now it falls with:

gbak: ERROR:I/O error during "CreateFile (create)" operation for file ""
gbak: ERROR:    Error while trying to create file
gbak: ERROR:    The system cannot find the path specified.
gbak: ERROR:gds_$start_request failed
gbak:Exiting before completion due to errors
hvlad commented 1 year ago

Afraid, I can do nothing without steps to reproduce error by gbak. Please, create new ticket with as much info as possible.

I consider the original issue with this ticket resolved.