Closed gsbelarus closed 1 year ago
How do you run gbak, what command line ? With 12 cores CPU and fast SSD I'd recommend to set ParallelWorkers to 8-12 and adjust MaxParallelWorkers accordingly. Don't change FileSystemCacheSize, it makes no sense. Leave it set to 0.
The command line:
gbak -r local_bk_file local_fdb_file -user SYSDBA -pas ... -se localhost/port:service_mgr
How about increasing TempCacheLimit to 16384M? For the first half of the process of indices creation I observed no temp sort files at all, then they started to appear, but with size 0.
Also, I observed decrease in the speed of read/write operations. The process starts with numbers like 25 MB/s for reading BK file and 35 MB/s for writing FDB file and then gradually slows down to 5 MB/s for reading and 8-10 MB/s for writing.
I will try again with parallel workers count modified and post here new data.
You should read the docs about -PARALLEL switch of gbak.
Uh, no... Who reads docs ))
By the way, why don't activate PARALLEL mode by default? What are cons?
How about increasing TempCacheLimit to 16384M? For the first half of the process of indices creation I observed no temp sort files at all, then they started to appear, but with size 0.
I can't recomend any value for TempCacheLimit as I have no idea of how big sort files could be. "Zero" size of temp sort files you observe is because lazy updating of FS metadata by OS.
Also, I observed decrease in the speed of read/write operations. The process starts with numbers like 25 MB/s for reading BK file and 35 MB/s for writing FDB file and then gradually slows down to 5 MB/s for reading and 8-10 MB/s for writing.
About reads - looks like buffer (and FS) cache works. Less writing at the end of process could be because of indices requires much less space than tables while CPU "cost per page" is higher.
By the way, why don't activate PARALLEL mode by default? What are cons?
New feature should be well tested by end users and enough feedback provided before we can make it default mode. Also, it is not easy to guess good default value for number of parallel workers in general case. With help of you and other customers we could change this before final release, of course ;)
I can't recomend any value for TempCacheLimit as I have no idea of how big sort files could be. "Zero" size of temp sort files you observe is because lazy updating of FS metadata by OS.
Is there any correlation between temp sort file size and index size I can see in the database statistics? Large indices have around 1 GB each in the database.
And regarding temp cache's increasing granularity. Does 1M means that for 1GB index there would be 1000 of buffer increases?
Is there any correlation between temp sort file size and index size I can see in the database statistics? Large indices have around 1 GB each in the database.
There's a correlation, although it's not 1:1, compression should be also taken into account. Sort temp file is usually larger than the index.
And regarding temp cache's increasing granularity. Does 1M means that for 1GB index there would be 1000 of buffer increases?
Not necessarily. 1MB is the lowest allocation chunk but if the sorting module will request e.g. 100MB of continuous space, it will be extended in one step.
Here we go!
6:35:00
With --parallel 8
switch and TempCacheLimit = 16384M
. No temp sort files were created during the restore process.
It is 56% against FB3. So, the declaration of two times increase in speed is quite correct.
Then I tried to check how the restored database would be archived with FB5 and got error:
gbak -b "....FDB" "....bk" -user sysdba -pas ... -se localhost/3056:service_mgr -parallel 8
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
In the firebird.log file there is a record:
SERVER Wed Nov 16 19:59:42 2022
INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056
Just two times faster with 8 workes ? Seems not much for me ;) What page cache value was set for DB ? Does you monitor CPU and IO load during restore ?
Then I tried to check how the restored database would be archived with FB5 and got error:
Looks like server crash :( Is it reproducible ? If yes, could you produce crash dump for me ?
Just two times faster with 8 workes ? Seems not much for me ;) What page cache value was set for DB ? Does you monitor CPU and IO load during restore ?
Overall CPU load was 70-80% through the whole process.
Write speed was mostly in 25-35 MB/s range.
Then I tried to check how the restored database would be archived with FB5 and got error:
Looks like server crash :( Is it reproducible ? If yes, could you produce crash dump for me ?
I'll repeat the error and create a dump file.
Set -par 16 switch. The process runs even faster, but again with error:
How to get dump? I have set flag in the registry:
[HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps]
'DumpType'=dword:00000002
But there is no dump file on the disk created?
Page size and other db file properties:
Database header page information:
Flags 0
Generation 12751
System Change Number 0
Page size 8192
ODS version 13.1
Oldest transaction 12727
Oldest active 12728
Oldest snapshot 12728
Next transaction 12761
Sequence number 0
Next attachment ID 51
Implementation HW=AMD/Intel/x64 little-endian OS=Windows CC=MSVC
Shadow count 0
Page buffers 2000
Next header page 0
Database dialect 3
Creation date Nov 16, 2022 11:57:55
Attributes force write
Variable header data:
Database GUID: {FCF9A099-D30D-427E-A80A-B76151EB086C}
Sweep interval: 0
*END*
How to get dump?
First, you must ensure that Firebird process was really crashed. To produce crash dumps on Windows you may follow this article:
https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps
or use SysInternals utility ProcDump, it have own instructions.
Page buffers 2000
It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.
Page buffers 2000
It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.
That is because Classic Server being used in production.
If it helps, now the next two records appear in log file after gbak has ended with error:
SERVER Wed Nov 16 21:29:51 2022
INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056
SERVER Wed Nov 16 21:29:51 2022
XNET error: XNET server initialization failed. Probably another instance of server is already running.
operating system directive CreateMutex failed
Nie mozna utworzyc pliku, ktory juz istnieje.
My settings for registering crash dumps:
gbak has ended with error again, but no dump has been created.
I even has installed debug version of latest FB 5.
are there any ideas how to ping that error?
Try set MaxStatementCacheSize = 0
config if you are having crash.
Page buffers 2000
It is too small for SS and very-very-very small for parallel operations. You may use -BUFFERS switch for restore to set another value. I recommend to start from 50000 at least.
I set -buffers 4_000_000 and -par 16 and got 6:43:00. Even worse than with a small buffer. Maybe four millions too much... I will try once more with 100 000.
Try set
MaxStatementCacheSize = 0
config if you are having crash.
It didn't help. Still getting:
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
at the end of backup process.
On 11/17/22 11:49, Andrej Kirejeŭ wrote:
Still getting:
|gbak: ERROR:Error reading data from the connection. gbak:Exiting before completion due to errors |
at the end of backup process.
What does gfix say about newly restored DB?
What does gfix say about newly restored DB?
I will check in 6 hours. Just started the restore process all over again with smaller buffer. Want to check if 6:30:00 record is breakable, or it is a limit for the system now.
So, I did four runs. Seems, that timing almost independent on workers count and buffer size in the parallel mode.
parallel workers | buffers | time | temp | db folder |
---|---|---|---|---|
1 | 2000 | 11:39:00 | NVMe | NVMe |
8 | 2000 | 6:35:00 | NVMe | NVMe |
10 | 100000 | 6:55:00 | NVMe | NVMe |
16 | 4000000 | 6:43:00 | NVMe | NVMe |
10 | 2000 | 11:00:00 | NVMe | HDD |
What does gfix say about newly restored DB?
gfix tells everything is ok.
If it helps, I ran gbak without -se
and -par
switches, and it has worked without error messages. So, it seems that -se
switch is to blame. Somehow, gbak looses the connection to the server at the very end of the restore process.
On 11/18/22 12:53, Andrej Kirejeŭ wrote:
If it helps, I ran gbak without |-se| and |-par| switches, and it has worked without error messages. So, it seems that |-se| switch is to blame. Somehow, gbak looses the connection to the server at the very end of the restore process.
Please provide both command lines: for failure & success.
failure:
gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr -par 10
success:
gbak -b localhost/3056:...fdb ...bk -user SYSDBA -pas ...
On 11/18/22 15:36, Andrej Kirejeŭ wrote:
failure:
|gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr -par 10 |
success:
|gbak -b localhost/3056:...fdb ...bk -user SYSDBA -pas ... |
Please try
gbak -b ...fdb ...bk -user SYSDBA -pas .... -se localhost/3056:service_mgr
i.e. w/o -PAR switch - I much doubt problem to be in -SE.
Without -par there is still error there.
0:25:03.02 C:\Program Files\FB5>gbak -b "....FDB" "....bk" -user sysdba -pas ... -se localhost/3056:service_mgr
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
So, the conclusion is that the presence of -se switch with or without -par switch leads to the error message at the end of the backup process. Without the -se switch, the process ends successfully.
On 11/20/22 20:06, Andrej Kirejeŭ wrote:
So, the conclusion is that the presence of -se switch with or without -par switch leads to the error message at the end of the backup process. Without the -se switch, the process ends successfully.
Can you provide stack trace for a crash?
If you tell me how to do it. Because, I doubt there is a crash at all. At least, no crash dump is being created. Seems that server closes socket earlier than gbak expect it to be closed.
After two days with no electricity, I'm trying to reproduce it, please be patient.
BTW, could you reproduce it with smaller database, i.e. not 300GB ? Also, could you describe exact steps to reproduce, including environment details such as number of FB instances running and its config's ?
two instances. FB3 on 3054 and FB5 on 3056. both firebird.conf files set to maximal compatibility with older version. i.e. LegacyAuth, LegacyUserManager etc.
I will try with smaller db. If it does matter this db has two warnings, two objects which uses ABS function from the standard UDF which isn't provided now by default.
Here is my results. First I restored largest backup file I have in hand - its size is 60GB, restored DB size is 86GB. Win 10, CPU i7 12700 (8 perf cores + 4 efficient cores), RAM 32 GB. Backup and temp on SATA SSD, DB on NVME SSD. Must note that I set affinity to performance cores only to get best results, it is important.
firebird.conf changes was
TempCacheLimit = 1G
CpuAffinityMask = 65535
MaxParallelWorkers = 64
workers | time, sec |
---|---|
1 | 3 819 |
4 | 1 706 |
8 | 1 310 |
16 | 1 251 |
32 | 1 322 |
Than I run backup of restored DB
workers | time, sec |
---|---|
1 | 1 220 |
4 | 434 |
8 | 267 |
16 | 220 |
32 | 222 |
looks huge. I will try again with CPU affinity mask set. By the way. I have 12 cores / 20 logical processors. What mask would be best? 2^12 - 1 or 2^ 20 - 1. I think, I should use only physical cores, not logical processors. But are they come first inthe affinity bit mask?
Can you elaborate? You have set CpuAffinityMask to 65535. This is first 15 bits set. How come that you use only performant cores if you have 12 cores in total? Or do you mean first 8 cores * 2 = 16 with hyperthreading turned on?
If it helps, now the next two records appear in log file after gbak has ended with error:
SERVER Wed Nov 16 21:29:51 2022 INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056 SERVER Wed Nov 16 21:29:51 2022 XNET error: XNET server initialization failed. Probably another instance of server is already running. operating system directive CreateMutex failed Nie mozna utworzyc pliku, ktory juz istnieje.
The second message could be put into log only when Firebird process is started. It is expected when you run two Firebird instances with the same IPCName - second instance can't create required named kernel objects as it was already created by the first one. So, if this message was put really after gbak finishes, it means Firebird process was crashed and restarted (by the service manager or guardian). And, if WER is tuned correctly, there must be memory dump somewhere. And there must be same message at the time Firebird was started initially, i.e. earlier.
So far it looks very confusing and I still have no all details necessary to explain the issue.
Can you elaborate? You have set CpuAffinityMask to 65535. This is first 15 bits set. How come that you use only performant cores if you have 12 cores in total? Or do you mean first 8 cores * 2 = 16 with hyperthreading turned on?
65535 is 16 bits, not 15 :) And, yes, it is 8 perf cores + 8 HT cores.
Any update on this issue? I believe the parallel performance is no longer a question, but I'm still worried about possible crashes mentioned.
I'll recheck with the latest version.
Any update on this issue? I believe the parallel performance is no longer a question, but I'm still worried about possible crashes mentioned.
Updated server to the latest build. Left security5.fdb
from previous try. Now, I'm getting an error at the very start of restore process:
LEPEL Fri Jan 27 23:43:06 2023
XNET error: XNET server initialization failed. Probably another instance of server is already running.
operating system directive CreateMutex failed
Nie mozna utworzyc pliku, ktory juz istnieje.
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Database: C:\PROGRAM FILES\FB5\SECURITY5.FDB
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
LEPEL Fri Jan 27 23:44:58 2023
Authentication error
cannot attach to password database
Error in isc_attach_database() API call when working with legacy security database
internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 487)
The error remains with the latest FB build.
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
I have tried with -zip key and without it. The result is the same.
firebird log contains the following lines:
LEPEL Sat Jan 28 23:58:12 2023
INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056
LEPEL Sat Jan 28 23:58:12 2023
XNET error: XNET server initialization failed. Probably another instance of server is already running.
operating system directive CreateMutex failed
Nie mozna utworzyc pliku, ktory juz istnieje.
There is no another instance of FB 5 server. There are FB25 and FB3 but they are under different port numbers and use different service names.
About XNET error: please, read again my comment above and change IPCName
(or stop another FB instances)
I have stopped all FB instances besides FB5 and still get this error.
Well, I have even stopped FB 5 and try just bare gbak. Now it falls with:
gbak: ERROR:I/O error during "CreateFile (create)" operation for file ""
gbak: ERROR: Error while trying to create file
gbak: ERROR: The system cannot find the path specified.
gbak: ERROR:gds_$start_request failed
gbak:Exiting before completion due to errors
Afraid, I can do nothing without steps to reproduce error by gbak. Please, create new ticket with as much info as possible.
I consider the original issue with this ticket resolved.
I'm a bit of disappointed with FB 5 performance. Here are timings:
FB3 -- 11:44:00 FB5 -- 11:39:00
Size of files after restore:
FB3 -- 295,560,544,256 bytes FB5 -- 304,557,326,336 bytes
Both were set up as:
FB5 parallel processing was set as:
The system is Windows 2019 server with 128 GB RAM, 2 TB NVME drive, 12 cores Intel i7 processor.
So, performance is the same in both cases. No acceleration as was promised.