FirebirdSQL / firebird

Firebird server, client and tools
https://www.firebirdsql.org/
1.19k stars 205 forks source link

Error: database file appears corrupted after restore from backup (FB5, RC2) #7942

Closed gsbelarus closed 1 month ago

gsbelarus commented 5 months ago

There are multiple Firebird instances on the server: FB25, FB3, and FB5. Each instance is assigned a dedicated port and an appropriate service name. When using a third-party application to perform a restore, it connects to the server via a TCP connection using the connection string localhost/3056:some_path_to_database.

At the end of the restoration process, the following error message appears:

Unable to complete network request to host "localhost". Error reading data from the connection.

Additionally, there is a record in the firebird.log file:

XNET error: XNET server initialization failed. Probably, another instance of the server is already running.

The resulting database file appears to be corrupted, and subsequent gfix -v -full ... shows:

Number of record level errors : 18722

sim1984 commented 5 months ago

XNET error: XNET server initialization failed. Probably, another instance of the server is already running.

To simultaneously run several servers using XNET, you need to configure the IpcName (firebird.conf) in the same way as you configure ports for different FB instances to work via INET.

gsbelarus commented 5 months ago

I used the connection string localhost/3056:path_to_database. Through all my life, I believed this is network connection. Am I wrong?

gsbelarus commented 5 months ago

Moreover, I checked the firebird.conf. There is no section dedicated to XNET settings exists. Where and what should I assign?

AlexPeshkoff commented 5 months ago

It's IPCNAME parameter that makes it possible to run multiple servers, listening xnet, in parallel. But it's related only to message in the log (XNET error: XNET server initialization failed. Probably, another instance of the server is already running.). What you mention first looks like server crash at the first look. Does it happen for all versions or only one particular?

gsbelarus commented 5 months ago

It's IPCNAME parameter that makes it possible to run multiple servers, listening xnet, in parallel. But it's related only to message in the log (XNET error: XNET server initialization failed. Probably, another instance of the server is already running.). What you mention first looks like server crash at the first look. Does it happen for all versions or only one particular?

I got another backup and now doing restoration. Will come with the result in a few hours.

As for XNET, is it enough if I set IPCNAME to FIREBIRD5 for Firebird 5?

AlexPeshkoff commented 5 months ago

On 1/2/24 19:57, Andrej Kirejeŭ wrote:

As for XNET, is it enough if I set IPCNAME to FIREBIRD5 for Firebird 5?

Enough to avoid mentioned message in the log.

gsbelarus commented 5 months ago

On 1/2/24 19:57, Andrej Kirejeŭ wrote: As for XNET, is it enough if I set IPCNAME to FIREBIRD5 for Firebird 5? Enough to avoid mentioned message in the log.

as it is a common scenario when multiple versions of Firebird co-exists on the same server, it would be useful to set the default IPCNAME to FIREBIRD5 for Firebird v.5, FIREBIRD6 for the next version etc

gsbelarus commented 5 months ago

sorry, bug confirmed. another backup -- the same result:

Unable to complete network request to host "localhost". Error reading data from the connection.

the message during the restore process and then:

Number of record level errors : 18722

during the gfix check.

hvlad commented 5 months ago

There are multiple Firebird instances on the server: FB25, FB3, and FB5. Each instance is assigned a dedicated port and an appropriate service name. When using a third-party application to perform a restore, it connects to the server via a TCP connection using the connection string localhost/3056:some_path_to_database.

What server version is supposed to perform restore ? Did that application run restore using Services API ? If, yes, how it is attached to the services manager and why it is used "localhost... " in target database name ? What happens if perform restore with gbak (or fbsvcmgr) using the same set of params ?

gsbelarus commented 5 months ago

What server version is supposed to perform restore ?

FB5 5.0.0.1304

Did that application run restore using Services API ?

yes

If, yes, how it is attached to the services manager

through IBX component:

https://github.com/GoldenSoftwareLtd/gedemin/blob/master/Gedemin/IBX/IBServices.pas

it has been doing this way for last 25 years ))

and why it is used "localhost... " in target database name ?

The Presence of localhost in the address string forced IBX to use TCP protocol flag in its components. We definitely needed it in the times of Yaffil, Firebird 0.9 or even earlier.

BTW, localhost/3056 allows us to call exactly FB5 server. As we have FB3, FB25 also running on this machine we could distinguish between them only by port number.

What happens if perform restore with gbak (or fbsvcmgr) using the same set of params ?

Need to check.

hvlad commented 5 months ago

When run restore with Services API, one should use server name in connection string, not in the target database name.

When speak about IBX, it means that server (remote host) name should be specified in ServerName property, not in DatabaseName property.

In your case localhost/3056:service_mgr is connection string and some_path_to_database (without server host name!) is target database name. If one use plain 'service_mgr' (empty string at ServerName property) it means connection uses XNET or, if it was not successful, fbclient will try 'localhost' without port number. As we have no idea what server version was started first and run XNET listener, we could assume it was not FB5 (message in firebird.log confirms it), I.e. another server instance run restore service and connects to the FB5 due to remote part in database name argument (DatabaseName property). Of course, it doesn't make whole restore process as fast as it should be.

In short: you should check and correct values of ServerName and DatabaseName properties.

BTW, what is last message before error ?

gsbelarus commented 5 months ago

This is how our UI organized. we specify the target server and path to the database in one string, then divide it programmatically and set the corresponding IBX component's properties. The connection definitely goes to the FB5 server because only FB5 process shows activity during the restore process.

Before the error message, there are a couple of warnings of unknown UDF functions. These are our functions like BIN_AND and BIN_OR. They duplicate built-in functions. There is no need for them right now, just leftovers in the database we didn't clean up yet. Nevertheless, these warnings were present for years and never resulted into database corruption.

hvlad commented 5 months ago

Could you be more specific and show service properties used ? There is some confusion that better to be cleared up. Also, did you investigated if server was crashed (as Alex suppose) ?

gsbelarus commented 5 months ago
      IBConfigService.ServerName := edServer.Text;

      if IBConfigService.ServerName > '' then
        IBConfigService.Protocol := TCP
      else
        IBConfigService.Protocol := Local;

in our case edServer.text contains localhost/3056

It would be easier to check whether the server crashed or not if appropriate records of server starting and server properly shutting down were put into the firebird.log file.

mrotteveel commented 5 months ago

On 02/01/2024 19:24, Andrej Kirejeŭ wrote:

On 1/2/24 19:57, Andrej Kirejeŭ wrote: As for XNET, is it enough if
I set IPCNAME to FIREBIRD5 for Firebird 5?
Enough to avoid mentioned message in the log.

as it is a common scenario when multiple versions of Firebird co-exists on the same server, it would be useful to set the default IPCNAME to FIREBIRD5 for Firebird v.5, FIREBIRD6 for the next version etc

No, it wouldn't be, because that would mean an older Firebird fbclient.dll wouldn't be able to connect using XNET to a newer Firebird server (or newer client to older server).

Mark -- Mark Rotteveel

gsbelarus commented 5 months ago

No, it wouldn't be, because that would mean an older Firebird fbclient.dll wouldn't be able to connect using XNET to a newer Firebird server (or newer client to older server). Mark -- Mark Rotteveel

Oh, no. I just changed names to FIREBIRD25, FIREBIRD3, FIREBIRD5... Let us see where it leads now.

But, it won't affect network connections, right?

gsbelarus commented 5 months ago

And regarding failed gfix. The firebird.log now contains thousands of records like:

XXXX Tue Jan 2 13:16:27 2024 Database: XXXX Error: Record 3 is wrong length in table GD_EMPLOYEE (1394)

All errors related to one table in the database.

mrotteveel commented 5 months ago

Oh, no. I just changed names to FIREBIRD25, FIREBIRD3, FIREBIRD5... Let us see where it leads now.

But, it won't affect network connections, right?

No, only XNET. Keep in mind that for a client to now be able to connect to the server with XNET, the firebird.conf of the client library (i.e. located in the same directory as fbclient.dll used by your application) must also contain the right IpcName setting.

hvlad commented 5 months ago
      IBConfigService.ServerName := edServer.Text;

      if IBConfigService.ServerName > '' then
        IBConfigService.Protocol := TCP
      else
        IBConfigService.Protocol := Local;

in our case edServer.text contains localhost/3056

Thanks, but...how IBConfigService is related with restore task ? Please, don't add more confusion than we already have. Also, we need to know value of DatabaseName property (at least).

It would be easier to check whether the server crashed or not if appropriate records of server starting and server properly shutting down were put into the firebird.log file.

If Firebird run as a service, look into Windows Event Log. Also, it is good idea to always have WER turned ON to collect crash dumps.

sim1984 commented 5 months ago

And regarding failed gfix. The firebird.log now contains thousands of records like:

XXXX Tue Jan 2 13:16:27 2024 Database: XXXX Error: Record 3 is wrong length in table GD_EMPLOYEE (1394)

All errors related to one table in the database.

If restore was not completed like yours, then it is not surprising that gfix shows errors. Here you should rather look at the errors in firebird.log and the log of restore itself (gbak -v -y <log_file>).

You are restoring the database, but where are you restoring it from? What ODS was the backup made from?

gsbelarus commented 5 months ago

The database was backed up on FB 3 server and restored on FB 5.

aafemt commented 5 months ago

Is restore with gbak ok? Can you share core dumps of crashed server?

gsbelarus commented 5 months ago

There are no suspicious records in the Windows's logs. So, I assume that FB5 service didn't crash.

aafemt commented 5 months ago

Do error happen when all servers but v5 are stopped?

gsbelarus commented 5 months ago

Do error happen when all servers but v5 are stopped?

cann't check right now. those servers being used.

gsbelarus commented 5 months ago

well, I'm trying to run from the command prompt and the command just hangs. no CPU activity, no records in the firebird.log file.

C:\Program Files\FB5>gbak -r "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb" -user sysdba -pas xxx 
-v -y "G:\Bases\Broiler\restore.log" -z
aafemt commented 5 months ago

At this point you can attach debugger to it to see what's happening.

hvlad commented 5 months ago

well, I'm trying to run from the command prompt and the command just hangs. no CPU activity, no records in the firebird.log file.

C:\Program Files\FB5>gbak -r "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb" -user sysdba -pas xxx -v -y "G:\Bases\Broiler\restore.log" -z

This command line not uses services and thus not equal to the app case. It should run in embedded mode, did you check CPU usage of whole system or of Firebird process only ? In second case, it is expected to not use CPU.

Could you run quick check with metadata-only restore using services, instead ? Like:

gbak -se localhost/3056:service_mgr -r -m -v "K:\Bases\xxx.bk" "g:\Bases\xxx.fdb"

gsbelarus commented 5 months ago

restoring of metadata goes without any problems:

gbak:gbak version WI-V5.0.0.1304 Firebird 5.0 RC 2
gbak:use up to 8 parallel workers
gbak:transportable backup -- data in XDR format
gbak:       backup file is compressed
gbak:backup version is 10
gbak:created database g:\Bases\xxx.fdb, page_size 8192 bytes
gbak:started transaction
...
gbak:adjusting views dbkey length
gbak:updating ownership of packages, procedures and tables
gbak:adding missing privileges
gbak:adjusting system generators
gbak: WARNING:function ABS is not defined
gbak: WARNING:    module name or entrypoint could not be found
gbak: WARNING:    function BIN_AND is not defined
gbak: WARNING:    module name or entrypoint could not be found
gbak:finishing, closing, and going home
gbak:adjusting the ONLINE and FORCED WRITES flags
gsbelarus commented 5 months ago

now, I will start restoring using gbak utility with -se switch.

gsbelarus commented 5 months ago

It should run in embedded mode, did you check CPU usage of whole system or of Firebird process only ? In second case, it is expected to not use CPU.

whole system usage was around 1%. 10 minutes of no activity. I canceled it with Ctrl-C.

hvlad commented 5 months ago

It should run in embedded mode, did you check CPU usage of whole system or of Firebird process only ? In second case, it is expected to not use CPU.

whole system usage was around 1%. 10 minutes of no activity. I canceled it with Ctrl-C.

And what was at "restore.log" ?

gsbelarus commented 5 months ago

It should run in embedded mode, did you check CPU usage of whole system or of Firebird process only ? In second case, it is expected to not use CPU.

whole system usage was around 1%. 10 minutes of no activity. I canceled it with Ctrl-C.

And what was at "restore.log" ?

no restore.log, no database file was created on the disk. The utility just hangs and was hanging for all ten minutes.

aafemt commented 5 months ago

Does it do the same if you try to restore to a different v5 server?

gsbelarus commented 4 months ago

I downloaded the latest build 1305 with debug symbols, and, run it twice. The first time with procdump attached to the service and the second one -- without procdump. In both cases, the process of restoring ended without any error messages and the server didn't crash. But anyway, the subsequent run of gfix utility shows that the newly restored database file is corrupted:

Summary of validation errors

        Number of record level errors   : 18722
gsbelarus commented 4 months ago

BTW, gfix significantly pollutes firebird.log file. Im my case, every run adds 18 722 records and makes existing log file practically unusable. I would suggest that gfix stops adding records from the single run after a reasonable amount. Say, add at max 10 records, and then write into the firebird.log file message "please, use -v -y switch to store complete log in a separate file".

gsbelarus commented 2 months ago

The error continues to occur in FB 5.0.1. For us, it is a great blocker as we stopped all works on upgrading customer's servers until resolving the situation.

 1:53:33,42 C:\Program Files\FB5>gbak -r g:\XXXXXXX.bk g:\XXXXXXXX.fdb -user sysdba -pas XXXXXXXX -se localhost/3056:service_mgr -v > g:\XXXXXXXXX.txt
gbak: ERROR:Error reading data from the connection.
gbak:Exiting before completion due to errors
hvlad commented 2 months ago

Provide reproducible test case, please

hvlad commented 2 months ago

BTW, what is the first 10 and last 10 lines at g:\XXXXXXXXX.txt ?

gsbelarus commented 2 months ago

@hvlad

gbak:opened file g:\XXXXXXXXXX.bk 
gbak:use up to 8 parallel workers 
gbak:transportable backup -- data in XDR format 
gbak:       backup file is compressed 
gbak:backup version is 10 
gbak:created database g:\XXXXXXXXXX.fdb, page_size 8192 bytes 
gbak:started transaction 
gbak:restoring domain DINTKEY 
gbak:restoring domain DDBKEY 
gbak:restoring domain DFOREIGNKEY 

...

gbak:    activating and creating deferred index AT_FK_GENERATORS_EDITORKEY 
gbak:    activating and creating deferred index RDB$FOREIGN89 
gbak:adjusting views dbkey length 
gbak:updating ownership of packages, procedures and tables 
gbak:adding missing privileges 
gbak:adjusting system generators 
gbak: WARNING:function ABS is not defined 
gbak: WARNING:    module name or entrypoint could not be found 
gbak: WARNING:    function BIN_AND is not defined 
gbak: WARNING:    module name or entrypoint could not be found 
gbak:finishing, closing, and going home 

image

hvlad commented 2 months ago

I downloaded the latest build 1305 with debug symbols, and, run it twice. The first time with procdump attached to the service and the second one -- without procdump. In both cases, the process of restoring ended without any error messages and the server didn't crash. But anyway, the subsequent run of gfix utility shows that the newly restored database file is corrupted:

Summary of validation errors

        Number of record level errors   : 18722

Could you show corresponding messages put by validation into firebird.log ?

hvlad commented 2 months ago

After the message 'finishing, closing, and going home' there should be 'adjusting the ONLINE and FORCED WRITES flags'. Looks like something bad happens when disconnecting from just restored database. Could you also show gstat -h output for such database (it should show shutdown mode as 'multi-user maintenance') ?

gsbelarus commented 2 months ago

@hvlad

Gstat execution time Mon Mar 18 16:29:38 2024

Database header page information:
        Flags                   0
        Generation              13247
        System Change Number    0
        Page size               8192
        ODS version             13.1
        Oldest transaction      3294
        Oldest active           3295
        Oldest snapshot         3295
        Next transaction        13226
        Sequence number         0
        Next attachment ID      15
        Implementation          HW=AMD/Intel/x64 little-endian OS=Windows CC=MSVC
        Shadow count            0
        Page buffers            500
        Next header page        0
        Database dialect        3
        Creation date           Mar 17, 2024 22:56:46
        Attributes              multi-user maintenance

    Variable header data:
        Database GUID:  {5FC2040A-4EC7-4E3C-8732-6FCBA1318CC6}
        Sweep interval:         0
        *END*
Gstat completion time Mon Mar 18 16:29:38 2024
aafemt commented 2 months ago

The error continues to occur in FB 5.0.1. For us, it is a great blocker as we stopped all works on upgrading customer's servers until resolving the situation.

Try to use gbak without -se switch and check if the problem persist.

hvlad commented 2 months ago

The error continues to occur in FB 5.0.1. For us, it is a great blocker as we stopped all works on upgrading customer's servers until resolving the situation.

Try to use gbak without -se switch and check if the problem persist.

He means - run embedded (!) restore and use procdump to watch for supposed crash.

aafemt commented 2 months ago

Not necessarily embedded, locahost attachment is fine. The key is not to use services: I guess if this problem may be related to Firebird server crash on service disconnect which I've once seen.

hvlad commented 2 months ago

I downloaded the latest build 1305 with debug symbols, and, run it twice. The first time with procdump attached to the service and the second one -- without procdump. In both cases, the process of restoring ended without any error messages and the server didn't crash. But anyway, the subsequent run of gfix utility shows that the newly restored database file is corrupted:

Summary of validation errors

        Number of record level errors   : 18722

If you run procdump against Windows service process - how do you sure procdump have enough privileges to watch for the process and create dump file ? Also, it is easy to check Windows Event log if serivce was crashed and restarted.

hvlad commented 2 months ago

Not necessarily embedded, locahost attachment is fine. The key is not to use services: I guess if this problem may be related to Firebird server crash on service disconnect which I've once seen.

How it will help us to catch the crash ?

aafemt commented 2 months ago

How it will help us to catch the crash ?

It would narrow borders for testcase creation. If problem is related to TCP server, it technically cannot be reproduced in embedded mode.

gsbelarus commented 2 months ago

relevant records from firebird.log:

LEPEL   Mon Mar 18 04:30:58 2024
    INET/inet_error: read errno = 10054, server host = localhost, address = ::1/3056

LEPEL   Mon Mar 18 16:32:12 2024
    Database: G:\XXXXX.FDB
    Validation started

LEPEL   Mon Mar 18 17:18:06 2024
    Database: G:\XXXX.FDB
    Validation finished: 0 errors, 0 warnings, 0 fixed