Firebird server stops accepting new connections after some time

agx4ever commented 1 year ago

I have server that runs FB3 and I want to migrate to FB4. I have created new test server and installed latest FB4. It works fine until one day. It can work few days or max 2 weeks without problems and then suddenly firebird server just stops accepting new connections. On server I can see firebird in process list, but it simply doesn't accept new connections. When I stop and then start firebird - it works fine again. Error log does not show anything unusual.

I tried same installation and same configuration on different server, to exclude hardware problems or software misconfigurations - and the result is same - FB process stops accepting new connections after some time.

OS: Linux, CentOS Stream release 8 Firebird 4.0.2 - Firebird-4.0.2.2816-0.amd64

--- firebird.conf --- TempDirectories = /mnt/data0/fb4/tmp/ DefaultDbCachePages = 2048 UseFileSystemCache = true TempBlockSize = 8M TempCacheLimit = 64M InlineSortThreshold = 2048 AuthServer = Srp256 AuthClient = Srp256, Srp UserManager = Srp ReadConsistency = 0 RemoteServicePort = 3050 LockMemSize = 1M LockHashSlots = 8191 ServerMode = SuperClassic

--- databases.conf --- dev_main = /mnt/data0/fb4/dev_main.fdb { DatabaseGrowthIncrement = 128M DeadlockTimeout = 10 DefaultDbCachePages = 32768 FileSystemCacheThreshold = 1048576 GCPolicy = combined LockHashSlots = 49999 LockMemSize = 40M } --- no replication configuration ---

Last time when the problem occurred I made fbguard and firebird process dumps with "gcore" command. I can send those dumps in email (or other convenient way, just tell how). If there is anything else I can do, to provide more information, please tell me.

sim1984 commented 1 year ago

The current database level configuration is more suitable for SuperServer mode than SuperClassic. The following values are too large:

DefaultDbCachePages = 32768
GCPolicy = combined

agx4ever commented 1 year ago

This is configuration for my test server. For database I am using pagesize 32768 and together with "DefaultDbCachePages = 32768" - each connection takes ~1GB of RAM. I have 40GB RAM installed and usually there are few connections simultaneously (for test server). If you think that this setting could be reason for this problem - I can reduce it.

For GCPolicy - I will change it to "cooperative". I have turned sweep process off and sweeping is done manually with gfix - I thought that it doesn't affect anything important.

hvlad commented 1 year ago

On server I can see firebird in process list, but it simply doesn't accept new connections.

What kind of connection string you use ? What error is returned by application ? Does you tried to connect using isql in such moment ? Could you try embedded (hostless) connection ?

GCPolicy = combined

Doesn't matters for non-SS architectures

agx4ever commented 1 year ago

What kind of connection string you use ?

From IBExpert: connect 'my_dev_dns/3050:dev_main' user "SYSDBA" password 'VerySecurePass';

What error is returned by application ?

Mostly I am using Java and Jaybird driver (4.0.8.java11). When Firebird hangs - I get errors:

java.sql.SQLNonTransientConnectionException: Unable to complete network request to host "xx.xx.xx.xx". [SQLState:08006, ISC error code:335544721]
        at org.firebirdsql.gds.ng.FbExceptionBuilder$Type$5.createSQLException(FbExceptionBuilder.java:598)
        at org.firebirdsql.gds.ng.FbExceptionBuilder$ExceptionInformation.toSQLException(FbExceptionBuilder.java:492)
        at org.firebirdsql.gds.ng.FbExceptionBuilder.toSQLException(FbExceptionBuilder.java:223)
        at org.firebirdsql.gds.ng.wire.WireConnection.socketConnect(WireConnection.java:236)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.performConnect(FbWireDatabaseFactory.java:50)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:39)
        at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:32)
        at org.firebirdsql.jca.FBManagedConnection.<init>(FBManagedConnection.java:145)
        at org.firebirdsql.jca.FBManagedConnectionFactory.createManagedConnection(FBManagedConnectionFactory.java:599)
        at org.firebirdsql.jca.FBStandAloneConnectionManager.allocateConnection(FBStandAloneConnectionManager.java:65)
        at org.firebirdsql.jdbc.FBDataSource.getConnection(FBDataSource.java:109)
        at org.firebirdsql.jdbc.FBDriver.connect(FBDriver.java:114)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:228)

Also when connecting from IBExpert - it waits around one minute and then says connection failed.

Does you tried to connect using isql in such moment ? Could you try embedded (hostless) connection ?

Not yet - I will try this next time when server will freeze. It's sad, that I can't reproduce this problem on demand.

agx4ever commented 1 year ago

From Java I'm using this connection string:

jdbc:firebirdsql://%s:%s/%s?charSet=%s&encoding=%s&roleName=%s&TRANSACTION_READ_COMMITTED=isc_tpb_read_committed,isc_tpb_nowait,isc_tpb_rec_version

Today FB4 server again froze. So I tried to connect from isql tool from same server.

[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect 127.0.0.1:dev_main user SYSDBA password 'VerySecurePass';
Statement failed, SQLSTATE = 08004
connection rejected by remote interface
SQL>

It took around 1 minute before error message. After that I tried embedded connection without specifying user/pass.

[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect /data/fb4/dev_main.fdb ;

It never connected - there was no error messages or anything.

hvlad commented 1 year ago

Could you provide full memory dump of firebird process and another one of hung isql (with embedded connection) ?

agx4ever commented 1 year ago

Could you provide full memory dump of firebird process and another one of hung isql (with embedded connection) ?

Today I encountered same problem. I made asked memory dumps. isql dump was quite small, but firebird process dump took a while and it is ~140GB large. Compressed file is ~230MB large. I can send dropbox share link but only privately (as it may contain sensitive data). My guess would be, that Firebird freezes due to memory allocation problems, but then it should write according error message or something like that. Also server memory and swap space wasn't fully used. Where I can send share link?

AlexPeshkoff commented 1 year ago

This core file should go to me, but how was it compressed? Please make sure you've used tar with --sparse switch to process core dump, in other case I may have problems decompressing it. Also xz may be used - it automatically detects sparse files. Link to be send to peshkoff@mail.ru.

agx4ever commented 1 year ago

This core file should go to me, but how was it compressed? Please make sure you've used tar with --sparse switch to process core dump, in other case I may have problems decompressing it. Also xz may be used - it automatically detects sparse files. Link to be send to peshkoff@mail.ru.

I just sent link for dump files. Dump was compressed with "gzip -9 " command.

AlexPeshkoff commented 1 year ago

Please use xz next time instead gzip for core dumps (file was decompressing >hour - due to disk load). It's not an issue with compressed size, it's about sparse core dump.

AlexPeshkoff commented 1 year ago

Also I need he following libraries from your box: ld-linux-x86-64.so.2 libc.so.6 libdl.so.2 libgcc_s.so.1 libgpm.so.2 libm.so.6 libpthread.so.0 librt.so.1 libstdc++.so.6 libthread_db.so.1 libtommath.so.0 libz.so.1

agx4ever commented 1 year ago

Sorry for using wrong compression method. I just added required libs from Linux server to the same shared folder. I used tar xz compression method this time. If you need anything else - just ask.

AlexPeshkoff commented 1 year ago

Definitely wrong libraries: Error while mapping shared library sections: /opt/lib64/libdl.so.2': Shared library architecture i386 is not compatible with target architecture i386:x86-64. Error while mapping shared library sections: /opt/lib64/libm.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64. Error while mapping shared library sections: /opt/lib64/libgcc_s.so.1': Shared library architecture i386 is not compatible with target architecture i386:x86-64. Error while mapping shared library sections: /opt/lib64/libpthread.so.0': Shared library architecture i386 is not compatible with target architecture i386:x86-64. Error while mapping shared library sections: /opt/lib64/libc.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64. Error while mapping shared library sections: /opt/lib64/libstdc++.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.

agx4ever commented 1 year ago

Sorry. It looks I have copied some wrong files from 32bit folder. I just uploaded correct libraries from x64 directory. Check once more shared folder.

AlexPeshkoff commented 1 year ago

Sooner of all your hang is already fixed in current codebase. Please try current snapshot. In any case it should provide more informative core dumps.

PS. If snapshot anyway hangs (with current dump it's hard to diagnose exact reason) please do not try to attach to server 16000 times - almost all core dump (>90% size) contains stacks of attach threads waiting in same place.

agx4ever commented 1 year ago

I installed 4.0.3 snapshot build and it worked almost 2 weeks without problems. But today Firebird stalled. I made another core dump and uploaded in that same file share in folder named "2023-mar-27". If there is needed more information or there are some recommendations - just tell me.

AlexPeshkoff commented 1 year ago

I also need snapshot binaries + d4ebug info.

agx4ever commented 1 year ago

I uploaded Firebird 4.0.3 binaries I'm using. What kind of debug info you need?

AlexPeshkoff commented 1 year ago

One which came with that file - Firebird-debuginfo-4.0.3.2906-0.amd64.tar.gz

agx4ever commented 1 year ago

I don't have debuginfo archive from that snapshot build :( and also there is no snapshot archive available on firebird download page. I didn't know that I have to save debuginfo archive when downloading snapshot. I will install today's snapshot and will save debuginfo archive and try to replicate problem.

agx4ever commented 1 year ago

After last problem I installed newest 4.0.3 snapshot build and now it worked around 3 weeks without problems. But today again Firebird stalled. So now I made another core dump and uploaded in that same file share in folder named "2023-apr-24". I also included Firebird snapshot binaries and debuginfo package. If there is something additional needed - just ask.

AlexPeshkoff commented 1 year ago

Once again new case never seen before in your dumps. Though symptoms may look similar - but definitely other reason.

Sorry, the only thing I could do this time is enhance debugging information collecting (https://github.com/FirebirdSQL/firebird/commit/3019afa0e5be924b8374b2aa5f5fdeec14228d33).

agx4ever commented 1 year ago

Sorry for long silence on this issue. I was playing with different configurations to seek some clues on this problem. I got few times when Firebird got stalled. I even restored database from backup, to rule out metadata corruptions possibility. This time from fresh restart Firebird worked around 5 days and then today (to be precise - this night) again stalled. I made core dump, included firebird version and also debuginfo package. I uploaded everything at same share as previous in folder "2023-jun-08". In case you need something additional or can't access - just let me know. Suggestions or ideas are also welcome. Strange, that no one else sees same problems.

agx4ever commented 1 year ago

Today I made another 2 dumps that I believe is right before Firebird hangs up. I executed simple update query to update one field by primary key and it just stalled and never executed (from IBExpert). After that I connected from other IBExpert to database and wanted to kill my previous connection. So I opened "Database Monitoring" tool and tried to list all active statements. It executes code: select st.mon$statement_id as Statement_ID, st.mon$attachment_id as Attachment_ID, st.mon$explained_plan as Explained_Plan, st.mon$transaction_id Transaction_ID, a.mon$user as User_Name, a.mon$remote_address as Remote_Address, a.mon$remote_pid as Remote_PID, a.mon$remote_process as Remote_Process, a.mon$client_version as Client_version, a.mon$remote_version as Remote_Protocol_Version, a.mon$remote_host as Remote_Host_Name, a.mon$remote_os_user as Remote_User_Name, a.mon$auth_method as Authentication_Method, case when a.mon$system_flag = 0 then 'Normal' when a.mon$system_flag = 1 then 'System' end as Connection_Type, a.mon$idle_timeout as Idle_Timeout, a.mon$idle_timer as Idle_Timer, a.mon$statement_timeout as Statement_Timeout, a.mon$wire_compressed as Wire_Compressed, a.mon$wire_encrypted as Wire_Encrypted, a.mon$wire_crypt_plugin as Wire_Crypt_Plugin, case when st.mon$state = 0 then 'IDLE' when st.mon$state = 1 then 'ACTIVE' end as State, st.mon$timestamp Started_At, st.mon$sql_text Statement_Text, st.mon$statement_timeout as Statement_Timeout, st.mon$statement_timer as Statement_Timer, r.mon$record_seq_reads as Non_indexed_Reads, r.mon$record_idx_reads as Indexed_Reads, r.mon$record_inserts as Records_Inserted, r.mon$record_updates as Records_Updated, r.mon$record_deletes as Records_Deleted, r.mon$record_backouts as Records_Backed_Out, r.mon$record_purges as Records_Purged, r.mon$record_expunges as Records_Expunged, r.mon$record_locks as Record_Locks, r.mon$record_waits as Record_Waits, r.mon$record_conflicts as Record_Conflicts, r.mon$backversion_reads as Backversion_Reads, r.mon$fragment_reads as Fragment_Reads, r.mon$record_rpt_reads as Record_Rpt_Reads, r.mon$record_imgc as Records_IMGC, io.mon$page_reads as Page_Reads, io.mon$page_writes as Page_Writes, io.mon$page_fetches as Page_Fetches, io.mon$page_marks as Page_Marks from mon$statements st join mon$attachments a on a.mon$attachment_id = st.mon$attachment_id join mon$record_stats r on (st.mon$stat_id = r.mon$stat_id) join mon$io_stats io on (st.mon$stat_id = io.mon$stat_id) order by st.mon$timestamp It never executed - just stalled. Then I tried to close my stalled connections from few more computers, but those connections got stuck in same manner. I believe it's beginning of Firebird hanging up. In order to continue working - I restarted Firebird service and then my update query and monitoring queries worked just fine. Between those steps I made two core dumps. I uploaded them in "2023-jun-13" folder in same share. Firebird and libs are same as found in "2023-jun-08" folder.

EPluribusUnum commented 1 year ago

From time to time we also have the sam issue, Firebirds stops acceppting new connections and select with MON$ tables freeze in active clients. Unfortunately we could not produce dump. Hope this issue will be resolved with the help of new dumps.

AlexPeshkoff commented 1 year ago

I do not remember where from to download core dumps. Also please put there binaries & debug info.

agx4ever commented 1 year ago

@AlexPeshkoff I just resent access information to core dumps to your email.

AlexPeshkoff commented 1 year ago

Looks like you have embedded connections to your database, and that embedded connections hang sometimes. I see no other reasons for current behavior. To better understand what happens please next time when you have that problem in addition to core dump do the following: fb_lock_print -d /srv/fb4/dev_main.fdb -c -a >somefile.txt and add somefile.txt together with core dump.

agx4ever commented 1 year ago

Today again FB started to show hanging symptoms and I made core dump and also fb_lock_print as suggested into somefile.txt ;) All requested files are uploaded to the same share under folder: 2023-jun-19

AlexPeshkoff commented 1 year ago

On 6/19/23 11:02, agx4ever wrote:

Today again FB started to show hanging symptoms and I made core dump and also fb_lock_print as suggested into somefile.txt ;) All requested files are uploaded to the same share under folder: 2023-jun-19

Does procedure XRF_IS_UNIT_COMPENSATED$S appears to you interesting (not trivial) in any aspect ?

AlexPeshkoff commented 1 year ago

On 6/19/23 11:02, agx4ever wrote:

Today again FB started to show hanging symptoms and I made core dump and also fb_lock_print as suggested into somefile.txt ;) All requested files are uploaded to the same share under folder: 2023-jun-19

I've found something interesting in this dump / lock_print. Are you ready to run special build (some devel checks missing in regular production build will be turned on)?

agx4ever commented 1 year ago

Yes, of course I'm ready to run special build. Just give it to me and I definitely give it a try.

agx4ever commented 1 year ago

Update on issue. I have installed special build from @AlexPeshkoff with built in debugging / core dump when suspicious conditions are met. Firebird have crashed already few times and it should have produced core dumps, but because of my server misconfiguration - all those core dumps were truncated and are useless. I have now reconfigured server (few times actually) to save full coredumps and I hope that soon I will have necessary debug info. My bet is that those debug/suspicious conditions is the right place in code for this problem, because - now when firebird crashes, it produces coredump and systemctl process restarts Firebird. It no more stays in halted/hanged state.

agx4ever commented 1 year ago

I have acquired successful 4 core dumps with provided special FB build. All files are uploaded at previous file share under folder "2023-jul-29". There are also debuginfo, Firebird binaries and libs used. If there is something else needed - just ask. Thank you for your support!

AlexPeshkoff commented 1 year ago

Good news - all 4 dumps show exactly the state that I've expected, all are reasonably same and rather informative. I also need your firebird.log and exact times when dumps were created - dropbox looses file creation time info.

agx4ever commented 1 year ago

Very good news! I just uploaded firebird.log file in same folder. These 4 uploaded coredumps and exact times you can find when reading from end of log file. There are older abnormal termination entries as well.

AlexPeshkoff commented 1 year ago

I see you've sent very truncated log. But what is in log AFTER abort is not interesting, I want to see did something happen right BEFORE abort.

agx4ever commented 1 year ago

It's full log as it is on server. I haven't removed any entry. There is nothing interesting there. Maybe there are options to output more detailed info? If yes - I need instructions how to set up such logging.

AlexPeshkoff commented 1 year ago

Sorry - looked truncated. And no - there are no such options. OK, negative result is also result.

AlexPeshkoff commented 1 year ago

Please install new special build from https://drive.google.com/drive/folders/14JaiJoRBNhgHBkfolnBHZZDP6pu9Owg0?usp=sharing As soon as you get first core - report about it please.

agx4ever commented 1 year ago

Thank you for your fast response! As you asked - I installed special build and today I got new coredump. As always - I uploaded it at previous file share under folder "2023-aug-08" and I attached log file as well.

AlexPeshkoff commented 1 year ago

FB3 is almost unaffected - AST on change encryption state should not happen too often (unlike TPC one since FB4). Anyway backported required part of fix to it.

AlexPeshkoff commented 1 year ago

@agx4ever You can upgrade to tomorrow snapshot (just make sure it's OK on http://firebirdtest.com/), it will contain fix for your bug. But if you can provide me 2 or 3 more dumps it will help us make sure we fixed all possible reasons of a bug.

agx4ever commented 10 months ago

After I installed snapshot build with this fix - everything works fine and Firebird server hasn't crashed already two months. It seems that this issue is fixed. Thank you for your fast support and problem debugging! When this fix will be published in regular version build?

AlexPeshkoff commented 10 months ago

On 10/13/23 11:16, agx4ever wrote:

After I installed snapshot build with this fix - everything works fine and Firebird server hasn't crashed already two months. It seems that this issue is fixed. Thank you for your fast support and problem debugging! When this fix will be published in regular version build?

It's present in 4.0.3, but I highly recommend you wait for 4.0.4 - new regressions are too bad.

FirebirdSQL / firebird

Firebird server stops accepting new connections after some time #7480