Closed agx4ever closed 1 year ago
The current database level configuration is more suitable for SuperServer mode than SuperClassic. The following values are too large:
DefaultDbCachePages = 32768
GCPolicy = combined
This is the configuration for my test server. The database uses page size 32768, and together with "DefaultDbCachePages = 32768" each connection takes ~1GB of RAM. I have 40GB of RAM installed and there are usually only a few simultaneous connections (it is a test server). If you think this setting could be the reason for the problem, I can reduce it.
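The ~1GB-per-connection figure follows directly from the two values quoted above. A minimal sketch of the arithmetic, assuming each connection keeps its own page cache (as in Classic and SuperClassic) and ignoring every other per-connection allocation; the function names are illustrative only:

```python
# Rough page cache estimate for one connection, assuming a private cache
# per connection and no other memory overhead (a simplification).
PAGE_SIZE_BYTES = 32768          # database page size from the post
DEFAULT_DB_CACHE_PAGES = 32768   # DefaultDbCachePages from databases.conf
GIB = 2 ** 30


def cache_bytes_per_connection(page_size: int, cache_pages: int) -> int:
    """Page cache size in bytes for a single connection."""
    return page_size * cache_pages


def max_connections_for_ram(ram_bytes: int, page_size: int, cache_pages: int) -> int:
    """How many such caches fit into the given amount of RAM."""
    return ram_bytes // cache_bytes_per_connection(page_size, cache_pages)


if __name__ == "__main__":
    per_conn = cache_bytes_per_connection(PAGE_SIZE_BYTES, DEFAULT_DB_CACHE_PAGES)
    print(f"cache per connection: {per_conn / GIB:.1f} GiB")
    print("connections that fit in 40 GiB:",
          max_connections_for_ram(40 * GIB, PAGE_SIZE_BYTES, DEFAULT_DB_CACHE_PAGES))
```

With these numbers the page caches alone would fill the 40GB of RAM at about 40 connections, which is why the value suits a shared-cache architecture better.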
For GCPolicy - I will change it to "cooperative". I have turned sweep process off and sweeping is done manually with gfix - I thought that it doesn't affect anything important.
On server I can see firebird in process list, but it simply doesn't accept new connections.
What kind of connection string do you use? What error is returned by the application? Did you try to connect using isql at such a moment? Could you try an embedded (hostless) connection?
GCPolicy = combined
Doesn't matter for non-SuperServer architectures.
What kind of connection string do you use?
From IBExpert: connect 'my_dev_dns/3050:dev_main' user "SYSDBA" password 'VerySecurePass';
What error is returned by the application?
Mostly I am using Java and Jaybird driver (4.0.8.java11). When Firebird hangs - I get errors:
java.sql.SQLNonTransientConnectionException: Unable to complete network request to host "xx.xx.xx.xx". [SQLState:08006, ISC error code:335544721]
at org.firebirdsql.gds.ng.FbExceptionBuilder$Type$5.createSQLException(FbExceptionBuilder.java:598)
at org.firebirdsql.gds.ng.FbExceptionBuilder$ExceptionInformation.toSQLException(FbExceptionBuilder.java:492)
at org.firebirdsql.gds.ng.FbExceptionBuilder.toSQLException(FbExceptionBuilder.java:223)
at org.firebirdsql.gds.ng.wire.WireConnection.socketConnect(WireConnection.java:236)
at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.performConnect(FbWireDatabaseFactory.java:50)
at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:39)
at org.firebirdsql.gds.ng.wire.FbWireDatabaseFactory.connect(FbWireDatabaseFactory.java:32)
at org.firebirdsql.jca.FBManagedConnection.<init>(FBManagedConnection.java:145)
at org.firebirdsql.jca.FBManagedConnectionFactory.createManagedConnection(FBManagedConnectionFactory.java:599)
at org.firebirdsql.jca.FBStandAloneConnectionManager.allocateConnection(FBStandAloneConnectionManager.java:65)
at org.firebirdsql.jdbc.FBDataSource.getConnection(FBDataSource.java:109)
at org.firebirdsql.jdbc.FBDriver.connect(FBDriver.java:114)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:228)
Also when connecting from IBExpert - it waits around one minute and then says connection failed.
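To tell a refused connection apart from one that silently times out, a plain TCP probe against the server port can help. This is only a sketch: `probe_port` is a hypothetical helper, and it checks the TCP listener only. A server that accepts the TCP connection can still hang later during the Firebird protocol handshake, so "accepted" does not prove the server is healthy.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as accepted, refused, timeout or error."""
    try:
        # create_connection performs the TCP handshake only; no Firebird
        # protocol traffic is exchanged.
        with socket.create_connection((host, port), timeout=timeout):
            return "accepted"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # 3050 is the RemoteServicePort from the posted firebird.conf.
    print(probe_port("127.0.0.1", 3050))
```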
Did you try to connect using isql at such a moment? Could you try an embedded (hostless) connection?
Not yet - I will try this the next time the server freezes. Sadly, I can't reproduce this problem on demand.
From Java I'm using this connection string:
jdbc:firebirdsql://%s:%s/%s?charSet=%s&encoding=%s&roleName=%s&TRANSACTION_READ_COMMITTED=isc_tpb_read_committed,isc_tpb_nowait,isc_tpb_rec_version
Today the FB4 server froze again, so I tried to connect with the isql tool from the same server.
[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect 127.0.0.1:dev_main user SYSDBA password 'VerySecurePass';
Statement failed, SQLSTATE = 08004
connection rejected by remote interface
SQL>
It took around 1 minute before the error message appeared. After that I tried an embedded connection without specifying user/pass.
[root@dev1 fb402]# ./isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> connect /data/fb4/dev_main.fdb ;
It never connected - there was no error message or anything else.
Could you provide full memory dump of firebird process and another one of hung isql (with embedded connection) ?
Today I encountered the same problem and made the requested memory dumps. The isql dump was quite small, but the firebird process dump took a while and is ~140GB. The compressed file is ~230MB. I can send a Dropbox share link, but only privately (as it may contain sensitive data). My guess is that Firebird freezes due to memory allocation problems, but then it should write a corresponding error message or something like that. Also, server memory and swap space weren't fully used. Where can I send the share link?
This core file should go to me, but how was it compressed? Please make sure you've used tar with the --sparse switch to process the core dump; otherwise I may have problems decompressing it. xz may also be used - it automatically detects sparse files. Please send the link to peshkoff@mail.ru.
I just sent link for dump files.
Dump was compressed with "gzip -9
Please use xz next time instead of gzip for core dumps (the file took more than an hour to decompress, due to disk load). It's not an issue with compressed size, it's about the core dump being sparse.
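Whether a given core file is sparse can be checked by comparing its apparent size with the space actually allocated on disk. A minimal sketch, assuming a Linux filesystem where `st_blocks` is counted in 512-byte units; the helper names are made up for illustration:

```python
import os


def apparent_and_allocated(path: str) -> tuple[int, int]:
    """Return (apparent size, allocated size) of a file in bytes."""
    st = os.stat(path)
    # st_blocks counts 512-byte units regardless of the filesystem block size.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path: str) -> bool:
    """True when the file occupies less disk space than its apparent size."""
    apparent, allocated = apparent_and_allocated(path)
    return allocated < apparent


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        apparent, allocated = apparent_and_allocated(name)
        print(f"{name}: apparent={apparent} allocated={allocated} "
              f"sparse={allocated < apparent}")
```

A tool that is not sparse-aware writes out every hole as real zero bytes when unpacking, which would be consistent with the slow decompression described above.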
Also I need the following libraries from your box: ld-linux-x86-64.so.2 libc.so.6 libdl.so.2 libgcc_s.so.1 libgpm.so.2 libm.so.6 libpthread.so.0 librt.so.1 libstdc++.so.6 libthread_db.so.1 libtommath.so.0 libz.so.1
Sorry for using the wrong compression method. I just added the required libs from the Linux server to the same shared folder, compressed with tar and xz this time. If you need anything else - just ask.
Definitely wrong libraries:
Error while mapping shared library sections:
/opt/lib64/libdl.so.2': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
/opt/lib64/libm.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
/opt/lib64/libgcc_s.so.1': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
/opt/lib64/libpthread.so.0': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
/opt/lib64/libc.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
Error while mapping shared library sections:
/opt/lib64/libstdc++.so.6': Shared library architecture i386 is not compatible with target architecture i386:x86-64.
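A mismatch like this can be caught before uploading by reading the ELF header of each library. The sketch below relies only on the documented ELF layout (the class byte at offset 4, the byte-order flag at offset 5, `e_machine` at offset 18); the helper names are hypothetical and only the two architectures from the messages above are mapped to names.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_CLASS_NAMES = {1: "ELF32", 2: "ELF64"}
E_MACHINE_NAMES = {3: "i386", 62: "x86-64"}  # EM_386, EM_X86_64


def elf_arch(header: bytes) -> tuple[str, str]:
    """Return (class, machine) decoded from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class = header[4]
    # EI_DATA: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return (EI_CLASS_NAMES.get(ei_class, "unknown"),
            E_MACHINE_NAMES.get(e_machine, str(e_machine)))


def file_arch(path: str) -> tuple[str, str]:
    """Read the ELF header of a file on disk and decode it."""
    with open(path, "rb") as f:
        return elf_arch(f.read(20))


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, *file_arch(name))
```

Every library collected for a 64-bit firebird core dump should report `ELF64 x86-64`.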
Sorry. It looks like I copied some wrong files from the 32-bit folder. I just uploaded the correct libraries from the x64 directory. Please check the shared folder once more.
Most likely your hang is already fixed in the current codebase. Please try the current snapshot. In any case it should provide more informative core dumps.
PS. If the snapshot still hangs (with the current dump it's hard to diagnose the exact reason), please do not try to attach to the server 16000 times - almost all of the core dump (>90% of its size) consists of stacks of attach threads waiting in the same place.
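One way to avoid piling up thousands of waiting attach threads is to cap reconnect attempts on the client side. The following is a generic sketch, not anything taken from Jaybird or Firebird: `connect` stands for any callable that opens a connection and raises `OSError` on failure, and the attempt limit and delays are arbitrary assumptions.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() up to max_attempts times with exponential backoff."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            # Wait before the next try, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests; production callers would leave the default.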
I installed 4.0.3 snapshot build and it worked almost 2 weeks without problems. But today Firebird stalled. I made another core dump and uploaded in that same file share in folder named "2023-mar-27". If there is needed more information or there are some recommendations - just tell me.
I also need the snapshot binaries + debug info.
I uploaded Firebird 4.0.3 binaries I'm using. What kind of debug info you need?
One which came with that file - Firebird-debuginfo-4.0.3.2906-0.amd64.tar.gz
I don't have debuginfo archive from that snapshot build :( and also there is no snapshot archive available on firebird download page. I didn't know that I have to save debuginfo archive when downloading snapshot. I will install today's snapshot and will save debuginfo archive and try to replicate problem.
After last problem I installed newest 4.0.3 snapshot build and now it worked around 3 weeks without problems. But today again Firebird stalled. So now I made another core dump and uploaded in that same file share in folder named "2023-apr-24". I also included Firebird snapshot binaries and debuginfo package. If there is something additional needed - just ask.
Once again this is a new case, never seen before in your dumps. The symptoms may look similar, but the reason is definitely different.
Sorry, the only thing I could do this time is enhance debugging information collecting (https://github.com/FirebirdSQL/firebird/commit/3019afa0e5be924b8374b2aa5f5fdeec14228d33).
Sorry for long silence on this issue. I was playing with different configurations to seek some clues on this problem. I got few times when Firebird got stalled. I even restored database from backup, to rule out metadata corruptions possibility. This time from fresh restart Firebird worked around 5 days and then today (to be precise - this night) again stalled. I made core dump, included firebird version and also debuginfo package. I uploaded everything at same share as previous in folder "2023-jun-08". In case you need something additional or can't access - just let me know. Suggestions or ideas are also welcome. Strange, that no one else sees same problems.
Today I made another 2 dumps that I believe were taken right before Firebird hangs.
I executed simple update query to update one field by primary key and it just stalled and never executed (from IBExpert).
After that I connected from other IBExpert to database and wanted to kill my previous connection. So I opened "Database Monitoring" tool and tried to list all active statements. It executes code:
select
  st.mon$statement_id as Statement_ID,
  st.mon$attachment_id as Attachment_ID,
  st.mon$explained_plan as Explained_Plan,
  st.mon$transaction_id Transaction_ID,
  a.mon$user as User_Name,
  a.mon$remote_address as Remote_Address,
  a.mon$remote_pid as Remote_PID,
  a.mon$remote_process as Remote_Process,
  a.mon$client_version as Client_version,
  a.mon$remote_version as Remote_Protocol_Version,
  a.mon$remote_host as Remote_Host_Name,
  a.mon$remote_os_user as Remote_User_Name,
  a.mon$auth_method as Authentication_Method,
  case when a.mon$system_flag = 0 then 'Normal' when a.mon$system_flag = 1 then 'System' end as Connection_Type,
  a.mon$idle_timeout as Idle_Timeout,
  a.mon$idle_timer as Idle_Timer,
  a.mon$statement_timeout as Statement_Timeout,
  a.mon$wire_compressed as Wire_Compressed,
  a.mon$wire_encrypted as Wire_Encrypted,
  a.mon$wire_crypt_plugin as Wire_Crypt_Plugin,
  case when st.mon$state = 0 then 'IDLE' when st.mon$state = 1 then 'ACTIVE' end as State,
  st.mon$timestamp Started_At,
  st.mon$sql_text Statement_Text,
  st.mon$statement_timeout as Statement_Timeout,
  st.mon$statement_timer as Statement_Timer,
  r.mon$record_seq_reads as Non_indexed_Reads,
  r.mon$record_idx_reads as Indexed_Reads,
  r.mon$record_inserts as Records_Inserted,
  r.mon$record_updates as Records_Updated,
  r.mon$record_deletes as Records_Deleted,
  r.mon$record_backouts as Records_Backed_Out,
  r.mon$record_purges as Records_Purged,
  r.mon$record_expunges as Records_Expunged,
  r.mon$record_locks as Record_Locks,
  r.mon$record_waits as Record_Waits,
  r.mon$record_conflicts as Record_Conflicts,
  r.mon$backversion_reads as Backversion_Reads,
  r.mon$fragment_reads as Fragment_Reads,
  r.mon$record_rpt_reads as Record_Rpt_Reads,
  r.mon$record_imgc as Records_IMGC,
  io.mon$page_reads as Page_Reads,
  io.mon$page_writes as Page_Writes,
  io.mon$page_fetches as Page_Fetches,
  io.mon$page_marks as Page_Marks
from mon$statements st
join mon$attachments a on a.mon$attachment_id = st.mon$attachment_id
join mon$record_stats r on (st.mon$stat_id = r.mon$stat_id)
join mon$io_stats io on (st.mon$stat_id = io.mon$stat_id)
order by st.mon$timestamp
It never executed - just stalled. Then I tried to close my stalled connections from few more computers, but those connections got stuck in same manner. I believe it's beginning of Firebird hanging up.
In order to continue working - I restarted Firebird service and then my update query and monitoring queries worked just fine.
Between those steps I made two core dumps. I uploaded them in "2023-jun-13" folder in same share.
Firebird and libs are same as found in "2023-jun-08" folder.
From time to time we also have the same issue: Firebird stops accepting new connections, and selects from MON$ tables freeze in active clients. Unfortunately we could not produce a dump. We hope this issue will be resolved with the help of the new dumps.
I do not remember where to download the core dumps from. Also please put the binaries & debug info there.
@AlexPeshkoff I just resent access information to core dumps to your email.
Looks like you have embedded connections to your database, and those embedded connections sometimes hang. I see no other reason for the current behavior. To better understand what happens, the next time you have that problem please do the following in addition to the core dump: fb_lock_print -d /srv/fb4/dev_main.fdb -c -a >somefile.txt and add somefile.txt together with the core dump.
Today again FB started to show hanging symptoms and I made core dump and also fb_lock_print as suggested into somefile.txt ;) All requested files are uploaded to the same share under folder: 2023-jun-19
On 6/19/23 11:02, agx4ever wrote:
Today again FB started to show hanging symptoms and I made core dump and also fb_lock_print as suggested into somefile.txt ;) All requested files are uploaded to the same share under folder: 2023-jun-19
Does procedure XRF_IS_UNIT_COMPENSATED$S look interesting (non-trivial) to you in any respect?
I've found something interesting in this dump / lock_print. Are you ready to run a special build (some development checks that are missing in the regular production build will be turned on)?
Yes, of course I'm ready to run a special build. Just give it to me and I will definitely give it a try.
Update on the issue. I have installed the special build from @AlexPeshkoff with built-in debugging / core dump when suspicious conditions are met. Firebird has already crashed a few times and it should have produced core dumps, but because of my server misconfiguration all those core dumps were truncated and are useless. I have now reconfigured the server (a few times, actually) to save full core dumps and I hope that I will soon have the necessary debug info. My bet is that those debug/suspicious conditions are the right place in the code for this problem, because now when Firebird crashes, it produces a core dump and systemctl restarts Firebird. It no longer stays in a halted/hung state.
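The post does not say which setting caused the truncation. One common candidate on Linux is the core file size limit of the running process, which can be read from `/proc/<pid>/limits`. The sketch below parses that file; the helper names are made up, and an "unlimited" soft limit is treated as the only safe value, which is a simplification (the core pattern, a core dump handler's own limits, or free disk space can truncate dumps too).

```python
CORE_LABEL = "Max core file size"


def parse_core_limit(limits_text: str) -> tuple[str, str]:
    """Return (soft, hard) core size limits from /proc/<pid>/limits content."""
    for line in limits_text.splitlines():
        if line.startswith(CORE_LABEL):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(CORE_LABEL):].split()
            return fields[0], fields[1]
    raise ValueError("core file size limit not found")


def may_truncate(limits_text: str) -> bool:
    """True when the soft limit is anything other than 'unlimited'."""
    soft, _hard = parse_core_limit(limits_text)
    return soft != "unlimited"


def read_limits(pid: int) -> str:
    """Read the limits table of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return f.read()
```

For a running server, `may_truncate(read_limits(pid))` with the firebird process ID shows the limit the service actually runs with, which can differ from the limit of an interactive shell.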
I have successfully acquired 4 core dumps with the provided special FB build. All files are uploaded to the previous file share under folder "2023-jul-29". The debuginfo, Firebird binaries and libs used are there as well. If anything else is needed - just ask. Thank you for your support!
Good news - all 4 dumps show exactly the state that I expected; all are reasonably similar and rather informative. I also need your firebird.log and the exact times when the dumps were created - Dropbox loses file creation time info.
Very good news! I just uploaded the firebird.log file to the same folder. You can find these 4 uploaded core dumps and their exact times by reading from the end of the log file. There are older abnormal termination entries as well.
I see you've sent a very truncated log. What is in the log AFTER the abort is not interesting; I want to see whether something happened right BEFORE the abort.
It's the full log as it is on the server. I haven't removed any entry. There is nothing interesting there. Maybe there are options to output more detailed info? If yes - I need instructions on how to set up such logging.
Sorry - it looked truncated. And no - there are no such options. OK, a negative result is also a result.
Please install new special build from https://drive.google.com/drive/folders/14JaiJoRBNhgHBkfolnBHZZDP6pu9Owg0?usp=sharing As soon as you get first core - report about it please.
Thank you for your fast response! As you asked - I installed special build and today I got new coredump. As always - I uploaded it at previous file share under folder "2023-aug-08" and I attached log file as well.
FB3 is almost unaffected - AST on change encryption state should not happen too often (unlike TPC one since FB4). Anyway backported required part of fix to it.
@agx4ever You can upgrade to tomorrow's snapshot (just make sure it's OK on http://firebirdtest.com/); it will contain the fix for your bug. But if you can provide me 2 or 3 more dumps, it will help us make sure we fixed all possible causes of the bug.
After I installed the snapshot build with this fix, everything works fine and the Firebird server hasn't crashed for two months. It seems that this issue is fixed. Thank you for your fast support and problem debugging! When will this fix be published in a regular release build?
On 10/13/23 11:16, agx4ever wrote:
After I installed the snapshot build with this fix, everything works fine and the Firebird server hasn't crashed for two months. It seems that this issue is fixed. Thank you for your fast support and problem debugging! When will this fix be published in a regular release build?
It's present in 4.0.3, but I highly recommend you wait for 4.0.4 - new regressions are too bad.
I have a server that runs FB3 and I want to migrate to FB4. I have created a new test server and installed the latest FB4. It works fine until one day it doesn't: it can work for a few days, or at most 2 weeks, without problems, and then suddenly the firebird server just stops accepting new connections. On the server I can see firebird in the process list, but it simply doesn't accept new connections. When I stop and then start firebird, it works fine again. The error log does not show anything unusual.
I tried the same installation and the same configuration on a different server, to exclude hardware problems or software misconfiguration, and the result is the same - the FB process stops accepting new connections after some time.
OS: Linux, CentOS Stream release 8
Firebird 4.0.2 - Firebird-4.0.2.2816-0.amd64
--- firebird.conf ---
TempDirectories = /mnt/data0/fb4/tmp/
DefaultDbCachePages = 2048
UseFileSystemCache = true
TempBlockSize = 8M
TempCacheLimit = 64M
InlineSortThreshold = 2048
AuthServer = Srp256
AuthClient = Srp256, Srp
UserManager = Srp
ReadConsistency = 0
RemoteServicePort = 3050
LockMemSize = 1M
LockHashSlots = 8191
ServerMode = SuperClassic
--- databases.conf ---
dev_main = /mnt/data0/fb4/dev_main.fdb
{
    DatabaseGrowthIncrement = 128M
    DeadlockTimeout = 10
    DefaultDbCachePages = 32768
    FileSystemCacheThreshold = 1048576
    GCPolicy = combined
    LockHashSlots = 49999
    LockMemSize = 40M
}
--- no replication configuration ---
The last time the problem occurred I made fbguard and firebird process dumps with the "gcore" command. I can send those dumps by email (or some other convenient way, just tell me how). If there is anything else I can do to provide more information, please tell me.