Closed firebird-automations closed 6 years ago
Commented by: Sascha Michel (datiscum1)
In Firebird-3.0.0.31632-Beta2 SuperServer the firebird server crash with:
debian7 Wed Feb 11 14:33:15 2015 /opt/fb30/bin/fbguard: /opt/fb30/bin/firebird terminated abnormally (-1)
and then be automatically restarted from the guardian process.
debian7 Wed Feb 11 14:33:15 2015 /opt/fb30/bin/fbguard: guardian starting /opt/fb30/bin/firebird
Now you need not manually restart the server.
But can you not completely avoid the crash?
Thanks, Sascha Michel
Commented by: Sascha Michel (datiscum1)
LI-T6.3.0.31719 Firebird 3.0 Beta 2
debian7 Fri Mar 20 11:57:33 2015 /opt/fb30/bin/fbguard: /opt/fb30/bin/firebird terminated abnormally (-1)
debian7 Fri Mar 20 11:57:33 2015 /opt/fb30/bin/fbguard: guardian starting /opt/fb30/bin/firebird
debian7 Fri Mar 20 11:58:10 2015 /opt/fb30/bin/fbguard: /opt/fb30/bin/firebird terminated abnormally (-1)
debian7 Fri Mar 20 11:58:10 2015 /opt/fb30/bin/fbguard: guardian starting /opt/fb30/bin/firebird
debian7 Fri Mar 20 11:58:15 2015 /opt/fb30/bin/fbguard: /opt/fb30/bin/firebird terminated abnormally (-1)
debian7 Fri Mar 20 11:58:15 2015 /opt/fb30/bin/fbguard: guardian starting /opt/fb30/bin/firebird
debian7 Fri Mar 20 12:01:07 2015 /opt/fb30/bin/fbguard: /opt/fb30/bin/firebird terminated abnormally (-1)
debian7 Fri Mar 20 12:01:07 2015 /opt/fb30/bin/fbguard: guardian starting /opt/fb30/bin/firebird
After one of these crashes, the database file was corrupt.
Commented by: Sascha Michel (datiscum1)
More precisely the error bounded !!!!
For testing only this small database is required.
SET SQL DIALECT 3; SET NAMES UTF8; CREATE DATABASE '192.168.1.2:/shares/DB/EventTest.fdb' USER 'SYSDBA' PASSWORD 'XXXXXXX' PAGE_SIZE 16384 DEFAULT CHARACTER SET UTF8 COLLATION UTF8;
CREATE TABLE "EventSync" ( "Action" INTEGER NOT NULL, PID INTEGER NOT NULL ); ALTER TABLE "EventSync" ADD CONSTRAINT "PK_EventSync" PRIMARY KEY ("Action");
INSERT INTO EVENTSYNC ("Action", PID) VALUES (1, -1); INSERT INTO EVENTSYNC ("Action", PID) VALUES (2, -1);
COMMIT WORK;
The processes are now synchronized with the table "Event Sync" This has some very interesting results. Whenever Firebird hangs the programm is unregistering the events "SIBfibEventAlerter1-> Registered = false;" Connect / Disconnect always works when no events are used.
So finally i will say that the unregistering events ist the problem !!!! Problem exists in FB2.5 / FB3.
Viode demonstration: https://datiscum.com/FirebirdEventProblem.wmv Test programm: https://datiscum.com/FB_EventTest.exe
I hope this is helpful, more I cannot do.
Detail of the test program with FIBPlus: Main Loop: try { do {
if \( \!IBCConnection\-\>Connected \)
\{
L\_ConnectStatus\-\>Caption = "Before connect";
LLockStatus\-\>Caption = "Lock \-\-\-";
Update\(\);
IBCConnection\-\>Connected = true;
L\_ConnectStatus\-\>Caption = "After connect";
Update\(\);
if \( ChBox\_RegEvents\-\>Checked \)
\{
if \( \!SIBfibEventAlerter1\-\>Registered \)
\{
LLockStatus\-\>Caption = "Lock wait";
Update\(\);
LockR: try { if ( !pFIBTransaction1->Active ) pFIBTransaction1->Active = true;
if \( \!pFIBDataSet\_EventSync\-\>Active \)
pFIBDataSet\_EventSync\-\>Open\(\);
pFIBDataSet\_EventSync\-\>Edit\(\);
pFIBDataSet\_EventSync\-\>FieldByName\("PID"\)\-\>AsInteger = Application\-\>Tag ;
pFIBDataSet\_EventSync\-\>FieldByName\("Action"\)\-\>AsInteger = 1;
pFIBDataSet\_EventSync\-\>Post\(\);
LLockStatus\-\>Caption = "Lock set";
Update\(\);
\}
catch\(\.\.\.\)
\{
goto LockR;
\}
L\_EventStatus\-\>Caption = "Before Register events";
Update\(\);
SIBfibEventAlerter1\-\>Registered = true;
L\_EventStatus\-\>Caption = "After Register events";
LLockStatus\-\>Caption = "Before lock release";
Update\(\);
if \( pFIBTransaction1\-\>Active \)
pFIBTransaction1\-\>Rollback\(\);
LLockStatus\-\>Caption = "Lock released";
\}
\}
\}
if \(IBCConnection\-\>Connected \)
\{
Sleep\(2\);
L\_ConnectStatus\-\>Caption = "Before disconnect";
LLockStatus\-\>Caption = "Lock \-\-\-";
Update\(\);
if \( SIBfibEventAlerter1\-\>Registered \)
\{
LLockStatus\-\>Caption = "Lock wait";
LockU: try { if ( !pFIBTransaction1->Active ) pFIBTransaction1->Active = true; if ( !pFIBDataSet_EventSync->Active ) pFIBDataSet_EventSync->Open();
pFIBDataSet\_EventSync\-\>Edit\(\);
pFIBDataSet\_EventSync\-\>FieldByName\("PID"\)\-\>AsInteger = Application\-\>Tag ;
pFIBDataSet\_EventSync\-\>FieldByName\("Action"\)\-\>AsInteger = 1;
pFIBDataSet\_EventSync\-\>Post\(\);
LLockStatus\-\>Caption = "Lock set";
\}
catch\(\.\.\.\)
\{
if \( pFIBTransaction1\-\>Active \)
pFIBTransaction1\-\>Rollback\(\);
goto LockU;
\}
L\_EventStatus\-\>Caption = "Before unregister events";
Update\(\);
SIBfibEventAlerter1\-\>Registered = false;
L\_EventStatus\-\>Caption = "After unregister events";
Update\(\);
/* LLockStatus->Caption = "Before lock release"; Update(); if ( pFIBTransaction1->Active ) pFIBTransaction1->Rollback(); LLockStatus->Caption = "After lock release"; Update(); */ } IBCConnection->Connected = false; L_ConnectStatus->Caption = "After disconnect"; Update(); } Application->ProcessMessages(); }while( AdvGlBu_Start->Down ); } catch(...) { AdvGlBu_Start->Down = false; }
Commented by: @AlexPeshkoff
Trying to run your test I get:
err:module:import_dll Library BTHPROPS.DLL (which is needed by L"Z:\\usr\\home\\firebird\\tests\\4680\\FB_EventTest.exe") not found err:module:import_dll Library BLUETOOTHAPIS.DLL (which is needed by L"Z:\\usr\\home\\firebird\\tests\\4680\\FB_EventTest.exe") not found
I see no reason for bluetooth libraries to be needed to check for FB problems.
Commented by: Sascha Michel (datiscum1)
Please download it again and it will work under linux wine. Please don't forget to copy a fbclient.dll in path or in wine system.
The other version will work under XP without the dll but wine has problems.
I have tested with wine 1.4.1 under Debian 7.
Commented by: @AlexPeshkoff
Ok, now it works for me. How many test iterations attempts should pass to have it reproduced?
Commented by: Sascha Michel (datiscum1)
I open the program 10 times and after 4.000 iterations the problem should occur.
Commented by: @AlexPeshkoff
Ten clients ran 4000 iterations each - and server did not hang. I suppose something is wrong with your environment.
Commented by: Sascha Michel (datiscum1)
I gave you the test version with a workaround for the problem. Sorry please test it with the current version.
As a workaround, I'd start a new thread before unregistering the events and this closes the application and the error does not occur.
Please download it again and test it under an native windows version and not wine.
Commented by: Sean Leyne (seanleyne)
Alex,
Was this case re-tested with Sascha's correct script version?
Commented by: @AlexPeshkoff
Sascha, trying to download the test I get:
********** Secure Connection Failed
An error occurred during a connection to http://datiscum.com. Peer's Certificate has been revoked. (Error code: sec_error_revoked_certificate)
The page you are trying to view cannot be shown because the
authenticity of the received data could not be verified. Please contact the website owners to inform them of this problem. Alternatively, use the command found in the help menu to report this broken site. **********
Is 'not wine' requirement hard? I have no ability to run it under windows. And I can't catch how can client OS affect server behaviour so much.
Commented by: Sascha Michel (datiscum1)
I have set up the new certificate.
I use Firebird in many small companies ( 5 User ) for more than 15 years. And also have not had a problem.
The more users ( 40 users ) are logged on at the same time, the higher the probability is that it occurs.
At my customer, I have discovered the problem by accident. I have killed a client and the Firebird server is again continued to run normally.
So I came up with the idea to write a small test tool. And it worked - the same problem in my office.
But I'm not saying that it is easy to provoke the http://state.My patch that kills the program automatically if it hangs during closing - works.
I hope that the issue can be resolved.
Commented by: @hvlad
I can confirm rare AV's in fbclient but i see no hung of server
Commented by: Sascha Michel (datiscum1)
How many clients you have started simultaneously?
What Firebird version you tested?
Commented by: Sascha Michel (datiscum1)
The Ticket CORE4952 SuperServer could hung on database shutdown !? Resolved !
This is the same problem ! And it was marked as fixed.
Thanks ;-)
Commented by: @hvlad
Sascha,
i reproduced two issues in fbclient using your test: - AV - deadlock
both issues related to the race condition in events handling code. We with Alex work on it and hope fix it soon.
Commented by: @hvlad
> How many clients you have started simultaneously? Your FB_EventTest.exe
> What Firebird version you tested? Current FB3 on Windows
Commented by: Sascha Michel (datiscum1)
I hope the problem can be fixed even in 2.5.5, before it is released.
Commented by: @hvlad
Sascha,
CORE4952 is FB3 specific and not related with this ticket.
Finally, i reproduced server hung up (and few other issues), thanks to your test application (4 instances in my case). Set of fixes is committed into trunk (v3), please test it. Some of them will be backported into v2.5.5 a bit later.
Commented by: Sean Leyne (seanleyne)
Vlad,
Please post some notes describing when the deadlocks could occur, to provide context for readers of this case.
Commented by: @hvlad
Sean,
the issue affects SuperServer and SuperClassic (i.e. not ClassicServer).
It could happen when application used events and connects\disconnects so often that OS re-assigns value of just recently closed socket to the newly accepted one between two calls of select() in Firebird listener thread.
Commented by: @hvlad
Patch is backported into v2.5.5
Commented by: Sascha Michel (datiscum1)
While it has been changed a bit, but the problem still exists.
I have now been tested with the version "LI V6.3.0.32136 Firebird 3.0 Release Candidate 1".
I've tested on Windows and Linux (Wine) The problem occurs even faster than before. A newer version of the testprogramm is here: https://datiscum.com/EventTest.zip
To understand the problem, the program must be started several times. Otherwise, there is no problem if it is only started once. 6 simultaneous program runs should be enough.
The server now no longer hangs, but shuts down itself.
Here is the log: FB30 Wed Nov 11 11:18:50 2015 INET/inet_error: invalid socket in packet_receive errno = 22
FB30 Wed Nov 11 11:18:50 2015 SRVR_multi_thread: shutting down due to unhandled exception
FB30 Wed Nov 11 11:18:50 2015 Unable to complete network request to host "FB30". Error reading data from the connection. Invalid argument
FB30 Wed Nov 11 11:18:50 2015 SRVR_multi_thread: forcefully disconnecting a port
FB30 Wed Nov 11 11:18:50 2015 Shutting down the server with 4 active connection(s) to 1 database(s), 0 active service(s)
FB30 Wed Nov 11 11:18:50 2015 /opt/firebird/bin/fbguard: /opt/firebird/bin/firebird normal shutdown.
Commented by: @hvlad
I can not reproduce it on Windows with 6 instances - it run few times for more than 5000 circles with no problem. I tried SS, 32 and 64 bit, RC1 and current snapshot.
PS Could you make your app to remember server\dbname entered by user ?
Commented by: Sascha Michel (datiscum1)
I uploaded a new version of FB_EventText.zip. It accepts commandline switched ( -u , -p , -d ). I hope that's OK.
I test again with a fresh local install of Firebird under windows. With an empty database there were no errors.
Then i create two database triggers: SET TERM ^^ ;
CREATE TRIGGER "Con1" ACTIVE ON CONNECT POSITION 0 AS begin post_event 'Test'; end ^^
CREATE TRIGGER "Discon1" ACTIVE ON DISCONNECT POSITION 0 AS begin post_event 'Test'; end ^^
SET TERM ; ^^
After that , the error i am described is back. When i disable the triggers the error is gone.
I hope that it is reproducible.
Thanks
e.g. It looks as if the Disconnect trigger causing the problem alone. But in the real DB i use a "commit trigger".
Commented by: @hvlad
> I uploaded a new version of FB_EventText.zip. > It accepts commandline switched ( -u , -p , -d ). I hope that's OK.
Thanks, it really helps.
The issue with > INET/inet_error: invalid socket in packet_receive errno = 22 is fixed now, see CORE5014
The issue with ON DISCONNECT trigger and POST_EVENT is fixed too, see CORE5015
Commented by: Sascha Michel (datiscum1)
Hello,
i tested again with 3.0.0.32179 and after 160.000 Connect/disconnets i got no server hang.
I got one server crash: FB30 Tue Nov 17 12:18:35 2015 /opt/firebird/bin/fbguard: /opt/firebird/bin/firebird terminated abnormally (-1) dmesg: firebird[31436]: segfault at 7f2100000000 ip 00007f218536e3ee sp 00007f218192d6e0 error 6 in http://libEngine12.so[7f2184f35000+62f000]
Then I tested it with Firebird 2.5.5.26952-0.amd64, but the server hangs after \~20.000 connects / disconnects.
Commented by: @pavel-zotov
@echo off setlocal enabledelayedexpansion enableextensions
set fbc=C:\MIX\firebird\fb30 set usr=SYSDBA set pwd=masterke set host=localhost set port=3333 set dbnm=C:\MIX\firebird\QA\fbt-repo\tmp\e30.fdb
del C:\MIX\firebird\QA\fbt-repo\tmp\e30.fdb 2>nul
set pbak=%path% set path=%fbc%;%pbak% set logp=logs md %logp% 2>nul
del %logp%\%\~n0.*.log 2>nul del %logp%\%\~n0.*.err 2>nul
set sql=%\~n0-run.tmp del %sql% 2>nul
set winq=20 if not .%1.==.. set winq=%1
set itercnt=10000 if not .%2.==.. set itercnt=%2
@echo on echo create database '%host%/%port%:%dbnm%' user '%usr%' password '%pwd%'; show database; | %fbc%\isql -q %fbc%\isql %host%/%port%:%dbnm% -user %usr% -password %pwd% -q -nod -i c4680-ddl.sql @echo off
@rem %fbc%\isql %host%/%port%:%dbnm% -user %usr% -password %pwd% -nod -q -x
( echo -- Generated auto, do NOT edit. echo set bail on; echo set heading off; for /l %%i in (1,1,%itercnt%) do ( echo connect '%host%/%port%:%dbnm%' user '%usr%' password '%pwd%'; select 'iter #%%i of %itercnt%', current_timestamp, current_connection from rdb$database; commit; ) ) >> %sql%
Correct this batch appropriately to your environment (change values of fbc/host/port/usr/pwd and dbnm variables) and run it with TWO arguments: arg #1 = number of ISQL sessions that shoudl be launched at the same time and make concurrent connect/disconnect statements; arg#2 = number of iterations that should contain 'dummy' SQL script that will be executed by this batch.
I use arg#1 = 30 and did not specified arg#2 (i.e. let it be default value = 10'000).
Note: every ISQL session will stop immediately its execution on any error due to 'set bail on' statement that is .added at the beginning of 'dummy' script.
Now what I've get.
=== I. Legacy ===
When LEGACY plugin is used for authentification, ISQL sessions will "self-stop" very quickly in case of using SS: after about 130 iterations. And this number (\~130) will be almost the same (+/- 2..3) for ALL these sessions.
(i.e. without line "-IProvider::attachDatabase failed when loading mapping cache").
It's interestingly (in Classic) that ONE isql session (of 30 launched) did not get exception and continued its work, but no longer any concurrent attachments.
As of SuperClassic, I could not get exception.
=== II. Srp. ===
When Srp plugin is used for authentification, ISQL sessions can work MUCH LONGER. On SS I got this only after \~1970 iteration (of 10'000 total). For CS and SC this test not yet launched.
Commented by: @pavel-zotov
PS. WI-V3.0.0.32179, OS = Win XP.
Commented by: @pavel-zotov
set term ^;
create or alter trigger trg\_disc active on disconnect position 0 as
begin
post\_event 'bar';
end
^
set term ;^
commit;
show trigger trg\_disc;
===
Commented by: @AlexPeshkoff
Sascha!
When you post this info:
I got one server crash: FB30 Tue Nov 17 12:18:35 2015 /opt/firebird/bin/fbguard: /opt/firebird/bin/firebird terminated abnormally (-1) dmesg: firebird[31436]: segfault at 7f2100000000 ip 00007f218536e3ee sp 00007f218192d6e0 error 6 in http://libEngine12.so[7f2184f35000+62f000]
it tells almost nothing to us about reasons of failure. Taking into an account that: - you can use win client + linux server - test run was about 160000 iterations the result you had is kind of unique, and it's very sad that we have no core dump of server for it. Can you take a look at http://www.ibphoenix.com/resources/documents/search/doc_36 and provide stack traces? In many cases they do have enough info for solving a problem.
Commented by: @pavel-zotov
> For CS and SC this test not yet launched.
SuperClassic has *no* such problem. All attachments works as long as I could wait - more than 3 hours and did not encounter exception.
CSPROG Wed Nov 18 17:23:08 2015 LockManager::LockManager() Database is probably already opened by another engine instance in another Windows session
CSPROG Wed Nov 18 17:26:40 2015 SERVER/process_packet: broken port, server exiting
Updated batch with some addition logging commands is in attached file.
PS. It's look strange but when database has fw = OFF (yes, mon$forced_writes = ZERO) then LESS exceptions occured comparing with fw = ON (mon$forced_writes = ONE).
Commented by: @hvlad
Pavel,
while your test allows to find and fix some real issues, it have *nothing common* with ticket subject. I.e. it is absolutely not related with events.
Commented by: @pavel-zotov
I've doubted about whether to create new ticket or no. But presence of db-level trigger (and only on DISconnect event) is mandatory for reproducing mentioned effect, so I've decided to put results here. If this was wrong, I can create new ticket and copy last messages there - tell me if this is needed.
Commented by: @pavel-zotov
- ?
Commented by: @hvlad
Pavel,
> What about these messages that was in Classic (see my posts above):
So far i could speak about 3 different issues showed by your test:
- race condition when last attachment is releasing database object and new attachment arrives at unhappy moment this one i fixed recently
- another AV related with run-time stats handling in SS mode it will be fixed soon
- issue with shared memory initialization, CS mode is not reproduced so far
None of them have relation with events, believe me or not.
Commented by: @pavel-zotov
> None of them have relation with events, believe me or not.
I'm not speaking about EVENTS as reason of AV (and did not think about them :-)). These issues relate to DISCONNECT trigger only, am I correct ?
Commented by: @hvlad
That AV's have no relation with ON DISCONNECT TRIGGER.
Commented by: @hvlad
Pavel,
issues with shared memory initialization in CS mode should be fixed now. Please, test tomorrows snapshot build.
Submitted by: Sascha Michel (datiscum1)
Is related to CORE5015 Is related to CORE5014 Is related to CORE5017
Attachments: screenshot-1.jpg c4680-test-batch.zip
Approximately every month i see a "firebird superserver" that stalled. I must restart the server with the init script , every connect to the server is impossible. Firebird offers the following message in the log file:
Koenig_DB_server (Server) Tue Oct 21 10:47:50 2014 Shutting down the server with 26 active connection(s) to 1 database(s), 0 active service(s)
Koenig_DB_server (Server) Thu Nov 13 07:50:54 2014 Shutting down the server with 16 active connection(s) to 1 database(s), 0 active service(s)
Koenig_DB_server (Server) Thu Jan 15 09:42:31 2015 Shutting down the server with 30 active connection(s) to 1 database(s), 0 active service(s)
Koenig_DB_server (Server) Fri Jan 30 10:24:50 2015 Shutting down the server with 27 active connection(s) to 1 database(s), 0 active service(s)
Since it was years long so goes, I have written a little program, with which I was able to replicate the problem.
1 .A connection is established to the database server 2. 9 events registered. 3. Waited a short time ( millisecond ) 4. Unregister the events. 5. Connection closed. 6. Start all over again.
If it is possible to kill the process that caused the deadlock, the server continues to run normal. If no events were registered it does not happen.
I was able to reproduce the problem under "super server firebird 2.5 / 3beta2 linux / windows".
Here's a video that shows that problem. https://datiscum.com/FB_Test_x264.mp4
The program can be downloaded here. https://datiscum.com/FirebirdTest.7z
The downloads are available only for a few days.
I hope this was helpful and the problem can be eliminated.
Regards, Sascha Michel
Commits: FirebirdSQL/firebird@a6c6d15daa4555cbfb53632e5b36c16653e6033f FirebirdSQL/firebird@f7e7c513a1b79cc0d59015676c8d8bf1191015cf FirebirdSQL/firebird@6a85b46ea701d112583aa7989bf92f8c722999df FirebirdSQL/firebird@a6d615c0cd6905a03318bb9b9d8a7625c08eb68a FirebirdSQL/firebird@ed1e5e65a5b4fbcb05489c0f5cb9744fd3d54051 FirebirdSQL/firebird@fc4063fa160e53a61c222fe3bad0aa34808ba592 FirebirdSQL/firebird@4068a225b45dbace1f28c2092642dd333ee6d116 FirebirdSQL/firebird@2ab15d1e5f34240f2b3d42517d49ae040ac246c0 FirebirdSQL/fbt-repository@f8758da9cba37741e8ae3e7915346f70fbd9d76d FirebirdSQL/fbt-repository@311169807bcca7f88d9545283b1e707e5073e3bb FirebirdSQL/fbt-repository@a70c1f384428d9defbc9bdfc5b773c2c7d8691a7 FirebirdSQL/fbt-repository@784b47d1308c264933bfc984c723619ba54437a5 FirebirdSQL/fbt-repository@7ee5da0530c82150b0a833d793e161b3bac69428 FirebirdSQL/fbt-repository@bfbf5a82384293e92871861ce5aa140e5a5c8383
====== Test Details ======
Decided skip implementation after letter from hvlad, 28.12.2017 12:33.