Closed firebird-automations closed 13 years ago
Commented by: @hvlad
INET/inet_error: accept errno = 10038
>>>> MSDN WSAENOTSOCK 10038
Socket operation on nonsocket.
An operation was attempted on something that is not a socket. Either the socket handle parameter did not reference a valid socket, or for select, a member of an fd_set was not valid.
>>>> MSDN
Since the error was raised by the call to accept(), the listener socket itself is bad. Don't ask me why or how it went bad. Firebird is able to detect this condition and remove the bad socket from its internal list (closing the connection correctly, of course). Hence the next message:
INET/select_wait: found "not a socket" socket : 504
504 is the numeric value of the bad socket.
But this mechanism does not seem to handle a listener socket (all the cases known to me involved worker sockets): the bad socket is not removed from the list and the network server enters an endless loop. This is the error I am going to fix.
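The failure mode described above can be illustrated with a minimal Python sketch. This is not Firebird's actual C++ code; the names `prune_bad_descriptors`, `serve_once` and `demo` are hypothetical, and the sketch assumes a POSIX system, where the error is EBADF instead of Winsock's WSAENOTSOCK (10038). The point is only that a select loop must drop a bad descriptor, listener included, instead of retrying the same failing call.

```python
import os
import select
import socket


def prune_bad_descriptors(fds):
    """Probe each descriptor alone and split the list into (good, bad)."""
    good, bad = [], []
    for fd in fds:
        try:
            select.select([fd], [], [], 0)  # zero timeout: validity probe only
            good.append(fd)
        except (OSError, ValueError):
            bad.append(fd)  # the "not a socket" case
    return good, bad


def serve_once(fds):
    """One pass of a select loop that never retries with a known-bad set."""
    try:
        readable, _, _ = select.select(fds, [], [], 0)
        return fds, readable
    except OSError:
        # Without this pruning step the same failing call would repeat forever.
        good, bad = prune_bad_descriptors(fds)
        print("dropped bad descriptors:", bad)
        return good, []


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    worker, peer = socket.socketpair()
    listener_fd = listener.fileno()
    watched = [listener_fd, worker.fileno()]

    # Simulate the listener handle going bad behind the server's back.
    os.close(listener.detach())

    watched, _ = serve_once(watched)         # first pass: error, bad fd pruned
    watched, readable = serve_once(watched)  # second pass: healthy again
    result = (listener_fd in watched, worker.fileno() in watched, readable)
    worker.close()
    peer.close()
    return result
```

Calling `demo()` closes the listener descriptor on purpose; the first pass drops it from the watched list, and the second pass runs without error on the remaining worker socket. A real server would also need to recreate the listener, which this sketch does not attempt.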
As for the corrupted indices: we know you have a lot of indices, so it is no wonder that some of them were corrupted after "stop hardly the firebird process". Anyway, I would like to look at the part of firebird.log with the corruption errors.
Commented by: vander clock stephane (arkadia)
thanks Vlad,
>> But this ability seems not ready to deal with listener socket (all known to me cases was about worker sockets) and bad socket not removed from list and network server enters and endless loop. This is error i going to fix.
great !
>> As for corrupted indices - we know you have a lot of indices so no wonder some of them was corrupted after "stop hardly the firebird process".
Yes, it's possible, but we often get a corrupted index on our database (every 2 to 3 weeks). The problem is that to check the database we must fully stop the server to run gstat, and gstat takes a few hours to run every time. So most of the time we detect the corrupted index only when the server is "over" and we have no choice but to fully stop our services, and we use that time to run gstat ...
>> Anyway, i would like to look at that part of firebird.log with corruption errors.
Ouch, the firebird.log was so big after this bug that we were forced to delete it. But all the rows in it were the same, because the Firebird server kept adding these rows:
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
But after killing the firebird process (by stopping the service) and restarting it, Firebird was working OK!
Commented by: vander clock stephane (arkadia)
>> But this ability seems not ready to deal with listener socket (all known to me cases was about worker sockets) and bad socket not removed from list and network server enters and endless loop. This is error i going to fix.
Is this fix in the latest release of Firebird 2.5?
Commented by: @hvlad
The fix is still not implemented, sorry. Does it bother you regularly?
Commented by: vander clock stephane (arkadia)
Thanks Vlad,
Yes, it crashed again just this morning. I think we can say it happens once a month on average, but when it happens everything is down :(
This morning I have
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 508
and last time it was
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
Apart from this (508 instead of 504) it is the same scenario: a very big firebird.log file, growing and growing.
Commented by: Artem Kuzmenko (artyom-ace)
I had a crash today with this bug. The log size and content surprised me! I attach the log to this message. The DB has no errors after the crash. I stopped the server with Firebird Server Control, but it started again only after a reboot.
OS: Win 2003 R2 Enterprise SP2 32-bit
Firebird 2.5.0.26054, default install
P.S. All DBs use "execute statement on external" inside this server ...
Commented by: Artem Kuzmenko (artyom-ace)
I tried to find a pattern myself, and I found one. On my Win2003R2 (I installed the latest version), the Firebird 2.5.0.26074 server holds 3 DBs with ODS 11.2. The server has many external connections. If one external PC with Firebird 2.1 (2.1.3.18185) is connected to the Firebird 2.5 server, all is OK; if 2 or more external PCs with Firebird 2.1 are connected, I get this crash.
Rus (translated): I described it in English as best I could; here is the same description as I first wrote it in Russian.
The server with Firebird 2.5.0.26074 installed holds databases with ODS 11.2 that use "execute statement on external" within this same server; I mention this just in case it matters. If an external TCP connection comes from a machine with Firebird 2.1 installed, then as a rule the connection goes through and everything is OK (even with several programs on that machine; in my case 3 worked normally). As soon as I have 2 or more connections to the 2.5 server from different client machines running 2.1, the glitches start: either the client application hangs solid (in the best case) or the 2.5 server goes down with this error.
These are just my observations; I hope they help to fix this annoying bug quickly, because the database on 2.5 is in production and there is no longer any way to roll back to 2.1 :(
Commented by: @hvlad
Artem,
feel free to contact me privately to figure out all details
Commented by: @hvlad
Stephane, Artem,
please answer a few questions:
a) Do you have any antivirus or firewall software installed on the host where the Firebird server is running?
b) How many connections are established at the time the error happens?
c) Could you run netstat -p tcp -n at the time the error happens and post the results here?
Commented by: vander clock stephane (arkadia)
a) do you have any antivirus or firewall software installed at the host where Firebird server is running ? => NO, absolutely nothing; Windows 2008 R2 64-bit
b) how many connections established at time when error happens ? => I don't really know, but around 100?
c) could you run netstat -p tcp -n at time when error happens and post results here ? => I will wait for the next time the error happens and do it
Commented by: @hvlad
And one more question: d) do you have connections using "localhost", i.e. local TCP connections?
Commented by: Artem Kuzmenko (artyom-ace)
For the last few days I have tried to provoke the bug. On the working system (where it happened regularly) every Firebird installation was upgraded to the latest version; I had no choice.
I created a Bug Generator :) : 4 virtual machines with the same OS, programs and settings as the old working system. But without effect so far :(
a) do you have any antivirus or firewall software installed at the host where Firebird server is running ? => Kaspersky 6 for Server is installed. The bug happens with Kaspersky both on and off. But it is not uninstalled yet.
b) how many connections established at time when error happens ? d) do you have connections using "localhost ", i.e. local TCP connections ? => Around 10 on the working system. But on my notebook, where I develop my software, Firebird went down with this bug locally yesterday! (First time; log saved.) Firebird had a few dropped connections and maybe one normal one. The server went down only at the moment when I tried to connect to the DB. Interestingly, the log grows at a speed proportional to the CPU speed.
KIS9 is installed on my notebook, but it runs only when I start it manually. Usually it is off.
When I can reliably reproduce the bug, or if I find a new fact, I will inform you immediately.
Commented by: @hvlad
Artem, are you still trying to reproduce it ?
Commented by: Artem Kuzmenko (artyom-ace)
Sorry, too much work :( A few times I tried to reproduce the bug on 5 VMware virtual machines, but without effect :( In my company, after all clients and the server were reinstalled with the latest FB 2.5, I no longer see this error.
I still suspect that the culprit is a connection from a client with FB 2.0 or 2.1 installed ...
Commented by: vander clock stephane (arkadia)
dear vlad,
Hmm, it has been a long time since this bug last appeared ... bugs of this kind are very, very hard to track. Right now I am fighting with Windows to get a dump when the firebird process crashes. I found a way, so I will probably write it up somewhere in case someone else needs to do it?
Commented by: @hvlad
Stephane,
of course, it could be helpful for others if you found a way to produce crash dumps :) BTW, if you have such dump - send it to me, please (or make available for download)
Commented by: Artem Kuzmenko (artyom-ace)
Yes! I did it!!! I can crash the system with this bug at any time. Please tell me, step by step, what I have to do so that you get maximum info about the bug.
It happens when my program connects to 3 DBs on the server with FB 2.5 from client machines, in these steps:
1. Run the program and connect from a 2.5 client - ok
2. Run the program and connect from a 2.5 client - ok
3. Run the program and try to connect from a 2.1 client - the program hangs (maybe a few times)
4. ... a few more attempts to run the program and connect from a 2.1 client, and the server crashes.
Commented by: @hvlad
Artem,
could you send me by e-mail all the necessary files (program and DB) with instructions on how to reproduce the bug, please? Or make them available for download and send me the URL.
Commented by: vander clock stephane (arkadia)
I DID IT TOO !!! But in a different and, I think, easier way :)
I installed the latest version of FB 2.5 on the server, and the latest FB 2.5 fbclient DLL on the client too (so it is not related to the version of the DLL).
Important: on the server I turned the firewall ON, except for Firebird's port 3050 (this blocks the port used by events).
After that it is easy: on the client side I simply launch an "Event" listener process :) Wait for 1 or 2 connection errors, and fbserver starts to take 100% of the CPU and write to firebird.log in a loop!
This was not the situation on our production server (because there the firewall is open for events), but it is a 100% working way to simulate the bug!
Attached is my compiled demo (in Delphi) of an event listener application. Very easy to set up :)
Commented by: vander clock stephane (arkadia)
the demo application to create an event listener thread
the source code:
/////////////////////////////
///// TALFBXEventThread /////
/////////////////////////////

{********************************************************}
{!!we guess that this procedure will be not multithread!!
 but we have a strange bug when Fsignal is TEvent, when we
 disconnect the FBserver, them an EaccessViolation in ntdll is
 raise in the waitfor in the execute function}
procedure ALFBXEventCallback(UserData: Pointer; Length: Smallint; Updated: PAnsiChar); cdecl;
begin
  if (Assigned(UserData) and Assigned(Updated)) then begin
    with TALFBXEventThread(UserData) do begin
      if FEventCanceled then begin
        SetEvent(FSignal);
        Exit;
      end;
      Move(Updated^, fResultBuffer^, Length);
      FQueueEvent := True;
      SetEvent(FSignal);
    end;
  end
  else begin
    //if Updated = nil then it's look like it's an error
    //like connection lost for exemple or a call to EventCancel
    with TALFBXEventThread(UserData) do begin
      if FEventCanceled then begin
        SetEvent(FSignal);
        Exit;
      end;
      FQueueEvent := False;
      SetEvent(FSignal);
    end;
  end;
end;

{***************************************************}
procedure TALFBXEventThread.initObject(aDataBaseName,
                                       aLogin,
                                       aPassword,
                                       aCharSet: String;
                                       aEventNames: String;
                                       aConnectionMaxIdleTime: integer;
                                       aNumbuffers: integer;
                                       aOpenConnectionExtraParams: String);
Var aLst: TStrings;
    i: integer;
begin
  //if we put lower than tpNormal it seam than the
  //EventThread.Free will never return !
  //Priority := tpNormal;
  FreeOnTerminate := False;
  FConnectionMaxIdleTime := aConnectionMaxIdleTime;
  if FConnectionMaxIdleTime <= 0 then FConnectionMaxIdleTime := INFINITE;
  FDBHandle := nil;
  FQueueEvent := False;
  fResultBuffer := Nil;
  FSignal := CreateEvent(nil, true, false, '');
  fcompleted := False;
  fStarted := False;
  FEventCanceled := False;
  FWaitingSignal := False;
  FDataBaseName:= aDataBaseName;
  FCharset:= ALFBXStrToCharacterSet(aCharSet);
  fOpenConnectionParams := 'user_name = '+aLogin+'; '+
                           'password = '+aPassword+'; '+
                           'lc_ctype = '+aCharSet;
  if aNumbuffers > -1 then fOpenConnectionParams := fOpenConnectionParams + '; num_buffers = ' + inttostr(aNumbuffers);
  if aOpenConnectionExtraParams <> '' then fOpenConnectionParams := fOpenConnectionParams + '; ' + aOpenConnectionExtraParams;
  aLst := TstringList.Create;
  Try
    Alst.Text := Trim(alStringReplace(aEventNames,';',#13#10,[rfReplaceALL]));
    i := 0;
    while (i <= 14) and (i <= Alst.Count - 1) do begin
      fEventNamesArr[i] := Trim(Alst[i]);
      inc(i);
    end;
    fEventNamesCount := i;
    while i <= 14 do begin
      fEventNamesArr[i] := '';
      inc(i);
    end;
  Finally
    Alst.Free;
  End;
end;

{*************************************************}
constructor TALFBXEventThread.Create(aDataBaseName,
                                     aLogin,
                                     aPassword,
                                     aCharSet: String;
                                     aEventNames: String; // ; separated value like EVENT1;EVENT2; etc...
                                     aApiVer: TALFBXVersion_API;
                                     const alib: String = GDS32DLL;
                                     const aConnectionMaxIdleTime: integer = -1;
                                     const aNumbuffers: integer = -1;
                                     const aOpenConnectionExtraParams: String = '');
begin
  fLibrary := TALFBXLibrary.Create(aApiVer);
  fLibrary.Load(alib);
  FownLibrary := True;
  initObject(aDataBaseName,
             aLogin,
             aPassword,
             aCharSet,
             aEventNames,
             aConnectionMaxIdleTime,
             aNumbuffers,
             aOpenConnectionExtraParams);
  inherited Create(False); // see http://www.gerixsoft.com/blog/delphi/fixing-symbol-resume-deprecated-warning-delphi-2010
end;

{*************************************************}
constructor TALFBXEventThread.Create(aDataBaseName,
                                     aLogin,
                                     aPassword,
                                     aCharSet: String;
                                     aEventNames: String; // ; separated value like EVENT1;EVENT2; etc...
                                     alib: TALFBXLibrary;
                                     const aConnectionMaxIdleTime: integer = -1;
                                     const aNumbuffers: integer = -1;
                                     const aOpenConnectionExtraParams: String = '');
begin
  fLibrary := alib;
  FownLibrary := False;
  initObject(aDataBaseName,
             aLogin,
             aPassword,
             aCharSet,
             aEventNames,
             aConnectionMaxIdleTime,
             aNumbuffers,
             aOpenConnectionExtraParams);
  inherited Create(False); // see http://www.gerixsoft.com/blog/delphi/fixing-symbol-resume-deprecated-warning-delphi-2010
end;

{********************************************}
procedure TALFBXEventThread.AfterConstruction;
begin
  inherited;
  while (not fStarted) do sleep(10);
end;

{***********************************}
destructor TALFBXEventThread.Destroy;
begin

  //first set terminated to true
  If not Terminated then Terminate;

  //in case the execute in waiting fire the Fsignal
  while (not fWaitingSignal) and (not fCompleted) do sleep(10);
  if (not fCompleted) then setEvent(FSignal);
  while (not fCompleted) do sleep(10);
  //sleep(100); => i don't know the purpose of this so i comment it !

  //close the fSignal handle
  CloseHandle(FSignal);

  //free the library
  if FownLibrary then fLibrary.Free;

  //destroy the object
  inherited;

end;

{**********************************}
procedure TALFBXEventThread.Execute;
var aEventBuffer: PAnsiChar;
    aEventBufferLen: Smallint;
    aEventID: Integer;
    aStatusVector: TALFBXStatusVector;

  {-----------------------------}
  Procedure InternalFreeLocalVar;
  Begin

    //free the aEventID
    if aEventID <> 0 then begin
      FEventCanceled := True;
      Try
        ResetEvent(Fsignal);
        FLibrary.EventCancel(FDbHandle, aEventID);
        //in case the connection or fbserver crash the Fsignal will
        //be never signaled
        WaitForSingleObject(FSignal, 60000);
      Except
        //in case of error what we can do except suppose than the event was canceled ?
        //in anyway we will reset the FDbHandle after
      End;
      FEventCanceled := False;
    end;
    aEventID := 0;

    //free the aEventBuffer
    if assigned(aEventBuffer) then begin
      Try
        FLibrary.IscFree(aEventBuffer);
      Except
        //paranoia mode ... i never see it's can raise any error here
      End;
    end;
    aEventBuffer := nil;

    //free the FResultBuffer
    if assigned(FResultBuffer) then begin
      Try
        FLibrary.IscFree(FResultBuffer);
      Except
        //paranoia mode ... i never see it's can raise any error here
      End;
    end;
    FResultBuffer := nil;

    //free the FDBHandle
    if assigned(FDBHandle) then begin
      Try
        FLibrary.DetachDatabase(FDBHandle);
      Except
        //yes the function before can do an exception if the network connection
        //was dropped... but not our bussiness what we can do ?
      End;
    end;
    FDBHandle := Nil;

    //ok, if we remove the instruction below then sometime, when we close
    //the program we can have an eAcessViolation. to see it simply run
    //a program to run and imediatly close and have some delay/sleep
    //in other unit (3seconds it's enalfe). Run Winreguardian -nothingtolaunch
    //for exemple
    //sleep(100);

  End;

var aCurrentEventIdx: integer;
    aMustResetDBHandle: Boolean;
begin
  //to be sure that the thread was stated
  fStarted := True;

  aEventBuffer := nil;
  aEventID := 0;
  aEventBufferLen := 0;
  aMustResetDBHandle := True;

  while not Terminated do begin
    Try

      //if the DBHandle is not assigned the create it
      //FDBHandle can not be assigned if for exemple
      //an error (disconnection happen)
      if aMustResetDBHandle then begin

        //set the FMustResetDBHandle to false
        aMustResetDBHandle := False;

        //free the local var
        InternalFreeLocalVar;

        //First init FDBHandle
        FLibrary.AttachDatabase(FDataBaseName,
                                FDBHandle,
                                fOpenConnectionParams);

        //register the EventBlock
        aEventBufferLen := FLibrary.EventBlock(aEventBuffer,
                                               fResultBuffer,
                                               fEventNamesCount,
                                               PAnsiChar(fEventNamesArr[0]),
                                               PAnsiChar(fEventNamesArr[1]),
                                               PAnsiChar(fEventNamesArr[2]),
                                               PAnsiChar(fEventNamesArr[3]),
                                               PAnsiChar(fEventNamesArr[4]),
                                               PAnsiChar(fEventNamesArr[5]),
                                               PAnsiChar(fEventNamesArr[6]),
                                               PAnsiChar(fEventNamesArr[7]),
                                               PAnsiChar(fEventNamesArr[8]),
                                               PAnsiChar(fEventNamesArr[9]),
                                               PAnsiChar(fEventNamesArr[10]),
                                               PAnsiChar(fEventNamesArr[11]),
                                               PAnsiChar(fEventNamesArr[12]),
                                               PAnsiChar(fEventNamesArr[13]),
                                               PAnsiChar(fEventNamesArr[14]));

        //the First EventQueue
        ResetEvent(Fsignal);
        FLibrary.EventQueue(FdbHandle,
                            aEventID,
                            aEventBufferLen,
                            aEventBuffer,
                            @ALFBXEventCallback,
                            self);
        if WaitForSingleObject(FSignal, 60000) <> WAIT_OBJECT_0 then raise Exception.Create('Timeout in the first call to isc_que_events');
        FLibrary.EventCounts(aStatusVector,
                             aEventBufferLen,
                             aEventBuffer,
                             fResultBuffer);

        //set the FQueueEvent to false in case the next
        //WaitForSingleObject fired because of a timeout
        FQueueEvent := False;

        //the 2nd EventQueue
        ResetEvent(Fsignal);
        FLibrary.EventQueue(FdbHandle,
                            aEventID,
                            aEventBufferLen,
                            aEventBuffer,
                            @ALFBXEventCallback,
                            self);

      end;

      //if terminated then exit;
      if Terminated then Break;

      //set fWaitingsignal
      fWaitingsignal := True;

      //stop the thread stile a event appear
      WaitForSingleObject(FSignal, FConnectionMaxIdleTime); //every 20 minutes reset the connection

      //set fWaitingsignal
      fWaitingsignal := False;

      //if terminated then exit;
      if Terminated then Break;

      //if an event was set
      if (FQueueEvent) then begin

        //retrieve the list of event
        FLibrary.EventCounts(aStatusVector,
                             aEventBufferLen,
                             aEventBuffer,
                             fResultBuffer);

        //if it was the event
        for aCurrentEventIdx := 0 to 14 do
          if aStatusVector[aCurrentEventIdx] <> 0 then onEvent(fEventNamesArr[aCurrentEventIdx],aStatusVector[aCurrentEventIdx]);

        //reset the FQueueEvent
        FQueueEvent := False;

        //start to listen again
        ResetEvent(Fsignal);
        FLibrary.EventQueue(FdbHandle,
                            aEventID,
                            aEventBufferLen,
                            aEventBuffer,
                            @ALFBXEventCallback,
                            self);

      end

      //it must be an error somewhere
      else aMustResetDBHandle := True;

    Except
      on E: Exception do begin
        //Reset the DBHandle
        aMustResetDBHandle := True;
        OnException(E);
      end;
    End;
  end;

  Try
    //free the local var
    InternalFreeLocalVar;
  Except
    on E: Exception do begin
      OnException(E);
    end;
  End;

  //set completed to true
  //we need to to this because i don't know why
  //but on isapi the waitfor (call in thread.free)
  //never return.
  //but i don't remenbered if the free was call in the initialization
  //section of the ISAPI DLL (and that bad to do something like this
  //in initialization or finalization).
  fcompleted := True;
end;
Commented by: @dyemanov
Sounds similar to CORE3170.
Commented by: @hvlad
No, it is different bug. I'm already testing patch and hope to commit it soon.
Commented by: vander clock stephane (arkadia)
Vlad, I lost the email you sent me about the result of the test on the new version you made. So far it is OK, it does not raise the exception, BUT I ran the test only on our beta server, which has no real activity. But as this bug was simple to reproduce (once we knew the reason), I think it is OK now!
Commented by: Ann Lynnworth (annfire)
I also had this problem, but I could recreate it within a few seconds. The symptom was that the client would hang with an ISC disconnect error message.
ISC ERROR CODE:335544721
ISC ERROR MESSAGE: Unable to complete network request to host "(snip)". Failed to establish a connection.
Meanwhile the server side would accumulate a giant log file (larger than 33 GB) with endless repetition of these two:
FB101 (Server) Mon May 30 01:25:05 2011 INET/select_wait: found "not a socket" socket : 536
FB101 (Server) Mon May 30 01:25:05 2011 INET/inet_error: accept errno = 10038
To give some context and extra keywords: I was testing IBObjects replication, which uses events. Activating the replication triggered the myriad problems (often including Firebird crashing).
As Firebird server v2.5.1 (which supposedly fixes this issue) is not available, a workaround may be of interest to other firebird admins. It is obvious in retrospect. (a) Edit firebird.conf and set a fixed port for events, e.g. 3051. Restart Firebird service. (b) Change the firewall rules to allow traffic on that port, limited by ip number etc as relevant. Once the firewall allows traffic on the fixed event port, replication works (yes, the app no longer hangs).
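As a sketch of step (a) above: the firebird.conf parameter that pins the events port is, as far as I know, `RemoteAuxPort` (its default of 0 means a random port is chosen). The value 3051 is only the example port mentioned above; any free port that the firewall allows should do.

```
# firebird.conf (fragment)
# Fixed TCP port for event notifications; 0 means "pick a random port".
RemoteAuxPort = 3051
```

The Firebird service has to be restarted for the change to take effect, as noted in step (a).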
Commented by: @hvlad
Ann,
> As Firebird server v2.5.1 (which supposedly fixes this issue) is not available
are you aware of daily snapshot builds ?
Submitted by: vander clock stephane (arkadia)
Attachments: ALFBXEvent.zip firebird.rar
Votes: 1
These bugs are really hard to reproduce, and it is hard to understand what makes them happen. I can only say what we see.
The database server stops answering all clients. In firebird.log we have this:
DATABASESERVER Sun Aug 29 04:53:57 2010 INET/inet_error: read errno = 10054
DATABASESERVER Sun Aug 29 04:56:59 2010 INET/inet_error: read errno = 10054
DATABASESERVER Sun Aug 29 04:58:53 2010 INET/inet_error: read errno = 10054
DATABASESERVER Sun Aug 29 05:01:27 2010 INET/inet_error: read errno = 10054
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
... and so on for more than 19 GB! The firebird.log kept growing, with these lines added over and over:
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/inet_error: accept errno = 10038
DATABASESERVER Sun Aug 29 05:02:59 2010 INET/select_wait: found "not a socket" socket : 504
even after we closed/killed all the clients connected to the server! We were forced to kill the firebird process ...
After running gstat on the database, we saw that a lot of indexes (around 10) were corrupted, in different tables.
Currently it is still impossible to run the Firebird server for more than 2 weeks without hitting a problem that in every case results in a corrupted database...
Commits: FirebirdSQL/firebird@1e35bc97c8cf704900c63480f63d3a1a6048d246 FirebirdSQL/firebird@90b88fdec327a1e58dba086f2c2e89c6a0ea58b5 FirebirdSQL/firebird@b48821ac022eeeaa2c70865255639666bf2db952