Open andrewrimmer opened 1 year ago
Whilst this issue is often observed at session end, it doesn't look like it is directly related to session end.
It looks like Quickfixn gets into trouble during the day, and stops accepting new connections.
When netstat is run there are perhaps 150-200 CLOSE_WAIT connections. Could this be what is preventing new connections?
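For context on the CLOSE_WAIT question: a socket sits in CLOSE_WAIT when the remote side has closed the connection but the local application has not yet closed its own end, so a growing count usually points at sockets the process is not releasing. Whether that is what blocks new connections here is not established. A minimal Python sketch for tracking the count over time, assuming `netstat -an` style output and a hypothetical FIX port of 9876 (the function names are made up for illustration):

```python
import subprocess
from collections import Counter

# TCP state names as printed by netstat on Windows and Linux.
TCP_STATES = {
    "LISTEN", "LISTENING", "ESTABLISHED", "CLOSE_WAIT", "TIME_WAIT",
    "FIN_WAIT_1", "FIN_WAIT_2", "FIN_WAIT1", "FIN_WAIT2",
    "SYN_SENT", "SYN_RECV", "SYN_RECEIVED", "LAST_ACK", "CLOSING",
}


def count_tcp_states(netstat_text, port=None):
    """Count TCP connections per state, optionally for one local port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].lower().startswith("tcp"):
            continue
        # The first address-like token is the local address, the second the remote.
        addresses = [t for t in tokens[1:] if ":" in t]
        state = next((t for t in tokens if t in TCP_STATES), None)
        if len(addresses) < 2 or state is None:
            continue
        local_port = addresses[0].rsplit(":", 1)[1]
        if port is not None and local_port != str(port):
            continue
        counts[state] += 1
    return counts


def run_netstat():
    """Return the raw text of `netstat -an` (assumes netstat is on PATH)."""
    result = subprocess.run(["netstat", "-an"], capture_output=True, text=True)
    return result.stdout
```

Logging `count_tcp_states(run_netstat(), port=9876)["CLOSE_WAIT"]` every minute or so would show whether the count climbs steadily before the acceptor stops responding, or jumps at a particular moment.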
When we restart the FIX application, it starts behaving correctly until the next time.
Any suggestions on how we could debug the issue?
We run a pretty vanilla configuration, and use the SSL support built into quickfixn. When we used the old C++ wrapper, we used stunnel for SSL. We could try this to rule out any of the SSL pipeline/processing.
Any thoughts @gbirchmeier @mgatny?
@gbirchmeier - We also run into this issue often
@ririvas may I ask what version of quickfixn you are using? Is it heavily locked down with a firewall?
@andrewrimmer
<PackageReference Include="QuickFix.Net.NETCore" Version="1.8.1" />
[Edit from @gbirchmeier: The above package is an unauthorized release created by a third party. Official QF/n packages start with "QuickFIXn."]
DataDictionary=./spec/fix/FIX44.xml
Our engine sits separately from our internal network but does have strict networking rules. It does sit on an Azure VM.
@ririvas thanks a lot for sharing more info. We ourselves have recently placed further firewall restrictions on how accessible our FIX server is. We are now waiting to see if that has any effect. If you are already pretty locked down yourselves, then it may not help much.
Do you use the SSL layer in quickfixn or is yours separate?
We are using the library in a pretty standard & simple way, and the stability issues have only occurred since we ported from the old quickfix .NET wrapper (over the C++ version) to quickfixn. We did use stunnel as part of that legacy solution, which worked fine.
It would be fantastic to get to the bottom of the issues.
@andrewrimmer - We use the SSL layer in quickfixn but our counterparty uses stunnel on their end. And we're having a hard time reproducing the issue internally.
@ririvas we have only observed this behaviour in production, and cannot reproduce internally.
@andrewrimmer Have you tried using stunnel instead of the built-in SSL support? We may consider that next.
Other notes
Thanks @ririvas.
Yeah, our next step would be switching back to stunnel to rule out the SSL layer.
We use the FileStore and our own elasticsearch log store.
We haven't noticed the issue reoccurring since we improved the security around the endpoint availability over the internet. That could be a coincidence as we sometimes go for weeks without issues, and then may get several consecutive days of issues.
To improve health reporting, we also have a monitor continually checking that you can ping/telnet to the FIX server port(s). We have observed that the FIX server will stop accepting new connections (cannot telnet/psping) while any existing sessions/connections are fine. Once it is in this state, the problem surfaces as soon as we get a new connection attempt or the session end/start causes reconnections. At that point it is completely broken. A full application restart fixes the issue, until the next time.
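A monitor like the one described can be approximated with a plain TCP connect check. This is only a sketch in Python: the host, port and timeout are placeholders, and the function names are made up for illustration. Note that it only exercises the TCP handshake, not the TLS handshake or a FIX logon.

```python
import socket
from datetime import datetime, timezone


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_line(host, port, timeout=3.0):
    """Build one timestamped log line describing the probe result."""
    status = "OK" if can_connect(host, port, timeout) else "FAILED"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {host}:{port} {status}"
```

Writing `probe_line(...)` to a file at a fixed interval gives a timestamp for when the acceptor stopped taking connections, which can then be lined up against the engine's own logs.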
Hey @andrewrimmer - we saw that the quickfixn code will catch errors and output them to the console. We weren't seeing these messages until we redirected console output. Now we can see the following:
Error accepting connection: Received an unexpected EOF or 0 bytes from the transport stream.
Assume it's coming from the line below
Now we'll see if we can identify a cause with this message
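That message is typically what .NET's `SslStream` reports when the peer closes the connection before the TLS handshake completes, which is what a port scanner or a bare TCP health check would do. As an untested idea for reproducing this internally, and not a confirmed cause, a small Python sketch that connects and hangs up without sending any TLS bytes (the function name, host and port are placeholders; only point it at a test instance):

```python
import socket


def connect_and_drop(host, port, timeout=3.0):
    """Open a TCP connection and close it without sending any TLS bytes."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # No handshake is attempted: the server side just sees the peer go away.
    sock.close()
```

Calling this in a loop against a test acceptor while watching the CLOSE_WAIT count and the console output might show whether dropped handshakes are enough to get the acceptor into the stuck state.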
Did either of you get closer to root-causing this?
We tried going back from using the latest version to the last stable version but this did not resolve the issues. We stopped the SSL provision in QuickFIX and now have stunnel in between the clients and our FIX server. This workaround stopped the issue from occurring.
So it sounds like this issue is specific to Acceptors that use the built-in SSL connectivity. Thanks, that is very helpful.
We have an intermittent issue in production, which causes quickfixn to stop working around the end of the session. The fix server does not respond to connection attempts, and the fix sessions are not created the following day. We have to restart our application for things to start working again.
I am not seeing any clues in our logs, and we are not getting any exceptions reported.
At the end of the session we get the following messages/events: -
Then on a day it is working, we will see connection/logon attempts repeatedly until the sessions open again.
However, if it is not working we will see nothing else in the logs after end time. At this stage we have to restart the application to get things working again.
We are using a version of quickfixn from the master branch of September 2022.
Any idea how we would go about troubleshooting an issue like this?
Is there any more logging we can enable, to perhaps see what might be occurring internally?
If quickfixn throws an exception, how would we log and surface this?
We are using a pretty standard setup of quickfixn, with the in-built SSL, on .NET Framework running on Windows Server. We haven't encountered this problem in dev or test environments, but there we have less variety of connections coming in.
Any help, greatly appreciated.