Closed nopara73 closed 6 years ago
@NicolasDorier I would like to fix this but i need some help. The point is that i cannot configure the @nopara73's test application in order to get the logs, it seems System.Diagnostics classes cannot be configured in dotnet core 2.0 (take a look at this and this). Troubleshooting without logs is a bit hard, i am blind dealing with this.
Do you have any recomendation? I started writting another logging mechanism to replace System.Diagnostics classes but it doesn't look right. (Just as a side note: dotnet core doesn't provide a console listener neither - i can be worng)
As alternative strategies I have a few things I can think of:
I think NBitcoin should use ILogger from .NETStandard. I am unsure when they started doing it.
This ILogger should be injected into the NodeConnectionParameters
. You just need to make sure that builds which does not support ILogger still compile though.
I will take a look to improve logging on NBitcoin.
@nopara73 right now nbitcoin is using AppVeyor. I am not against adding travis support with running all tests on mac os as well.
Add a way to extend log right now...
@lontivero can you use this https://github.com/MetacoSA/NBitcoin/commit/f1dcd9a371cecdbcc7a7f1862c409c9bacc27e36 I added an extension point for logs for environment without TraceSource.
I need to eventually drop TraceSource completely.
@dev0tion if I remember well you are using OSX. Did you ever experienced it? @dangershony @bokobza @Aprogiena ? This is a critical problem, how come we never encountered this before? Even downgrading NBitcoin fails: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug/pull/1
Oh as I see Stratis excludes these tests from CI. Is it possible it never even worked? https://github.com/stratisproject/StratisBitcoinFullNode/blob/master/build.sh
# exclude integration tests
if [[ "$testProject" == *"Integration.Tests"* ]] || [[ "$testProject" == *"IntegrationTests"* ]] || [[ "$testProject" == *"NBitcoin.Tests"* ]] ; then
continue
fi
@nopara73 I fixed a bug in the netcore version. You can now inject a logger through TraceSourceFactory.CreateTraceSourceFactory
at the start of your app. Can you do this ?
By default it is this one, but you can make your own to print into console logs of NBitcoin.
class NullTraceSource : TraceSource
{
string n;
public NullTraceSource(string n)
{
this.n = n;
}
public Switch Switch
{
get;
set;
} = new Switch();
public void TraceEvent(TraceEventType traceEventType, int eventId, string msg, object[] args)
{
}
public void TraceEvent(TraceEventType traceEventType, int eventId, string msg)
{
}
public void TraceInformation(string msg)
{
}
public void TraceTransfer(int eventId, string p2, Guid activity)
{
}
}
Logging was @lontivero's request. @lontivero Can you progress with this logging? I am not sure it helps me.
I'm trying to go on another route, trying to get an OSX virtualmachine, if possible, so I'll be able to debug into the code.
@nopara73 yeah we exclude integration tests from the builds on Travis as it's sometimes unreliable. Better run it on a real Mac and see what you get :cold_sweat:
as it's sometimes unreliable
By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail). The ideal thing to do in this situation is finding which of the tests are flaky and disable them, not disable the whole test suite.
By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail).
Raw code should never fail. But code that communicates with a server can fail sometimes. Code that communicates with a peer to peer network is even more unreliable.
Anyway I got an OSX virtualmachine set up, so let's see.
TLDR: The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with 8GB.
@NicolasDorier @lontivero Just to further complicate the issue: it works fine with my virtual machine: OSX High Sierra, the same what the CI uses. (OS X 10.12.6) Any idea what's going on?
Update
I lowered the requirements to match exactly of the CI. And it failed this time. I'll try to repeat it see if I get consistent results and report back:
8GB memory -> 4GB memory
4 CPU -> 2 CPU
Increased the CPU count and memory back: it works again (tested 2 times.)
I decreased the CPU count and memory to CI levels and it failed again.
I think we can say it's consistent now. Let's figure out if it's CPU or memory.
I elevated the CPU count (without memory) and it failed.
I set back the CPU count to 2 and elevated the memory to 8GB and it worked.
I lowered the memory to 5GB and it failed.
6GB failed.
7GB failed.
8GB worked again.
Raw code should never fail.
I guess by Raw code you mean unit tests. If yes, I agree.
But code that communicates with a server
An integration test (instead of unit), which should be placed in a different test suite.
can fail sometimes
If the server is not part of the test fixture setup, yes. Ideally, integration tests should set up and bootstrap all the elements they will need themselves. When this doesn't happen, it's when flaky tests start to appear.
The summary of my testing of under what conditions this bug is present: OSX under 8GB memory.
what @nopara73 could you check logs ?
@NicolasDorier I cannot find the logs. Where should I look for them?
inject TraceSourceFactory.CreateTraceSourceFactory
with an instance of Func<TraceSource>
. (see class NullTraceSource to see an empty implementation)
Meh, I don't get DI. Could you make a PR? I guess it's a one liner: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug/blob/master/NBitcoinP2pOsxBug/Program.cs
@NicolasDorier @lontivero Here you go I've found the problem, not sure about the solution yet.
An exception is thrown here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L561
var node = Node.Connect(network, addr.Endpoint, param2);
Which is catched and swallowed here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L569-L572
catch(SocketException)
{
parameters.ConnectCancellation.WaitHandle.WaitOne(500);
}
With this, a node.Connect() function never returns.
The exception is: "no socket buffer space available."
It seems that this catch block is swallowing too much. The SocketException class has an ErrorCode property that can make error handling more specific. The commit that introduced the call to WaitOne(500) should have made sure it only did it for the specific errorCode that it was expecting, instead of all error codes. Any chance this errorCode can be reproduced again?
@knocte The full exception, this exception is thrown by a million times:
System.Net.Sockets.SocketException (0x80004005): No buffer space available
at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue, Boolean silent)
at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue)
at System.Net.Sockets.Socket.set_ReceiveBufferSize(Int32 value)
at NBitcoin.Protocol.Node..ctor(NetworkAddress peer, Network network, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 667
at NBitcoin.Protocol.Node.Connect(Network network, IPEndPoint endpoint, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 634
at NBitcoin.Protocol.Node.Connect(Network network, NodeConnectionParameters parameters, IPEndPoint[] connectedEndpoints, Func`2 getGroup) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 561
@wintercooled tried it on 4GB Sierra High and it worked for him. Maybe this problem is specific for virtual machines?
In my short experience with OSX i could see that it imposes more strict limits in many aspects. I had to deal with ulimit (max opened files) days ago. In this case the problem seems to be the sockets send/receive buffers.
I've changed hardcoded smaller buffer size values and the @nopara73 test now pass green in travis.
I must make the test with the appropiate NodeConnectionParameters
values and try again. In case it works (it has to) i will let you know then, @NicolasDorier can dicatate the way to go (probably reducing the default values in NodeConnectionParameters
I can make the tests pass simply changing the NodeConnectionParameters parameters (below my change):
- var connectionParameters = new NodeConnectionParameters();
+ var connectionParameters = new NodeConnectionParameters()
+ {
+ ReceiveBufferSize = 100 * 5000,
+ SendBufferSize = 100 * 1000
+ };
The article Mac OSX Tuning says that
Starting with OSX 10.5, Apple added "self tuning TCP", and increased the default maximum socket buffer to 4MB
The article also says that the default new values for Settings for OSX Yosemite/Mavericks/Sierra are:
net.inet.tcp.win_scale_factor: 3 (or 5 in newer versions) net.inet.tcp.autorcvbufmax: 1048576 (or 2097152 in newer versions ) net.inet.tcp.autosndbufmax: 1048576
in NBitcoin, the default values are hardcoded as follow:
ReceiveBufferSize = 1000 * 5000;
SendBufferSize = 1000 * 1000;
IMO, even when NBitcoin works perfectly well and can be parametrized easily, I think those defaults values should be modify in a way that osx programmers don't have to deal with this. I'm not sure what the best strategy is (changing the values to fit the known max limits or let the default values or other)
This is not a blocker for us.
I think those defaults values should be modify in a way that osx programmers don't have to deal with this
This is a good suggestion. Either way, the catch block should be modified to rethrow this exception in this case, so that it stops failing silently and the user can adjust these values manually in case he needs to, to make it work.
@lontivero can you check that setting both to 1048576 works fine? Also is there an exception thrown if it is above ?
Made a commit which should fix it, can you try just to be sure ?
Can you also check if MacOS is throwing exception if it is above that I can catch?
@lontivero last question, can you also try
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);
with size being above the threshold?
@NicolasDorier it works with the the documented max values. Here you can see my commit and the travis CI build is still green
About you last question, I am working with linux right now (this env is new for me) and i don't have ILSpy but I am 99.99% sure that socket.ReceiveBufferSize = an_int_value
is the same that socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);
I am wondering if an exception is thrown so I can attempt to cap the value and retry at Node.cs level in case it is higher.
I have no way to test it. I just believe that what @nopara73 said it accurate and it is swallowing the exceptions.
And yes, I was right, ReceiveBufferSize
is implemented using SetSocketOption
so, there is no need to test it. Here the source: https://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,635
Thanks for your testing, I am publishing new NBitcoin version now.
Ok pushed 4.0.0.57
should be available in 10-60 minutes on nuget.
Closing this, reopen if problem persist
Update. The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with or over 8GB.
I created a repository that reproduces this issue with a few lines of code in the
Program.cs
.Repository: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug
I also added travis CI integration that shows it runs on Linux, but it doesn't run on OSX. (It runs on Windows, too.)
Travis builds: https://travis-ci.org/nopara73/NBitcoinBitcoinP2pOsxBug/builds/344928253