MetacoSA / NBitcoin

Comprehensive Bitcoin library for the .NET framework.
MIT License
1.88k stars 847 forks source link

NBitcoin doesn't connect to the Bitcoin P2P network only on OSX under 8GB memory #345

Closed nopara73 closed 6 years ago

nopara73 commented 6 years ago

Update. The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with or over 8GB.

I created a repository that reproduces this issue with a few lines of code in the Program.cs.
Repository: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug

I also added travis CI integration that shows it runs on Linux, but it doesn't run on OSX. (It runs on Windows, too.)

Travis builds: https://travis-ci.org/nopara73/NBitcoinBitcoinP2pOsxBug/builds/344928253

lontivero commented 6 years ago

@NicolasDorier I would like to fix this but i need some help. The point is that i cannot configure the @nopara73's test application in order to get the logs, it seems System.Diagnostics classes cannot be configured in dotnet core 2.0 (take a look at this and this). Troubleshooting without logs is a bit hard, i am blind dealing with this.

Do you have any recomendation? I started writting another logging mechanism to replace System.Diagnostics classes but it doesn't look right. (Just as a side note: dotnet core doesn't provide a console listener neither - i can be worng)

nopara73 commented 6 years ago

As alternative strategies I have a few things I can think of:

NicolasDorier commented 6 years ago

I think NBitcoin should use ILogger from .NETStandard. I am unsure when they started doing it. This ILogger should be injected into the NodeConnectionParameters. You just need to make sure that builds which does not support ILogger still compile though.

NicolasDorier commented 6 years ago

I will take a look to improve logging on NBitcoin.

NicolasDorier commented 6 years ago

@nopara73 right now nbitcoin is using AppVeyor. I am not against adding travis support with running all tests on mac os as well.

NicolasDorier commented 6 years ago

Add a way to extend log right now...

NicolasDorier commented 6 years ago

@lontivero can you use this https://github.com/MetacoSA/NBitcoin/commit/f1dcd9a371cecdbcc7a7f1862c409c9bacc27e36 I added an extension point for logs for environment without TraceSource.

I need to eventually drop TraceSource completely.

nopara73 commented 6 years ago

@dev0tion if I remember well you are using OSX. Did you ever experienced it? @dangershony @bokobza @Aprogiena ? This is a critical problem, how come we never encountered this before? Even downgrading NBitcoin fails: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug/pull/1

nopara73 commented 6 years ago

Oh as I see Stratis excludes these tests from CI. Is it possible it never even worked? https://github.com/stratisproject/StratisBitcoinFullNode/blob/master/build.sh

# exclude integration tests
if [[ "$testProject" == *"Integration.Tests"* ]] || [[ "$testProject" == *"IntegrationTests"* ]] || [[ "$testProject" == *"NBitcoin.Tests"* ]] ; then
    continue
fi
NicolasDorier commented 6 years ago

@nopara73 I fixed a bug in the netcore version. You can now inject a logger through TraceSourceFactory.CreateTraceSourceFactory at the start of your app. Can you do this ?

By default it is this one, but you can make your own to print into console logs of NBitcoin.

class NullTraceSource : TraceSource
    {
        string n;
        public NullTraceSource(string n)
        {
            this.n = n;
        }
        public Switch Switch
        {
            get;
            set;
        } = new Switch();

        public void TraceEvent(TraceEventType traceEventType, int eventId, string msg, object[] args)
        {

        }

        public void TraceEvent(TraceEventType traceEventType, int eventId, string msg)
        {

        }

        public void TraceInformation(string msg)
        {

        }

        public void TraceTransfer(int eventId, string p2, Guid activity)
        {

        }
    }
nopara73 commented 6 years ago

Logging was @lontivero's request. @lontivero Can you progress with this logging? I am not sure it helps me.

I'm trying to go on another route, trying to get an OSX virtualmachine, if possible, so I'll be able to debug into the code.

bokobza commented 6 years ago

@nopara73 yeah we exclude integration tests from the builds on Travis as it's sometimes unreliable. Better run it on a real Mac and see what you get :cold_sweat:

knocte commented 6 years ago

as it's sometimes unreliable

By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail). The ideal thing to do in this situation is finding which of the tests are flaky and disable them, not disable the whole test suite.

nopara73 commented 6 years ago

By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail).

Raw code should never fail. But code that communicates with a server can fail sometimes. Code that communicates with a peer to peer network is even more unreliable.

Anyway I got an OSX virtualmachine set up, so let's see.

nopara73 commented 6 years ago

TLDR: The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with 8GB.

@NicolasDorier @lontivero Just to further complicate the issue: it works fine with my virtual machine: OSX High Sierra, the same what the CI uses. (OS X 10.12.6) Any idea what's going on?

Update

I lowered the requirements to match exactly of the CI. And it failed this time. I'll try to repeat it see if I get consistent results and report back:
8GB memory -> 4GB memory
4 CPU -> 2 CPU

Increased the CPU count and memory back: it works again (tested 2 times.)
I decreased the CPU count and memory to CI levels and it failed again.

I think we can say it's consistent now. Let's figure out if it's CPU or memory.
I elevated the CPU count (without memory) and it failed.
I set back the CPU count to 2 and elevated the memory to 8GB and it worked.
I lowered the memory to 5GB and it failed.
6GB failed.
7GB failed.
8GB worked again.

knocte commented 6 years ago

Raw code should never fail.

I guess by Raw code you mean unit tests. If yes, I agree.

But code that communicates with a server

An integration test (instead of unit), which should be placed in a different test suite.

can fail sometimes

If the server is not part of the test fixture setup, yes. Ideally, integration tests should set up and bootstrap all the elements they will need themselves. When this doesn't happen, it's when flaky tests start to appear.

nopara73 commented 6 years ago

The summary of my testing of under what conditions this bug is present: OSX under 8GB memory.

NicolasDorier commented 6 years ago

what @nopara73 could you check logs ?

nopara73 commented 6 years ago

@NicolasDorier I cannot find the logs. Where should I look for them?

NicolasDorier commented 6 years ago

inject TraceSourceFactory.CreateTraceSourceFactory with an instance of Func<TraceSource>. (see class NullTraceSource to see an empty implementation)

nopara73 commented 6 years ago

Meh, I don't get DI. Could you make a PR? I guess it's a one liner: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug/blob/master/NBitcoinP2pOsxBug/Program.cs

nopara73 commented 6 years ago

@NicolasDorier @lontivero Here you go I've found the problem, not sure about the solution yet.

An exception is thrown here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L561

var node = Node.Connect(network, addr.Endpoint, param2);

Which is catched and swallowed here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L569-L572

catch(SocketException)
{
    parameters.ConnectCancellation.WaitHandle.WaitOne(500);
}

With this, a node.Connect() function never returns.

The exception is: "no socket buffer space available."

knocte commented 6 years ago

It seems that this catch block is swallowing too much. The SocketException class has an ErrorCode property that can make error handling more specific. The commit that introduced the call to WaitOne(500) should have made sure it only did it for the specific errorCode that it was expecting, instead of all error codes. Any chance this errorCode can be reproduced again?

nopara73 commented 6 years ago

@knocte The full exception, this exception is thrown by a million times:

System.Net.Sockets.SocketException (0x80004005): No buffer space available
   at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
   at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue, Boolean silent)
   at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue)
   at System.Net.Sockets.Socket.set_ReceiveBufferSize(Int32 value)
   at NBitcoin.Protocol.Node..ctor(NetworkAddress peer, Network network, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 667
   at NBitcoin.Protocol.Node.Connect(Network network, IPEndPoint endpoint, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 634
   at NBitcoin.Protocol.Node.Connect(Network network, NodeConnectionParameters parameters, IPEndPoint[] connectedEndpoints, Func`2 getGroup) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 561
nopara73 commented 6 years ago

@wintercooled tried it on 4GB Sierra High and it worked for him. Maybe this problem is specific for virtual machines?

lontivero commented 6 years ago

In my short experience with OSX i could see that it imposes more strict limits in many aspects. I had to deal with ulimit (max opened files) days ago. In this case the problem seems to be the sockets send/receive buffers.

I've changed hardcoded smaller buffer size values and the @nopara73 test now pass green in travis.

I must make the test with the appropiate NodeConnectionParameters values and try again. In case it works (it has to) i will let you know then, @NicolasDorier can dicatate the way to go (probably reducing the default values in NodeConnectionParameters

lontivero commented 6 years ago

I can make the tests pass simply changing the NodeConnectionParameters parameters (below my change):

-             var connectionParameters = new NodeConnectionParameters();
+            var connectionParameters = new NodeConnectionParameters()
+            {
+                ReceiveBufferSize = 100 * 5000,
+                SendBufferSize = 100 * 1000
+            };

The article Mac OSX Tuning says that

Starting with OSX 10.5, Apple added "self tuning TCP", and increased the default maximum socket buffer to 4MB

The article also says that the default new values for Settings for OSX Yosemite/Mavericks/Sierra are:

net.inet.tcp.win_scale_factor: 3 (or 5 in newer versions) net.inet.tcp.autorcvbufmax: 1048576 (or 2097152 in newer versions ) net.inet.tcp.autosndbufmax: 1048576

in NBitcoin, the default values are hardcoded as follow:

ReceiveBufferSize = 1000 * 5000;  
SendBufferSize = 1000 * 1000;

IMO, even when NBitcoin works perfectly well and can be parametrized easily, I think those defaults values should be modify in a way that osx programmers don't have to deal with this. I'm not sure what the best strategy is (changing the values to fit the known max limits or let the default values or other)

This is not a blocker for us.

knocte commented 6 years ago

I think those defaults values should be modify in a way that osx programmers don't have to deal with this

This is a good suggestion. Either way, the catch block should be modified to rethrow this exception in this case, so that it stops failing silently and the user can adjust these values manually in case he needs to, to make it work.

NicolasDorier commented 6 years ago

@lontivero can you check that setting both to 1048576 works fine? Also is there an exception thrown if it is above ?

NicolasDorier commented 6 years ago

Made a commit which should fix it, can you try just to be sure ?

NicolasDorier commented 6 years ago

Can you also check if MacOS is throwing exception if it is above that I can catch?

NicolasDorier commented 6 years ago

@lontivero last question, can you also try

socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);

with size being above the threshold?

lontivero commented 6 years ago

@NicolasDorier it works with the the documented max values. Here you can see my commit and the travis CI build is still green

About you last question, I am working with linux right now (this env is new for me) and i don't have ILSpy but I am 99.99% sure that socket.ReceiveBufferSize = an_int_value is the same that socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);

NicolasDorier commented 6 years ago

I am wondering if an exception is thrown so I can attempt to cap the value and retry at Node.cs level in case it is higher.

lontivero commented 6 years ago

I have no way to test it. I just believe that what @nopara73 said it accurate and it is swallowing the exceptions.

And yes, I was right, ReceiveBufferSize is implemented using SetSocketOption so, there is no need to test it. Here the source: https://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,635

NicolasDorier commented 6 years ago

Thanks for your testing, I am publishing new NBitcoin version now.

NicolasDorier commented 6 years ago

Ok pushed 4.0.0.57 should be available in 10-60 minutes on nuget.

NicolasDorier commented 6 years ago

Closing this, reopen if problem persist