barchart / barchart-udt

Java wrapper for native C++ UDT protocol.
https://github.com/barchart/barchart-udt/wiki

benchmarks #21

Open carrot-garden opened 11 years ago

carrot-garden commented 11 years ago

@CCob in case you care to look, I started putting Caliper benchmarks here: https://github.com/barchart/netty-udt/tree/bench/bench. The results at http://microbenchmarks.appspot.com/user/Andrei.Pozolotin@gmail.com/ (tcp.NativeXferBench vs udt.NativeXferBench) show that crossing into JNI costs about 10 times more for UDT than for TCP (roughly 5000 ns vs 500 ns); if you have any ideas, please let me know :-)
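
for reference, the general shape of such a Caliper 0.5 benchmark looks roughly like this (names here are illustrative, not the actual bench classes; creating the loopback UDT connection and a thread that drains the receiving side are elided):

    // Hypothetical sketch of a Caliper 0.5 benchmark for the JNI send path; not
    // the code in the linked bench branch. Setting up the loopback connection
    // (bind/listen/connect/accept) and a draining reader are omitted for brevity.
    import com.barchart.udt.SocketUDT;
    import com.google.caliper.Runner;
    import com.google.caliper.SimpleBenchmark;

    public class JniXferBench extends SimpleBenchmark {

        private SocketUDT sender;                      // connected UDT socket (setup elided)
        private final byte[] payload = new byte[1460];

        /** Caliper times 'reps' blocking send() calls, i.e. repeated JNI crossings. */
        public void timeUdtSend(final int reps) {
            try {
                for (int i = 0; i < reps; i++) {
                    sender.send(payload);
                }
            } catch (final Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(final String[] args) {
            Runner.main(JniXferBench.class, args);
        }
    }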

CCob commented 11 years ago

It could be to do with the slow-start algorithm inside UDT. Have you tried delaying the timing routine until the slow-start phase is over? I'm not sure whether that is easy to determine from Java; you could simply transfer data for, say, 5-10 seconds and then start the actual benchmark.
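
Roughly what I mean, as a sketch (assuming a blocking SocketUDT send loop; 'sender' and 'payload' are placeholders for whatever the benchmark already uses):

    // Push data for a few seconds so UDT's slow start has converged before the
    // measured transfer begins; nothing in this loop is timed.
    import com.barchart.udt.SocketUDT;

    public final class WarmUp {

        public static void run(final SocketUDT sender, final byte[] payload,
                final long warmUpMillis) throws Exception {
            final long deadline = System.currentTimeMillis() + warmUpMillis;
            while (System.currentTimeMillis() < deadline) {
                sender.send(payload); // only purpose is to exit slow start
            }
        }
    }

    // usage: WarmUp.run(sender, payload, 10000); then start the timed benchmark loop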

carrot-garden commented 11 years ago

good point. I will take a look.

CCob commented 11 years ago

I noticed you made a 2.2 release with some benchmark changes beforehand; was slow start the culprit?

carrot-garden commented 11 years ago

0) no, release is driven by netty

1) "slow start the culprit" is still under review

more food for thought for you: this bench https://github.com/barchart/netty-udt/blob/master/transport-udt/src/test/java/io/netty/transport/udt/bench/xfer/UdtNative.java

results show: http://microbenchmarks.appspot.com/run/Andrei.Pozolotin@gmail.com/io.netty.transport.udt.bench.xfer.UdtNative

that netty does fulfill its promise and gives 20 MB/sec bandwidth with 30 ms network latency and 100K sized messages.

I looked at latencies from 0 to 500 ms: 200 MB/sec @ 0 ms becomes 20 MB/sec @ 5 ms and stays that way until 500 ms, then starts to decline again slowly.

However, it raises some questions:

2) how can we bring up plateau/limit above 20 MB/sec?

3) how can we improve performance for small message sizes?

CCob commented 11 years ago

I have done some benchmarks of my own, and it seems there are definite performance issues. I compared the output from the Java appclient with the equivalent C++ app from the UDT library: CWnd in Java remains very low compared to the C++ version, and usPktSndPeriod is much higher in Java than in the C++ counterpart.

I'm looking into it further and will let you know what I find.

carrot-garden commented 11 years ago

great. thank you for the update.

CCob commented 11 years ago

I think your original theory about crossing the JNI boundary might be correct. I have a feeling that the latency involved, especially with the byte[] versus ByteBuffer JNI send functions, is affecting UDT's congestion control. I'm looking through the OpenJDK now to see how it handles send/recv calls, but it wouldn't surprise me if HotSpot doesn't actually use JNI for those calls and instead emits inline JIT code when it sees calls to the native send/recv functions, in a similar fashion to how it handles put calls on direct ByteBuffers.

CCob commented 11 years ago

http://hg.openjdk.java.net/jdk6/jdk6-gate/jdk/file/f4bdaaa86ea8/src/windows/native/java/net/SocketOutputStream.c

Here is OpenJDK's implementation of OutputStream over a socket, which seems pretty standard to be honest, so at this point I am a little unsure why TCP performs better.

carrot-garden commented 11 years ago

hmm... when you compared appserver+appclient, C++ vs Java, did you build them with the same options that NAR uses?

carrot-garden commented 11 years ago

BTW I just remembered another possible performance issue: UDT is a pig and allocates 2 native threads for each socket (snd/rcv queue). Question: is there a small/easy/portable C++ thread pool library for that?

CCob commented 11 years ago

No, the compile options were the defaults from the UDT sources.

In regards to the threads, at the moment I am only testing a 1<-->1 connection, so I don't think that is the issue, but it certainly won't scale well to hundreds of connections over UDT.

One thing I have noticed is that the default send buffer size is 64k, which means this is the maximum you will be able to transfer in one go from Java->C++ before returning to Java. I will try increasing this to a larger buffer tomorrow to see if it has an effect on performance.

Interestingly enough, I used the C++ appclient with the Java appserver and performance was the same as with the C++ appserver, so the bottleneck is in the sender, not the receiver. That also points to buffer size, since the default receive buffer size is in excess of 10MB on a UDT socket, which means you transfer much more data in one JNI call when reading from a socket than when writing to it.
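
If anyone wants to experiment with those buffers, something like this should do it (assuming OptionUDT mirrors the native UDT_SNDBUF/UDP_SNDBUF option names; only UDT_MSS is confirmed in this thread, and 'socketUDT' is a placeholder for the wrapper socket):

    // Assumption: OptionUDT exposes the native buffer options under their native
    // names. Raises the send-side buffers so more data crosses JNI per call.
    socketUDT.setOption(OptionUDT.UDT_SNDBUF, 10 * 1024 * 1024); // UDT-level send buffer
    socketUDT.setOption(OptionUDT.UDP_SNDBUF, 10 * 1024 * 1024); // underlying UDP send buffer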

carrot-garden commented 11 years ago

great info; I will also look into this.

CCob commented 11 years ago

I have found the cause of the performance issue I was seeing on Windows. It's related to whether Windows takes the Fast IO path, which depends on the datagram packet size configured via the registry key HKLM\System\CurrentControlSet\Services\Afd\Parameters\FastSendDatagramThreshold.

UDT defaults its MSS to 1500. This causes problems for UDT's congestion control because of buffering inside Windows when the Fast IO path is not used.

By changing the MSS option on the UDT socket to a smaller value I get much better performance.

socket.socketUDT().setOption(OptionUDT.UDT_MSS, 1052);

The big clue was in the appclient.cpp from the UDT code.

CCob commented 11 years ago

With the above in place, on a local LAN I can get 400Mb/s using the Java appclient and 460Mb/s using the native C++ counterpart.

Analyzing the CPU, in both cases it maxes out a single core at 100% due to the single SndQueue thread managing the congestion control and packet sending.

So it seems there is around a 15% decrease in performance when using Java, and I imagine this is down to the JNI layer transitions and the conversion from byte[] and ByteBuffer to char* in C++. Smaller, more frequent transfers across the JNI layer are likely to cause more issues than larger application-level buffers, but I have not confirmed this to be the case.

So at this stage I am happy with a 15% decrease versus UDT's native counterpart, and I can't see how we can improve on it with the current methodology of utilizing UDT at the C++ layer.

carrot-garden commented 11 years ago

re: "OptionUDT.UDT_MSS, 1052" - silly me, I actually already ran into this but then forgot! :-)

carrot-garden commented 11 years ago

should we set "OptionUDT.UDT_MSS, 1052" by default when we detect Windows?
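
something along these lines, maybe (the system property name is hypothetical; 1052/1500 are the values discussed above):

    // Hypothetical sketch: default UDT_MSS to 1052 on Windows unless the user
    // overrides it (for example after raising FastSendDatagramThreshold in the
    // registry).
    import com.barchart.udt.OptionUDT;
    import com.barchart.udt.SocketUDT;

    public final class MssDefaults {

        public static void apply(final SocketUDT socket) throws Exception {
            final boolean isWindows =
                    System.getProperty("os.name", "").toLowerCase().contains("win");
            final int mss = Integer.getInteger("barchart.udt.mss",
                    isWindows ? 1052 : 1500); // 1500 is UDT's built-in default
            socket.setOption(OptionUDT.UDT_MSS, mss);
        }
    }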

carrot-garden commented 11 years ago

re: "Analyzing the CPU, in both cases it maxes 100% CPU" what do you use to profile c++?

carrot-garden commented 11 years ago

re: "at this stage I am happy with a 15% decrease" - did you try direct buffers instead of arrays? there is no copy involved with direct buffers.
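
for illustration, the difference looks roughly like this ('sender' is a placeholder SocketUDT, import java.nio.ByteBuffer; the ByteBuffer send variant is assumed to want a direct buffer):

    // With a heap byte[] the JNI layer generally has to copy the array contents;
    // with a direct ByteBuffer the native send can read the buffer's memory directly.
    final byte[] heapPayload = new byte[64 * 1024];
    sender.send(heapPayload);                    // copy at the JNI boundary

    final ByteBuffer direct = ByteBuffer.allocateDirect(heapPayload.length);
    direct.put(heapPayload);
    direct.flip();                               // prepare for reading
    sender.send(direct);                         // no copy: native code reads the direct memory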

carrot-garden commented 11 years ago

re: "on a local LAN I can get 400Mb/s" did you try to introduce delays? see TrafficControl.java
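
TrafficControl.java is not shown here; as a rough illustration of the netem idea behind 'tc' (hypothetical helper, needs root and the 'tc' binary):

    import java.io.IOException;

    public final class NetemDelay {

        // Hypothetical helper, not the contents of TrafficControl.java: adds (or
        // replaces) an artificial one-way delay on the given interface via netem.
        public static void setDelay(final String iface, final int delayMillis)
                throws IOException, InterruptedException {
            final Process process = new ProcessBuilder("tc", "qdisc", "replace", "dev",
                    iface, "root", "netem", "delay", delayMillis + "ms")
                    .inheritIO().start();
            if (process.waitFor() != 0) {
                throw new IOException("tc failed for interface " + iface);
            }
        }
    }

    // usage: NetemDelay.setDelay("eth0", 100);  // ~100 ms added delay on eth0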

CCob commented 11 years ago

Hmm, good question. Well, you could get further performance out of Windows by adjusting FastSendDatagramThreshold, which would give you the best of both worlds; but if you force the MSS to 1052 you won't see any benefit from setting FastSendDatagramThreshold to something higher.

Perhaps the default should be to set it to 1052, with a property you can set to override this behavior for users who have raised the FastSendDatagramThreshold value.

"what do you use to profile c++" - in this particular case I simply used Performance Monitor on Windows, but for analysis of specific hot spots I use AMD CodeAnalyst.

"did you try direct buffers instead of arrays" - Yes, I updated the appclient to use direct ByteBuffers instead of byte[]; in fact I had already done that before finding the MSS issue, so it may be worse with byte[].

"did you try to introduce delays" - No, this was a simple 1<---->1 connection.

CCob commented 11 years ago

Do you know of any Windows equivalent to 'tc' so that I can test the C++/Java appclient with latency introduced?

CCob commented 11 years ago

I'll give WANem a try since it's a LiveCD and no installation is necessary.

carrot-garden commented 11 years ago

WANem is probably the easiest to get started with. This is probably more current; or try Cisco NIST, or make your own Linux box with netem, or mess with MSVC NEWT.

carrot-garden commented 11 years ago

got an answer from Yunhong Gu re: "UDT is a pig and allocates 2 native threads for each socket":

    You can share these sockets on the same port, unless you have a reason not to.
    Two threads are created for each UDP port you open, not for each UDT socket.
    Thus, you can run 110 sockets with only two threads.
    Create sockets with the UDT_REUSEADDR option, then explicitly bind them to the same port.
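
a hedged sketch of that advice (OptionUDT.UDT_REUSEADDR is assumed to mirror the native option name; exception handling omitted; imports: java.net.InetSocketAddress and com.barchart.udt.SocketUDT/TypeUDT/OptionUDT):

    // Bind several UDT sockets to one UDP port so they share a single snd/rcv
    // thread pair. TypeUDT.DATAGRAM is just an example; STREAM works the same way.
    final InetSocketAddress local = new InetSocketAddress("0.0.0.0", 12345);

    final SocketUDT first = new SocketUDT(TypeUDT.DATAGRAM);
    first.setOption(OptionUDT.UDT_REUSEADDR, true);
    first.bind(local);   // opens the UDP port and spawns the snd/rcv threads

    final SocketUDT second = new SocketUDT(TypeUDT.DATAGRAM);
    second.setOption(OptionUDT.UDT_REUSEADDR, true);
    second.bind(local);  // reuses the same UDP port, and therefore the same threads
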
CCob commented 11 years ago

So does that ring true for server sockets too? When you accept(), does it automatically create a new UDP binding on a different UDP port, or does it bind to the same port the server is listening on?

CCob commented 11 years ago

I don't have concrete numbers yet, but using a VM running WANulator, the C++ and Java appclients seem to perform roughly the same once latency is introduced. Below are tests where WANulator added a latency of around 180 ms, which is roughly what I get for pings to servers in LA from the UK. First is the C++ appclient, followed by the Java appclient:

SendRate(Mb/s)  RTT(ms) CWnd    PktSndPeriod(us)        RecvACK RecvNAK WT(us)
1.49745         107.165 140     1                       5       0       10472
49.3417         148.746 4843    68                      12      15      126456
71.8588         164.734 9034    111                     5       4       26502
84.2646         178.534 16821   101.5                   127     0       7780
92.2888         178.659 16155   93                      123     0       8888
89.2824         178.448 17440   97.5                    101     1       8971
85.266          178.547 14915   102                     105     1       9431
97.3497         178.815 16057   74                      119     0       8824
88.4056         178.791 14981   98.5                    97      3       9121
70.3212         178.421 12994   126.5                   128     5       10943
75.306          178.57  13172   112.5                   168     0       10939
83.1319         179.047 14637   102.5                   133     0       9876
98.7295         178.668 16231   69                      123     0       8772
67.887          180.442 15589   101                     29      41      7070
46.8308         180.849 13801   144                     9       5       21608
67.2912         178.166 12806   123.5                   164     0       9854
76.8257         178.373 14696   110.5                   155     0       10763
84.9564         178.582 14610   100                     131     0       9724
92.6469         179.613 13664   106                     94      3       8974
87.9044         178.706 15502   97                      113     0       9140
106.947         179.072 17671   63                      125     0       8198
120.859         179.082 18926   77.5                    108     7       6596
120.292         179.815 15440   71.5                    149     0       6856
58.1574         180.32  13998   103.5                   3       5       10724
69.4597         179.003 16476   112                     70      1       16540
83.5923         178.548 14661   102                     129     0       9760
121.404         179.253 19054   64                      124     1       7612
108.217         179.051 12081   87                      106     5       6836
91.3212         179.24  10325   106                     88      3       8781
72.6415         180.456 7750    143.5                   91      10      10699
67.1196         179.62  11208   124                     158     0       12297
81.2635         179.056 17762   84.5                    153     0       10558
SendRate(Mb/s)  RTT(ms) CWnd    PktSndPeriod(us)        RecvACK RecvNAK WT(us)
1.328           126.271 139     1.00                    3       0       576
58.438          166.433 5688    98.00                   10      27      870019
88.962          179.502 13754   130.00                  94      3       47683
50.470          179.859 11765   167.00                  71      4       70412
55.565          179.745 12501   144.00                  102     0       77511
64.542          179.468 13428   111.00                  102     0       68313
103.567         179.973 16219   94.00                   76      6       45019
95.876          179.650 17046   86.00                   102     0       43076
72.805          179.304 13858   114.00                  83      3       54406
76.009          179.207 14465   119.00                  88      1       54885
75.937          179.206 14406   122.00                  99      1       56273
73.538          179.177 16069   125.00                  87      1       57099
72.283          179.612 15761   113.00                  89      0       58004
101.837         180.181 13916   73.00                   102     4       47394
107.397         180.317 15103   79.00                   68      7       39170
112.882         180.095 16755   84.00                   103     1       37799
94.069          179.254 18006   88.00                   86      3       43508
101.700         179.876 15258   151.00                  103     5       41793
59.556          179.229 14828   135.00                  81      3       63533
67.658          179.133 13606   120.00                  98      0       63461
75.546          178.727 14843   108.00                  106     0       56936
83.834          179.274 15462   98.00                   102     0       51374
80.561          179.166 15729   103.00                  87      2       52059
95.743          179.098 17033   69.00                   103     0       47579
128.168         181.139 13627   86.00                   66      32      33337
81.960          179.263 15143   102.00                  82      2       49742
88.537          178.824 16469   93.00                   104     0       48361
85.842          179.647 17032   110.00                  84      2       48394
81.238          179.262 17335   101.00                  86      0       51383
98.530          179.470 17658   66.00                   103     0       46838
69.402          179.635 17015   114.00                  13      15      47876
64.303          181.080 16738   110.00                  32      0       91858
81.962          179.398 16997   100.00                  101     0       52239
89.989          179.099 16303   91.00                   105     0       47614
88.076          179.121 15952   95.00                   85      1       47495
92.841          179.832 15597   95.00                   89      4       45191
76.913          179.314 14689   112.00                  87      2       53004
80.979          179.620 16529   101.00                  104     0       52854

The switching capability of the VM is pretty poor on a laptop in comparison to some real server hardware, so on Monday I will try with some real hardware to see where we go.

carrot-garden commented 11 years ago

re: "had latency of around 180" - let's agree on a common latency ladder for benchmarks?

re: "SendRate(Mb/s)" - is it bytes or bits per second?

re: "using a VM running WANulator" - yes, I think using a VM is no good for benchmarking.

CCob commented 11 years ago

how about increments of 100ms from 0 to 500ms?

the rate is in bits; poor laptop and VM, what can I say.

carrot-garden commented 11 years ago

re: "true for server sockets too?" answer:

    The accept() socket reuses the same port as the listen() socket.
    UDT_REUSEADDR applies to rendezvous sockets too.

CCob commented 11 years ago

Got an update for you on a real machine running WANulator.

I also tried 400ms, 300ms, 200ms and 100ms, with bandwidth increasing at each drop in latency. By the time I come down to the 100ms mark it is already close to the maximum rate for my setup at 0ms, so there is little change between 0-150ms. So I'm not quite sure how you are seeing such a bad performance drop with 20ms latency at present.

carrot-garden commented 11 years ago

very interesting, thank you for sharing.

1) possible source of difference:

2) please try to put your benchmark approach into code so I can reproduce it; take a look at Google Caliper or my examples above so we can publish results in the same format.

3) please try to run my benchmarks in your setup (you will probably need an externalized delay config)

4) re: "little change between 0-150ms" - this is the biggest mystery - I need to get back to it.

carrot-garden commented 11 years ago

still the question remains: how can we move above 20 MB/s = 160 Mb/s saturation?

CCob commented 11 years ago

I am unfamiliar with Caliper, but I'll see what I can do about turning my approach into code.

Regarding saturation @ 500ms latency: it's the same behavior with the native C++ appclient too, so the Java wrapper achieves nearly the same performance as its native counterpart, especially at higher latencies.

If you want to improve on 160Mb/s @ 500ms, I fear it's going to be quite a task, since it comes down to the congestion control algorithm. 160Mb/s @ 500ms is actually impressive if you compare it to TCP, which at 500ms falls to its knees.
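
A hedged back-of-the-envelope aside, not a diagnosis: the bandwidth-delay product at the largest latency tested happens to match the default receive buffer size mentioned earlier, so buffer and flow-window limits may be worth ruling out before blaming the congestion control algorithm alone.

    // bandwidth-delay product at the observed plateau and the largest latency tested
    final double bandwidthBytesPerSec = 20 * 1024 * 1024; // ~20 MB/s plateau
    final double rttSeconds = 0.5;                        // 500 ms latency
    final double inFlightBytes = bandwidthBytesPerSec * rttSeconds;
    System.out.printf("in flight: %.1f MB%n", inFlightBytes / (1024 * 1024));
    // prints ~10.0 MB, about the default UDT receive buffer size noted above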

carrot-garden commented 11 years ago

for caliper 0.5, look here: http://code.google.com/p/caliper/

or, we build current snapshot 1.0 https://github.com/barchart/barchart-caliper and publish it here https://oss.sonatype.org/content/groups/public/com/barchart/bench/barchart-caliper/

CCob commented 11 years ago

So currently barchart-udt on GitHub is not using Caliper, is that right? You seem to be using Metrics by Yammer. Which do you plan on using in the future?

carrot-garden commented 11 years ago

I do use both caliper and metrics, and plan to continue to use both.

look here for example https://github.com/barchart/barchart-udt/tree/master/barchart-udt-netty4/src/test/java/io/netty/transport/udt/bench

caliper pro:

caliper con:

metrics pro:

metrics con:

so: I married metrics to caliper: https://github.com/barchart/barchart-udt/blob/master/barchart-udt-netty4/src/test/java/io/netty/transport/udt/util/CaliperMeasure.java
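
the actual glue is in CaliperMeasure above; as a rough, hypothetical illustration of the idea (class and metric names made up), Caliper drives the timing loop while a Metrics meter records throughput inside it:

    import java.util.concurrent.TimeUnit;

    import com.google.caliper.Runner;
    import com.google.caliper.SimpleBenchmark;
    import com.yammer.metrics.Metrics;
    import com.yammer.metrics.core.Meter;

    public class MeteredXferBench extends SimpleBenchmark {

        private final Meter rate = Metrics.newMeter(
                MeteredXferBench.class, "bytes-sent", "bytes", TimeUnit.SECONDS);

        private final byte[] payload = new byte[64 * 1024];

        /** Caliper measures wall time; the meter keeps a rate to report alongside it. */
        public void timeTransfer(final int reps) {
            for (int i = 0; i < reps; i++) {
                // sender.send(payload);  // actual UDT send elided
                rate.mark(payload.length);
            }
        }

        public static void main(final String[] args) {
            Runner.main(MeteredXferBench.class, args);
        }
    }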

I suggest you spend some time with both, you may like them.

also: I am advising the netty project to adopt my approach, which it seems they will do.

caveat: both caliper and metrics are currently in the middle of major release changes; caliper 1.0 and metrics 3.0 will be rather different from the current releases, but in a month or so the dust should settle.

CCob commented 11 years ago

I noticed the caliper/metrics marriage uses a similar benchmark methodology to BenchXferOne in udt-core. Off the top of your head, are you reporting the send rate or the receive rate in the caliper benchmark?

If it's the former, then the results are likely to be skewed somewhat. Correct me if I am wrong here, but it looks like you are marking the bytes sent from the send function; I believe all this is doing is benchmarking the rate at which you transfer into UDT's send buffer, not the actual send rate of the socket.

The receive side is a more realistic measurement, since it actually measures bytes received into the application layer against time.
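
A minimal sketch of that receive-side measurement ('receiver' is a placeholder SocketUDT, receive(byte[]) is assumed to return the byte count, TOTAL_BYTES is the size of the test transfer):

    // Rate computed from bytes actually delivered to the application layer,
    // not from bytes handed to UDT's send buffer.
    final byte[] chunk = new byte[64 * 1024];
    long received = 0;
    final long start = System.nanoTime();
    while (received < TOTAL_BYTES) {
        final int count = receiver.receive(chunk); // blocking read from the UDT socket
        if (count <= 0) {
            break; // treat close/error as end of transfer
        }
        received += count;
    }
    final double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("receive rate: %.1f MB/s%n", received / seconds / (1024 * 1024));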

Is there any particular reason you are not using the Monitor on UDT?

carrot-garden commented 11 years ago

re: "are you reporting the send rate or the receive rate" - currently the send rate; I should change it to report both.

CCob commented 11 years ago

The send rate will certainly be skewed unless you read it from the UDT monitor rather than calculating it at the application level. Personally I wouldn't even bother reporting it, since it's not a true representation of the actual data rate down the socket.

ylangisc commented 10 years ago

Hi,

Just found this discussion concerning performance. As I'm doing some very basic performance measurements in my environment, I'd like to share my results and hope to get some comments/answers to my question.

Setup:

Client: OS X, 1.5 MBit upload (best effort)
Server: AWS EC2 t2.medium instance (Amazon Linux)

What I was trying to find out is how much faster a UDT transfer of a big file (32MB) is than a simple SCP. To my surprise, the SCP transfer was faster.

UDT: 187s SCP: 155s

Am I missing anything in my code? Is there anything I can do to further accelerate the upload (multiple connections, ...)? I expected UDT to be much faster than SCP.

Client and server code below (based on netty 4.0.23.Final).

Server:

    public static void main(String[] args) throws Exception {
        final NioEventLoopGroup acceptGroup =
                new NioEventLoopGroup(1, new DefaultThreadFactory("accept"), NioUdtProvider.MESSAGE_PROVIDER);
        final NioEventLoopGroup connectGroup =
                new NioEventLoopGroup(1, new DefaultThreadFactory("connect"), NioUdtProvider.MESSAGE_PROVIDER);

        // Configure the server.
        try {
            final ServerBootstrap boot = new ServerBootstrap();
            boot.group(acceptGroup, connectGroup)
                    .channelFactory(NioUdtProvider.MESSAGE_ACCEPTOR)
                    .option(ChannelOption.SO_BACKLOG, 10)
                    .handler(new LoggingHandler(LogLevel.DEBUG))
                    .childHandler(new ChannelInitializer<UdtChannel>() {
                        @Override
                        public void initChannel(final UdtChannel ch)
                                throws Exception {
                            ch.pipeline().addLast(
                                    new LoggingHandler(LogLevel.DEBUG),
                                    new ServerHandler());
                        }
                    });
            // Start the server.
            final ChannelFuture future = boot.bind(PORT).sync();
            // Wait until the server socket is closed.
            future.channel().closeFuture().sync();
        }
        finally {
            // Shut down all event loops to terminate all threads.
            acceptGroup.shutdownGracefully();
            connectGroup.shutdownGracefully();
        }
    }

Client:

    static final int BUFF_SIZE = Integer.parseInt(System.getProperty("buffer", Short.toString(Short.MAX_VALUE)));
    static final int NUM = Integer.parseInt(System.getProperty("num", "10000"));

    public static void main(String[] args) throws Exception {
        // Configure the client.
        final NioEventLoopGroup connectGroup =
                new NioEventLoopGroup(1, new DefaultThreadFactory("connect"), NioUdtProvider.MESSAGE_PROVIDER);

        try {
            final Bootstrap boot = new Bootstrap();
            boot.group(connectGroup)
                    .channelFactory(NioUdtProvider.MESSAGE_CONNECTOR)
                    .handler(new ChannelInitializer<UdtChannel>() {
                        @Override
                        public void initChannel(final UdtChannel ch)
                                throws Exception {
                            ch.pipeline().addLast(
                                    new LoggingHandler(LogLevel.DEBUG),
                                    new ClientHandler());
                        }
                    });
            // Start the client.
            final ChannelFuture f = boot.connect(HOST, PORT).sync();

            final ByteBuf startBuf = Unpooled.buffer(1, 1);
            startBuf.writeByte(0);
            System.out.println("Start packet");
            f.channel().writeAndFlush(new UdtMessage(startBuf));

            final ByteBuf byteBuf = Unpooled.buffer(BUFF_SIZE);
            for(int i = 0; i < byteBuf.capacity(); i++) {
                byteBuf.writeByte((byte) i);
            }

            for(int i = 0; i < NUM; i++) {
                System.out.println("Packet " + (i + 1));
                f.channel().writeAndFlush(new UdtMessage(byteBuf.copy()));
            }
            final ByteBuf endBuf = Unpooled.buffer(1, 1);
            endBuf.writeByte(1);
            System.out.println("End packet");
            f.channel().writeAndFlush(new UdtMessage(endBuf));

            // Wait until the connection is closed.
            f.channel().closeFuture().sync();
        }
        finally {
            // Shut down the event loop to terminate all threads.
            connectGroup.shutdownGracefully();
        }
    }

Thanks Yves