RuedigerMoeller / fast-cast

High-performance, low-latency reliable multicast messaging
GNU Lesser General Public License v3.0

Bug with immediate send to FC #4

Closed dmart28 closed 8 years ago

dmart28 commented 8 years ago

Hi guys.

I am facing an issue which I suspect is a bug. My case is as follows: I create two FastCast instances in the same process, then init them both (one by one) like this:

fastCast.onTransport(config.transportName()).subscribe(fastCast.getSubscriberConf(config.topicName()), new FCSubscriber() {
            @Override
            public void messageReceived(String sender, long sequence, Bytez b, long off, int len) {
                // some fancy logic
            }

            @Override
            public boolean dropped() {
                return true;
            }

            @Override
            public void senderTerminated(String senderNodeId) {
            }

            @Override
            public void senderBootstrapped(String receivesFrom, long seqNo) {
            }
        });
// I want the publisher to be pre-initialized as early as possible
        publisher = fastCast.onTransport(config.transportName()).publish(fastCast.getPublisherConf(config.topicName()));

Immediately after that I start calling offer(..) on one FC and expect the message to be delivered on the other. But most of the time that's not the case: the message just gets lost, even though offer returned true!

The only workaround is to do a Thread.sleep(1000) between init and the first offer. So, to conclude:

1) My config is fine: I receive messages 100% of the time if I sleep for a while after init.
2) The first message is almost always lost if I start offering immediately.
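
For reference, a minimal sketch of the publisher side of this setup (the package names, the node-id setter, the config file path, the "default"/"test" transport and topic names, and the exact offer(..) signature are assumptions based on the project README, not taken from the actual example app):

import org.nustaq.fastcast.api.FCPublisher;
import org.nustaq.fastcast.api.FastCast;

public class ImmediateSendRepro {
    public static void main(String[] args) throws Exception {
        FastCast fc = FastCast.getFastCast();
        fc.setNodeId("SND01");                 // short node id, unique per instance
        fc.loadConfig("fc.kson");              // hypothetical config file

        FCPublisher pub = fc.onTransport("default")
                            .publish(fc.getPublisherConf("test"));

        // Workaround described above: without this pause the first offer()
        // is accepted (returns true) but never reaches the subscriber.
        // Thread.sleep(1000);

        byte[] msg = "hello".getBytes();
        boolean accepted = pub.offer(null /* all receivers */, msg, 0, msg.length, true /* flush */);
        System.out.println("offer accepted: " + accepted); // prints true even when the message is lost
    }
}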

I'd really appreciate your support, since I'd like to believe such a great tool can serve as a reliable instrument and my transactions will not become lost garbage in space :)

RuedigerMoeller commented 8 years ago

Hm .. there has been a fair amount of testing, and fast-cast 1.0 actually runs in a production system (many changes since then, though). I never tested with two instances inside one process (there might be unforeseen consequences, singletons :) ). Does the same happen if you test with 2 separate processes?

I'll investigate unresolved fast-cast issues tomorrow evening .. haven't looked into it for a while. Cheers

ecmnet commented 8 years ago

I did not have that issue using two different topics for publish and subscribe...

dmart28 commented 8 years ago

I haven't tried with 2 separate processes, since there is no point in doing that if it doesn't work in one process for me. I am also using a single topic.

OK, so all I can do now is try to provide you with a running example that shows the trouble I'm having. I will do it soon. Thanks for the prompt answers so far.

dmart28 commented 8 years ago

Sharing an example app that reproduces the bug. If you uncomment the Thread.sleep(..) line, it works fine; otherwise it hangs forever waiting for the message. Hope it helps with the investigation.

https://drive.google.com/file/d/0B7fxuj58GeQkVF93X0VkOUtsT0k/view?usp=sharing

RuedigerMoeller commented 8 years ago

Hi, unfortunately I cannot access Google Drive from the office (I'll have to check back from home).

However, just from a "dry" analysis, in-process multicast on the same topic cannot work properly. [edit: it works if those instances have different node ids] Reason:

In addition, be aware that the edgy configuration settings I ran with kernel-bypass drivers won't work on stock hardware. Keep pps <= 10_000 and packet size at 2-4k.

dmart28 commented 8 years ago

I see, but the thing is that I am generating a unique NodeId for each FastCast instance and then starting them. So I expected no overlap to be caused that way.

dmart28 commented 8 years ago

Yes, I use this setup solely for testing (two instances in one process, for example).

RuedigerMoeller commented 8 years ago

Usually FastCast.getFastCast() is used. I am not sure if this works with multiple instances (e.g. some internal code might access the FastCast singleton). Trying to reproduce, stay tuned.

RuedigerMoeller commented 8 years ago

I made a test (with crappy win7 localhost implementation)

https://github.com/RuedigerMoeller/fast-cast/tree/3.0/src/test/java/basic/stuff

1) Don't use the "sendImmediate"/"flush" flag unless you are in a low-latency environment with appropriate network hardware and stack. This flag creates many packets, so a bad network stack implementation / bad network hardware messes things up quickly.
2) When sending + receiving on localhost/windoze (I usually use linux) on a single topic, I needed to lower the pps to 5000 to get reliable throughput (see the sample conf). The pps window should be left at the default (=1000).
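
As an illustration of point 1, a sketch of the batched (non-immediate) send path, reusing a pub FCPublisher obtained as in the earlier sketch (the offer(..) signature is again an assumption based on the README):

// flush = false lets fast-cast coalesce messages into full datagrams and
// flush them automatically after a few milliseconds, keeping the packet rate low.
byte[] msg = "payload".getBytes();
while (!pub.offer(null /* all receivers */, msg, 0, msg.length, false /* no immediate flush */)) {
    // offer returns false under backpressure (pps budget exhausted): retry
    Thread.yield();
}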

next: test single process

RuedigerMoeller commented 8 years ago

Hi, send/receive test is here:

https://github.com/RuedigerMoeller/fast-cast/tree/3.0/src/test/java/basic/stuff

It worked fine (no loss)

No issues, no message loss so far. Note that I assign different node ids to each FastCast instance + lowered the pps. A ppsWindow of 100, as present in some of my samples, does not work on stock hardware/OSes, so I removed it from the config (it then defaults to 1000).
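
For clarity, a sketch of that node-id assignment (the setNodeId name follows the project README; whether two FastCast objects can be instantiated directly in one process, as the reporter does, or only one per process via getFastCast(), is not confirmed in this thread):

// Two endpoints, each with its own short, unique node id (assumed direct instantiation).
FastCast sender = new FastCast();
sender.setNodeId("SND01");
sender.loadConfig("fc.kson");          // hypothetical config file

FastCast receiver = new FastCast();
receiver.setNodeId("RCV01");           // must differ from the sender's node id
receiver.loadConfig("fc.kson");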

RuedigerMoeller commented 8 years ago

Ensure you are using fst 2.19; I am currently investigating why fast-cast is not running with newer fst releases.

dmart28 commented 8 years ago

I really appreciate your work. It seems to really work for me now. Just a question - it seems I am using branch 3.0. Does that mean I am somehow using fst 2.19?

dmart28 commented 8 years ago

Ah, I see. Yes, I was using FST 2.19.

RuedigerMoeller commented 8 years ago

Branch 3.0 is the correct default "master". I just messed up the GitHub setup and was too lazy to clean it up :)

dmart28 commented 8 years ago

One thing I finally noticed: if I send just a single message (not millions), even though I send it with flush=true and so on, it never arrives until I send more messages. Is that a feature? Is there any way to push it through intentionally?

RuedigerMoeller commented 8 years ago

That might be a bug .. let me test it (we send hundreds of thousands of messages, so this might not show up since the network is never quiet :) ).

"flush" should trigger instantly. flush = false within 1-3 milliseconds (config).

dmart28 commented 8 years ago

OK, it seems it was a bug in my code. I'm really thankful for your quick responses and help. I think this one can be closed; many things are clear now.

RuedigerMoeller commented 8 years ago

I cannot reproduce this. Added another test in https://github.com/RuedigerMoeller/fast-cast/tree/3.0/src/test/java/basic/stuff .

I am now on Linux at home (CentOS 7). The test sends a timestamp so the receiver can measure the delay of a message.

It prints the following:

t707-p5d0 sending 0 1446489444404
t405-glcd received 0 delay: 1
t405-glcd sending 2 1446489444459
t707-p5d0 received 2 delay: 0
t707-p5d0 sending 1 1446489445167
t405-glcd received 1 delay: 1
t707-p5d0 sending 2 1446489446255
t405-glcd received 2 delay: 0
t405-glcd sending 3 1446489448160
t707-p5d0 received 3 delay: 0
t707-p5d0 sending 3 1446489449960
t405-glcd received 3 delay: 1
t707-p5d0 sending 4 1446489451250
t405-glcd received 4 delay: 1
t707-p5d0 sending 5 1446489452011
t405-glcd received 5 delay: 0
t405-glcd sending 4 1446489452188
t707-p5d0 received 4 delay: 1
t405-glcd sending 5 1446489452426
t707-p5d0 received 5 delay: 0
t707-p5d0 sending 6 1446489453852
t405-glcd received 6 delay: 0
t405-glcd sending 6 1446489456079
t707-p5d0 received 6 delay: 0
t707-p5d0 sending 7 1446489457858
t405-glcd received 7 delay: 1
t405-glcd sending 7 1446489459202
t707-p5d0 received 7 delay: 1
t707-p5d0 sending 8 1446489462390
t405-glcd received 8 delay: 0
t405-glcd sending 8 1446489463846
t707-p5d0 received 8 delay: 1
t405-glcd sending 9 1446489466307
t707-p5d0 received 9 delay: 0
t707-p5d0 received: 10
t707-p5d0 sending 9 1446489467328
t405-glcd received 9 delay: 0
t405-glcd received: 10
t405-glcd sending 0 1446489468926
t707-p5d0 received 0 delay: 1
t405-glcd sending 1 1446489470957
t707-p5d0 received 1 delay: 1
t707-p5d0 sending 0 1446489471415
t405-glcd received 0 delay: 1
t707-p5d0 sending 1 1446489471754
t405-glcd received 1 delay: 0
t707-p5d0 sending 2 1446489471807
t405-glcd received 2 delay: 0
t405-glcd sending 2 1446489471898
t707-p5d0 received 2 delay: 0
t405-glcd sending 3 1446489473665
t707-p5d0 received 3 delay: 0
t707-p5d0 sending 3 1446489473740
t405-glcd received 3 delay: 1
t405-glcd sending 4 1446489474683
t707-p5d0 received 4 delay: 1
t707-p5d0 sending 4 1446489474786
t405-glcd received 4 delay: 1
t405-glcd sending 5 1446489475610
t707-p5d0 received 5 delay: 1
t405-glcd sending 6 1446489476561
t707-p5d0 received 6 delay: 0
t707-p5d0 sending 5 1446489478508
t405-glcd received 5 delay: 0
t405-glcd sending 7 1446489479763
t707-p5d0 received 7 delay: 1
t405-glcd sending 8 1446489480072
t707-p5d0 received 8 delay: 1
t405-glcd sending 9 1446489480243
t707-p5d0 received 9 delay: 0
t707-p5d0 received: 10
t707-p5d0 sending 6 1446489482308
t405-glcd received 6 delay: 0
t707-p5d0 sending 7 1446489483204
t405-glcd received 7 delay: 1
t405-glcd sending 0 1446489484750
t707-p5d0 received 0 delay: 0
t405-glcd sending 1 1446489486951
t707-p5d0 received 1 delay: 1
t707-p5d0 sending 8 1446489487651
t405-glcd received 8 delay: 0
t405-glcd sending 2 1446489487734
t707-p5d0 received 2 delay: 0
t707-p5d0 sending 9 1446489489898
t405-glcd received 9 delay: 0
t405-glcd received: 10
t405-glcd sending 3 1446489491235
t707-p5d0 received 3 delay: 0
t707-p5d0 sending 0 1446489492249
t405-glcd received 0 delay: 1
t405-glcd sending 4 1446489492924
t707-p5d0 received 4 delay: 0
t707-p5d0 sending 1 1446489494980
t405-glcd received 1 delay: 1
t405-glcd sending 5 1446489495740
t707-p5d0 received 5 delay: 1

RuedigerMoeller commented 8 years ago

Note on bandwidth: the current settings limit you to 20 MB/sec. To increase bandwidth on bad hardware/OS, increase the datagram size (see the top of the config) to max 16 kB. On better hardware, increase pps first, then the datagram size. One can reach up to 100 MB/s of constant traffic on a 1 GBit network, no problem.
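
As a back-of-the-envelope check of these numbers (assuming throughput is roughly pps times datagram size):

int pps = 5_000;              // packets per second, as in the sample conf above
int datagramSize = 4_000;     // bytes per datagram (2-4k on stock hardware)
long bytesPerSecond = (long) pps * datagramSize;
System.out.println(bytesPerSecond / 1_000_000 + " MB/s");   // ~20 MB/s
// With 16 kB datagrams at the same pps this rises to ~80 MB/s;
// on better hardware raise pps first, then the datagram size.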

Closing this. Thanks for reporting :)