Closed Raycoms closed 7 years ago
I already changed the ports to different numbers, restarted the servers, pinged the others etc. Still not working.
I've been working with a codebase that uses that same constructor, but I've not been experiencing any problems. Have you tried using jconsole to fetch the JVM stack trace and see where and why it is blocking?
I never had any problems with this constructor until a week ago. The only thing I changed was merging my service from 2 instances per machine to 3 instances per machine (4 machines, 12 instances). But it already hangs when I start only the first 4 instances (f = 1, n = 4, initial view = 0,1,2,3).
I'm executing it on a Linux server and copied the console output from there. How can I see the stack trace?
Assuming you can connect directly to at least one of the servers without a firewall giving you a hassle (preferably replica 0), start one of the instances with these additional parameters:
-Dcom.sun.management.jmxremote.port=<port of your choosing> -Dcom.sun.management.jmxremote.rmi.port=<same port you used in the above parameter> -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false
On your local machine, type jconsole on the command line, choose "remote process" and connect to the machine using "host:port". Accept the insecure connection. Then go to the "threads" tab and look for the thread that invokes that constructor. You will see its stack trace.
Don't forget to use the same port for the first two parameters.
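If a GUI is inconvenient, the same JMX endpoint can also be queried from a small program. The following is a minimal sketch, assuming the replica was started with the jmxremote parameters listed above and that the port is reachable; the class name RemoteThreadDump and the helper serviceUrl are made up for illustration and are not part of BFT-SMaRt.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteThreadDump {

    /** Builds the usual RMI-based JMX URL for a JVM started with the jmxremote flags. */
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: RemoteThreadDump <host> <port>");
            return;
        }
        JMXServiceURL url = new JMXServiceURL(serviceUrl(args[0], Integer.parseInt(args[1])));
        // No credentials or SSL: this matches authenticate=false and ssl=false above.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            // Print every thread of the remote JVM with its state and full stack.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}
```

Run it with the host and the port chosen for com.sun.management.jmxremote.port, then look for the thread whose stack contains the ServiceReplica constructor.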
Unfortunately, I only have indirect access to the servers. Is it possible that I launch that program on the server?
If you can access the Linux servers over ssh, you can launch jconsole on that machine, provided you pass the -X flag when connecting via ssh. If the reason you only have indirect access is that you need to connect to a proxy first, don't forget to also use the -X flag when connecting to the proxy. If you launch jconsole this way, connect to the instance via "local process" instead of "remote process". And don't forget to launch the replica before invoking jconsole.
I have to connect via ssh to a proxy first, and from there I have to ssh again using a private key. Can't I install jconsole there?
jconsole comes bundled with Oracle's JDK and also with OpenJDK. Unless the machines have some other unusual JDK instead of those, you don't have to worry about installing it.
So why can't I just run it on that particular machine in a second console? And also, where do I have to put those config options you pasted up there?
That is the idea... you launch two consoles connected to the same machine: one where you launch the replica, the other where you launch jconsole. If you are launching jconsole directly on the machine, you don't need the config options. Don't forget to use the -X flag in both ssh commands.
Okay thanks will try that.
Ahh, jconsole needs a graphical display to open its windows. I can use gdb instead. bt returned this:
at pthread_join.c:92
from /usr/lib/jvm/java-8-oracle/jre/bin/../lib/amd64/jli/libjli.so
from /usr/lib/jvm/java-8-oracle/jre/bin/../lib/amd64/jli/libjli.so
Will try to get more information
2017-05-06 01:35:00 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.121-b13 mixed mode):
"Attach Listener" #75 daemon prio=9 os_prio=0 tid=0x00007f8650001000 nid=0x6cbb runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"DestroyJavaVM" #72 prio=5 os_prio=0 tid=0x00007f86b0009800 nid=0x689a waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"Server CS" #27 prio=5 os_prio=0 tid=0x00007f86b1c9c000 nid=0x68c8 waiting on condition [0x00007f8673136000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"TOM Layer" #68 prio=5 os_prio=0 tid=0x00007f86b1c99800 nid=0x68c7 waiting on condition [0x00007f8673237000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"Delivery Thread" #70 prio=5 os_prio=0 tid=0x00007f86b1c91000 nid=0x68c6 waiting on condition [0x00007f8673338000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"request timer" #69 prio=5 os_prio=0 tid=0x00007f86b1c8b800 nid=0x68c5 in Object.wait() [0x00007f8673618000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)
"nioEventLoopGroup-2-1" #35 prio=10 os_prio=0 tid=0x00007f86b1c82000 nid=0x68c4 runnable [0x00007f8673719000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
"Thread-3" #28 prio=5 os_prio=0 tid=0x00007f86b19e5800 nid=0x68c2 runnable [0x00007f867381a000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at bftsmart.communication.server.ServersCommunicationLayer.run(ServersCommunicationLayer.java:221)
"Receiver for 3" #34 prio=5 os_prio=0 tid=0x00007f86b19d6000 nid=0x68c1 runnable [0x00007f867391b000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.net.SocketInputStream.read(SocketInputStream.java:224) at java.io.DataInputStream.readInt(DataInputStream.java:387) at bftsmart.communication.server.ServerConnection$ReceiverThread.run(ServerConnection.java:492)
"Sender for 3" #33 prio=5 os_prio=0 tid=0x00007f86b19d2000 nid=0x68c0 waiting on condition [0x00007f8673a1c000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"Receiver for 2" #32 prio=5 os_prio=0 tid=0x00007f86b19d0800 nid=0x68bf runnable [0x00007f8673b1d000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.net.SocketInputStream.read(SocketInputStream.java:224) at java.io.DataInputStream.readInt(DataInputStream.java:387) at bftsmart.communication.server.ServerConnection$ReceiverThread.run(ServerConnection.java:492)
"Sender for 2" #31 prio=5 os_prio=0 tid=0x00007f86b19ce800 nid=0x68be waiting on condition [0x00007f8673c1e000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"Receiver for 1" #30 prio=5 os_prio=0 tid=0x00007f86b19cc800 nid=0x68bd runnable [0x00007f8673d1f000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.net.SocketInputStream.read(SocketInputStream.java:224) at java.io.DataInputStream.readInt(DataInputStream.java:387) at bftsmart.communication.server.ServerConnection$ReceiverThread.run(ServerConnection.java:492)
"Sender for 1" #29 prio=5 os_prio=0 tid=0x00007f86b19cb800 nid=0x68bc waiting on condition [0x00007f86781a6000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"neo4j.PauseMonitor" #24 daemon prio=5 os_prio=0 tid=0x00007f86b082c000 nid=0x68ba in Object.wait() [0x00007f867930e000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.neo4j.kernel.impl.cache.MeasureDoNothing.run(MeasureDoNothing.java:54)
"New I/O server boss #4" #23 daemon prio=5 os_prio=0 tid=0x00007f86b07ac800 nid=0x68b9 runnable [0x00007f867960e000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
"New I/O worker #3" #22 daemon prio=5 os_prio=0 tid=0x00007f86b07b4800 nid=0x68b8 runnable [0x00007f8679710000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
"New I/O worker #2" #21 daemon prio=5 os_prio=0 tid=0x00007f86b07b4000 nid=0x68b7 runnable [0x00007f8679810000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
"New I/O worker #1" #20 daemon prio=5 os_prio=0 tid=0x00007f86b07b2800 nid=0x68b6 runnable [0x00007f8679912000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
"Silent channel reaper-1" #19 prio=5 os_prio=0 tid=0x00007f86b07a0000 nid=0x68b5 waiting on condition [0x00007f8679c13000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"neo4j.Scheduled-2" #18 daemon prio=5 os_prio=0 tid=0x00007f86b0687800 nid=0x68b4 waiting on condition [0x00007f8679d14000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"MuninnPageCache[1]-EvictionTask" #13 daemon prio=5 os_prio=0 tid=0x00007f86b0468800 nid=0x68af runnable [0x00007f867a215000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"neo4j.Scheduled-1" #12 daemon prio=5 os_prio=0 tid=0x00007f86b0439800 nid=0x68ae waiting on condition [0x00007f867a316000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)
"Neo4j UDC Timer" #10 daemon prio=5 os_prio=0 tid=0x00007f86b03e8800 nid=0x68ad in Object.wait() [0x00007f867a617000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.util.TimerThread.mainLoop(Timer.java:552)
"Service Thread" #9 daemon prio=9 os_prio=0 tid=0x00007f86b00d2000 nid=0x68ab runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"C1 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007f86b00c7000 nid=0x68aa waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"C2 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007f86b00c4800 nid=0x68a9 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f86b00c3000 nid=0x68a8 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f86b00c0000 nid=0x68a7 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f86b00be800 nid=0x68a6 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE
"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f86b008b800 nid=0x68a5 in Object.wait() [0x00007f868832b000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f86b0087000 nid=0x68a4 in Object.wait() [0x00007f868842c000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)
"VM Thread" os_prio=0 tid=0x00007f86b007f800 nid=0x68a3 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f86b001f000 nid=0x689b runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f86b0020800 nid=0x689c runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f86b0022800 nid=0x689d runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f86b0024000 nid=0x689e runnable
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f86b0026000 nid=0x689f runnable
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f86b0027800 nid=0x68a0 runnable
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007f86b0029800 nid=0x68a1 runnable
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007f86b002b000 nid=0x68a2 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007f86b00d4800 nid=0x68ac waiting on condition
JNI global references: 668
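On a headless server where jconsole cannot open a window, a text dump similar to the one above can also be produced from inside the JVM. The sketch below is an illustration only: it uses the standard java.lang.management API, the class name HeadlessThreadDump is made up, and it would have to be called from the application itself (for example from a debug hook), which is an assumption about how the service could be instrumented.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HeadlessThreadDump {

    /** Returns a text dump of every live thread in this JVM, one block per thread. */
    public static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Number of threads caught in a lock cycle; 0 means no deadlock was detected. */
    public static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.print(dump());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

A deadlock count of 0 combined with threads that are only parked or waiting on sockets suggests the hang is not a lock cycle, so the next thing to check is whether the thread that called the constructor is still alive at all.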
I can't see anything strange in the stack traces after all. Can you reproduce this problem on your local machine? If you can, please prepare a folder with your codebase along with instructions on how to reproduce the problem and send it to me.
Yes it happens every single time:
The code: https://github.com/Raycoms/thesis
You build it with ./gradlew shadowjar
and run it:
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 0 neo4j 0 0 true false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 4 neo4j 1 0 false false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 5 neo4j 2 0 false false

java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 1 neo4j 0 1 true false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 6 neo4j 1 1 false false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 7 neo4j 2 1 false false

java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 2 neo4j 0 2 true false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 8 neo4j 1 2 false false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 9 neo4j 2 2 false false

java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 3 neo4j 0 3 true false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 13 neo4j 1 3 false false
java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 14 neo4j 2 3 false false
Each group of three is running on a separate machine. The repository also contains the configuration used.
Running only the first of every group already suffices to reproduce the problem.
I can't find the source file of main.java.com.bag.server.ServerWrapper
I can't compile it either:
FAILURE: Build failed with an exception.
Wait a minute... Do you happen to be trying to launch 12 replicas while having the system configured to use 4?
No, I'm launching 5 instances of BFT-SMaRt: 1 with 4 replicas and 4 with 3 replicas. But the problem already arises when starting the first 4 of them. Did you check out the repository and execute ./gradlew shadowjar inside the "thesis" folder?
Yes I did, and I got the error I posted a few comments up.
Do you have gradle installed on your computer/server?
Oh, please don't check out Master. Check out "working" branch.
I've managed to compile it under that branch, but I get this error when I invoke "java -cp build/libs/1.0-0.1-Setup-fat.jar main.java.com.bag.server.ServerWrapper 0 neo4j 0 0 true false":
Exception in thread "main" java.lang.NullPointerException
at main.java.com.bag.server.AbstractRecoverable.getSnapshot(AbstractRecoverable.java:243)
at bftsmart.tom.server.defaultservices.DefaultRecoverable.initLog(DefaultRecoverable.java:377)
at bftsmart.tom.server.defaultservices.DefaultRecoverable.setReplicaContext(DefaultRecoverable.java:400)
at bftsmart.tom.ServiceReplica.<init>(ServiceReplica.java:141)
at main.java.com.bag.server.AbstractRecoverable.<init>(AbstractRecoverable.java:124)
at main.java.com.bag.server.GlobalClusterSlave.<init>(GlobalClusterSlave.java:66)
at main.java.com.bag.server.ServerWrapper.<init>(ServerWrapper.java:81)
at main.java.com.bag.server.ServerWrapper.main(ServerWrapper.java:267)
Please make sure that it is possible to launch all instances and replicas within the same machine and that the error is also reproducible within that machine.
Okay, I pushed an update; it should work now. Launching everything on one machine will be hard because of the databases. Try it with "none" instead of "neo4j" in the parameters, which spawns an empty database, so that it works on one machine. You would have to set the addresses in the BFT-SMaRt configs to 127.0.0.1, though.
Ok, let's focus on just reproducing the error without the rest of your codebase. Prepare a little program with some simple application (e.g., the counter client/server in the demo package) that is capable of reproducing the bug.
It seems that fixing the exception you hit above resolved the problem. For some reason, that exception was never printed on the servers where I am running this.
Can you explain to me why the server runs getSnapshot before the constructor has finished?
Because the library needs to start execution with a snapshot of its state already in memory. This is needed because a replica may crash and recover before any other replica performs a snapshot at the interval specified in the configuration file. If they didn't begin execution already with one in memory, they would only have the log to send to the recovering replica but not the snapshot.
Still, it would be better to let the constructor finish and only then take the snapshot, to be sure that all sets, lists and other data have been initialized properly.
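The initialization-order problem discussed here can be illustrated without the library. The following is a minimal sketch with made-up stand-ins: Snapshotable and FakeReplica are hypothetical and only mimic the behaviour described above (a snapshot callback fired from inside the replica's constructor); they are not BFT-SMaRt classes. It shows why a field assigned after the replica is created is still null during the callback, and one possible way to avoid that.

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotInitOrder {

    /** Hypothetical callback, standing in for the library's snapshot hook. */
    interface Snapshotable {
        byte[] getSnapshot();
    }

    /** Hypothetical replica that requests a snapshot while it is being constructed. */
    static final class FakeReplica {
        final byte[] initialSnapshot;

        FakeReplica(Snapshotable service) {
            // The callback fires here, before the caller's constructor has finished.
            this.initialSnapshot = service.getSnapshot();
        }
    }

    /** Creates the replica first, so the callback sees an unassigned field. */
    static final class UnsafeService implements Snapshotable {
        private FakeReplica replica;
        private List<String> state;

        UnsafeService() {
            this.replica = new FakeReplica(this); // getSnapshot() runs during this call
            this.state = new ArrayList<>();       // too late: state was null in the callback
        }

        @Override
        public byte[] getSnapshot() {
            return new byte[state.size()]; // NullPointerException while state is null
        }
    }

    /** Initializes its state in a field initializer, which runs before the constructor body. */
    static final class SafeService implements Snapshotable {
        private final List<String> state = new ArrayList<>();
        private final FakeReplica replica;

        SafeService() {
            this.replica = new FakeReplica(this); // state already exists here
        }

        @Override
        public byte[] getSnapshot() {
            return new byte[state.size()];
        }
    }

    /** True if constructing the unsafe variant fails with a NullPointerException. */
    public static boolean unsafeThrows() {
        try {
            new UnsafeService();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Length of the snapshot captured during construction of the safe variant. */
    public static int safeSnapshotLength() {
        return new SafeService().replica.initialSnapshot.length;
    }

    public static void main(String[] args) {
        System.out.println("unsafe ordering throws NPE: " + unsafeThrows());
        System.out.println("safe ordering snapshot length: " + safeSnapshotLength());
    }
}
```

Under these assumptions, the application-side remedy is to make sure everything getSnapshot touches is initialized before the replica object is created, or to have getSnapshot tolerate not-yet-initialized state.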
Additionally, it seems that in some cases BFT-SMaRt loses ordering. When getSnapshot takes a long time to execute, the executeBatch calls may get out of order.
Hey there,
I am working with BFT-SMaRt in the context of my master's thesis, and it started to show some strange behavior after I set it up on 4 different servers.
It gets stuck at:
this.replica = new ServiceReplica(id,configDirectory, this, this, null, new DefaultReplier());
On all the servers, the output is:
Config home: global/config
Config home in getViewStore: global/config
Trying with alternative part: /home/ubuntu/thesis/global/config/currentView
-- Creating current view from configuration file
-- ID = 0
-- N = 4
-- F = 1
-- Port = 11300
-- requestTimeout = 2000
-- maxBatch = 400
-- Using MACs
-- In current view: ID:0; F:1; Processes:0(/172.31.0.18:11300),1(/172.31.0.19:11310),2(/172.31.0.20:11320),3(/172.31.0.23:11330),
(17/05/03 16:09:05 - TOM Layer) Running.
(17/05/03 16:09:05 - TOM Layer) Next leader for CID=0: 0
(17/05/03 16:09:05 - TOM Layer) (TOMLayer.run) I'm the leader.
-- Diffie-Hellman complete with 1
-- Diffie-Hellman complete with 3
-- Diffie-Hellman complete with 2
Every server gets as far as completing the Diffie-Hellman exchange, but then stops and never progresses to the lines after the constructor call.
I'm quite close to finishing my thesis and help would be very welcome.
Thanks already,
Ray