ghost opened this issue 7 years ago
Can you provide the full log of the server that failed? Also the lead log.
Since the status showed "stopping", the server is taking a while to stop. Run snappy-stop-all again; it should then report "stopped", and after that you can start the servers.
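For example, a sketch of the sequence being suggested (both scripts ship in the product's sbin directory):

```sh
./sbin/snappy-stop-all.sh    # run again; should now report "stopped"
./sbin/snappy-start-all.sh   # then restart the cluster
```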
I ran ./sbin/snappy-stop-all.sh six times; the result is still the same:
```
./sbin/snappy-stop-all.sh
node76.st: Timeout waiting for SnappyData Server to shutdown on node76.st, status is: SnappyData Server pid: 5027 status: stopping
node75.st: Timeout waiting for SnappyData Server to shutdown on node75.st, status is: SnappyData Server pid: 30055 status: stopping
node74.st: The specified working directory (/snappydata/snappydata-0.9-data/locator) on node74.st contains no status file
node73.st: The specified working directory (/snappydata/snappydata-0.9-data/locator) on node73.st contains no status file
```
BTW, I have uploaded the log files of the leads and servers. Thanks.
@niko2014 I think what happened is that the server was forced out of the distributed system because it was unresponsive -- see the server log:
```
17/07/05 15:00:53.398 CST CloserThread<tid=0x935> ERROR snappystore: Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system. Reason='did not respond to are-you-dead messages'
17/07/05 15:00:53.408 CST Executor task launch worker-75<tid=0x922> INFO snappystore: Region /SNAPPYSYS_INTERNAL/APP____TEST_APPEND_070502_COLUMN_STORE_ putAll: Key ColumnKey(columnIndex=1,partitionId=103,uuid=00000001-0000-00c6-8ae5-6d680d6a231f) and possibly others failed to put due to
```
After this the server tried to automatically reconnect to the distributed system, but was then told to stop. There may be a deadlock in that scenario. It is safe to force-kill the server at this point using the PIDs being printed, 30055 and 5027.
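For example, a minimal cleanup sketch, assuming SSH access to the nodes and using the hostnames/PIDs from the status output above:

```sh
# Force-kill the stuck server processes reported by snappy-stop-all.sh
# (PIDs taken from the output above; substitute your own).
ssh node75.st kill -9 30055
ssh node76.st kill -9 5027

# Verify nothing is left before restarting
# ([s]nappy keeps grep from matching its own process).
ssh node75.st "ps -ef | grep [s]nappy"
ssh node76.st "ps -ef | grep [s]nappy"
```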
Coming to the main issue of the member becoming unresponsive: the most likely reason we have seen for this in testing is long GC pauses. We have eliminated nearly all garbage from most of the store insert paths, but unfortunately the Parquet reader generates huge amounts of it. If the table is partitioned, this combines with the garbage from EXCHANGE and causes a lot of trouble. If you have lots of headroom in the heap it works out, but otherwise it can cause this issue.
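One way to confirm this is to enable GC logging on the servers and check whether full-GC pauses approach the member-timeout. A sketch, assuming the launcher forwards JVM options via the usual -J prefix (the log path is illustrative):

```
# conf/servers -- hypothetical entry with Java 8 GC logging enabled
server1 -heap-size=8g -J-verbose:gc -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-Xloggc:/snappydata/logs/server1-gc.log
```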
There are two options to avoid this:
a) Use off-heap, which is recommended especially when the entire table data will not comfortably fit in memory. Use the "-memory-size" option in conf/servers, which causes all column tables to use off-heap (row tables do not have tested off-heap support yet). You will still need a decent heap-size for normal JVM functioning, and because some components like the Spark Parquet reader do not have off-heap support. Recommended is a heap-size of about 8g and as much memory-size as you can afford. For very wide schemas (> 100 columns) the Parquet reader needs more room, so heap-size should be a minimum of 16g. For your case, something like the line below (a combined conf sketch for both options follows after b):
```
server1 -heap-size=8g -memory-size=20g ...
```
Keep the total heap-size + memory-size at most 80-90% of the available physical RAM on the node, since going into swap is much worse than table eviction. Having some amount of swap (min 32g) is recommended in any case, to avoid the machine running out of physical RAM due to other processes, JVM overhead, etc. If the node starts running out of physical RAM, the Linux OOM killer will target the biggest process first (i.e. the SnappyData server process).
b) Increase the member-timeout to something like 30s: -member-timeout=30000 in all of conf/locators, conf/leads and conf/servers.
You may want to do b) in any case if quick node-departure detection is not required for your use-case.
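Putting a) and b) together, a sketch of the conf files. The server and locator hostnames are taken from the status output above; the lead hostname and the exact sizes are placeholders to be tuned to your machines:

```
# conf/servers
node75.st -heap-size=8g -memory-size=20g -member-timeout=30000
node76.st -heap-size=8g -memory-size=20g -member-timeout=30000

# conf/locators
node73.st -member-timeout=30000
node74.st -member-timeout=30000

# conf/leads (replace lead-host with your actual lead node)
lead-host -member-timeout=30000
```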
@niko2014 Did you manage to try with off-heap or increased member-timeout? The newer releases increase the member-timeout by default to 30s.
environment:
The SnappyData servers crashed while I was inserting rows into a table.
executor crash log:
To start the servers I run:

```
sbin/snappy-servers.sh start
```

When I stop all, I get the timeouts shown above. But on node75.st, `ps -ef | grep snappy` returns nothing. Are there PID files I should delete manually? What should I do to get the data servers to start up again? Thanks.
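If nothing is running but the launcher still reports a stale state, a manual-cleanup sketch (the server working-directory path is an assumption modeled on the locator path in the messages above; status-file names vary by version, so list the directory rather than guessing them):

```sh
# Confirm no SnappyData process remains on the node.
ps -ef | grep [s]nappy

# Inspect the member's working directory for leftover status/lock dot-files
# written by the launcher (path assumed; adjust to your configuration).
ls -a /snappydata/snappydata-0.9-data/server

# After removing any stale status file found there, restart the cluster.
./sbin/snappy-start-all.sh
```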