TIBCOSoftware / snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
http://www.snappydata.io

SnappyData servers cannot start after crashing #686

Open ghost opened 7 years ago

ghost commented 7 years ago

Environment:

snappydata 0.9

4 machines:
node73.st: locator
node74.st: locator
node75.st: server
node76.st: server, lead

The SnappyData servers crashed while I was inserting rows into a table.

Executor crash log:

17/07/05 15:00:57.228 CST Executor task launch worker-74<tid=0x921> ERROR SnappyExecutor: Exception in task 40.1 in stage 20.0 (TID 816)
com.gemstone.gemfire.cache.CacheClosedException: A cache has not yet been created.
        at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.getExisting(GemFireCacheImpl.java:844)
        at com.pivotal.gemfirexd.internal.engine.Misc.getGemFireCache(Misc.java:186)
        at com.pivotal.gemfirexd.internal.engine.Misc.getRegion(Misc.java:444)
        at com.pivotal.gemfirexd.internal.engine.Misc.getRegionForTable(Misc.java:416)
        at org.apache.spark.sql.execution.columnar.impl.JDBCSourceAsColumnarStore.getPartitionID(JDBCSourceAsColumnarStore.scala:383)
        at org.apache.spark.sql.execution.columnar.impl.JDBCSourceAsColumnarStore.storeColumnBatch(JDBCSourceAsColumnarStore.scala:72)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.columninsert_storeColumnBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.hasNext(WholeStageCodegenExec.scala:510)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:79)
        at org.apache.spark.scheduler.Task.run(Task.scala:107)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:277)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
...
17/07/05 15:00:59.327 CST ReconnectThread<tid=0x935> WARN StorageMemoryPool: Attempted to release 2980 bytes of storage memory when we only have 0 bytes
17/07/05 15:00:59.328 CST ReconnectThread<tid=0x935> INFO snappystore: closing DiskStore[GFXD-DEFAULT-DISKSTORE]
17/07/05 15:00:59.329 CST ReconnectThread<tid=0x935> INFO snappystore: Unlocked disk store GFXD-DEFAULT-DISKSTORE
17/07/05 15:00:59.329 CST ReconnectThread<tid=0x935> INFO snappystore: Stopping DiskStoreTaskPool
17/07/05 15:00:59.329 CST ReconnectThread<tid=0x935> INFO snappystore: closing DiskStore[GFXD-DD-DISKSTORE]
17/07/05 15:00:59.331 CST ReconnectThread<tid=0x935> INFO snappystore: Unlocked disk store GFXD-DD-DISKSTORE
17/07/05 15:00:59.331 CST ReconnectThread<tid=0x935> INFO snappystore: Stopping DiskStoreTaskPool
17/07/05 15:00:59.334 CST ReconnectThread<tid=0x935> INFO snappystore: Shutting down DistributionManager node76(5027)<v3>:62167. At least one Exception occurred.
...
17/07/05 15:02:00.574 CST ReconnectThread<tid=0x935> INFO snappystore: GemFire P2P Listener started on  tcp:///172.20.2.76:22090
17/07/05 15:02:01.206 CST PingSender<tid=0x965> INFO snappystore: locator node74(null)<v0>:9990 member address is node74(12942)<v1>:25305
17/07/05 15:02:01.210 CST PingSender<tid=0x965> INFO snappystore: locator node73(null)<v0>:9990 member address is node73(33726)<v0>:23591
17/07/05 15:02:01.211 CST ReconnectThread<tid=0x935> INFO snappystore: Attempting to join distributed system whose membership coordinator is node73(33726)<v0>:23591 using membership ID node76:62167
17/07/05 15:02:01.364 CST ReconnectThread<tid=0x935> INFO snappystore: Membership: lead member is now node73(33726)<v0>:23591
17/07/05 15:02:01.369 CST FD_SOCK Ping thread<tid=0x967> INFO snappystore: GemFire failure detection is now monitoring node73(33726)<v0>:23591
17/07/05 15:02:01.370 CST ReconnectThread<tid=0x935> INFO snappystore: Entered into membership with ID node76(5027)<v9>:62167.
17/07/05 15:02:01.370 CST ReconnectThread<tid=0x935> INFO snappystore: Starting DistributionManager node76(5027)<v9>:62167. 
17/07/05 15:02:01.370 CST ReconnectThread<tid=0x935> INFO snappystore: Initial (membershipManager) view =  [node73(33726)<v0>:23591{lead}, node74(12942)<v1>:25305, node76(5584)<v4>:39162, node75<v8>:45932, node76<v9>:62167]
17/07/05 15:02:01.371 CST ReconnectThread<tid=0x935> INFO snappystore: Admitting member <node73(33726)<v0>:23591>. Now there are 1 non-admin member(s).
17/07/05 15:02:01.371 CST ReconnectThread<tid=0x935> INFO snappystore: Admitting member <node74(12942)<v1>:25305>. Now there are 2 non-admin member(s).
17/07/05 15:02:01.371 CST ReconnectThread<tid=0x935> INFO snappystore: Admitting member <node76(5584)<v4>:39162>. Now there are 3 non-admin member(s).
17/07/05 15:02:01.371 CST ReconnectThread<tid=0x935> INFO snappystore: Admitting member <node75<v8>:45932>. Now there are 4 non-admin member(s).
17/07/05 15:02:01.371 CST ReconnectThread<tid=0x935> INFO snappystore: Admitting member <node76<v9>:62167>. Now there are 5 non-admin member(s).
17/07/05 15:02:01.375 CST UDP ucast receiver<tid=0x961> INFO snappystore: Member node74(12942)<v1>:25305 is not equivalent or in the same redundancy zone.
17/07/05 15:02:01.377 CST ReconnectThread<tid=0x935> INFO snappystore: DistributionManager node76(5027)<v9>:62167 started on node74.st[9990],node73.st[9990]. There were 4 other DMs. others: [node75<v8>:45932, node73(33726)<v0>:23591, node74(12942)<v1>:25305, node76(5584)<v4>:39162]  
17/07/05 15:02:01.377 CST UDP ucast receiver<tid=0x961> INFO snappystore: Member node73(33726)<v0>:23591 is not equivalent or in the same redundancy zone.
17/07/05 15:02:01.378 CST UDP ucast receiver<tid=0x961> INFO snappystore: Member node76(5584)<v4>:39162 is equivalent or in the same redundancy zone.
17/07/05 15:02:01.379 CST UDP ucast receiver<tid=0x961> INFO snappystore: Member node75<v8>:45932 is not equivalent or in the same redundancy zone.
17/07/05 15:02:01.384 CST Thread-1576 StatSampler<tid=0x969> INFO snappystore: Renamed old existing archive to "snappyserver-10-01.gfs".
17/07/05 15:02:01.385 CST ReconnectThread<tid=0x935> INFO snappystore: rebooting GemFireXD server instance
17/07/05 15:02:01.386 CST ReconnectThread<tid=0x935> WARN snappystore: (tid=2357 msgId=3667) Failed to set options {OPT_KEEPIDLE=20, OPT_KEEPCNT=10, OPT_KEEPINTVL=1} on socket: java.lang.UnsupportedOperationException: setting native socket options [OPT_KEEPIDLE, OPT_KEEPCNT, OPT_KEEPINTVL] not possible in generic implementation
17/07/05 15:02:01.391 CST Thread-1576 StatSampler<tid=0x969> INFO snappystore: Archiving statistics to "snappyserver.gfs".
17/07/05 15:02:04.313 CST UDP ucast receiver<tid=0x961> INFO snappystore: Received Suspect notification for member(s) [node75<v8>:45932] from node76(5584)<v4>:39162.  Reason=Socket was not closed nicely
17/07/05 15:02:04.314 CST UDP ucast receiver<tid=0x961> INFO snappystore: Received Suspect notification for member(s) [node75<v8>:45932] from node76(5584)<v4>:39162. 
17/07/05 15:02:06.398 CST ReconnectThread<tid=0x935> INFO snappystore: TraceFabricServiceBoot: Stopped Thrift Network interface node76.st/172.20.2.76[1527] successfully. status RECONNECTING
17/07/05 15:02:06.399 CST pool-3-thread-2<tid=0x1> ERROR SnappyUncaughtExceptionHandler: Uncaught exception in thread Thread[pool-3-thread-2,5,main]

When I try to restart the servers:

sbin/snappy-servers.sh start
node75.st: java.lang.IllegalStateException: A SnappyData Server is already running in directory "/snappydata/snappydata-0.9-data/server"
node75.st:  SnappyData Server pid: 30055 status: stopping
node75.st:      at com.gemstone.gemfire.internal.cache.CacheServerLauncher.verifyAndClearStatus(CacheServerLauncher.java:673)
node75.st:      at com.gemstone.gemfire.internal.cache.CacheServerLauncher.start(CacheServerLauncher.java:652)
node75.st:      at com.pivotal.gemfirexd.tools.internal.GfxdServerLauncher.run(GfxdServerLauncher.java:796)
node75.st:      at io.snappydata.tools.ServerLauncher.run(GfxdLauncherOverrides.scala:38)
node75.st:      at io.snappydata.tools.ServerLauncher$.main(GfxdLauncherOverrides.scala:57)
node75.st:      at io.snappydata.tools.ServerLauncher.main(GfxdLauncherOverrides.scala)
node75.st:      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
node75.st:      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
node75.st:      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
node75.st:      at java.lang.reflect.Method.invoke(Method.java:498)
node75.st:      at com.gemstone.gemfire.internal.GemFireUtilLauncher.invoke(GemFireUtilLauncher.java:176)
node75.st:      at com.pivotal.gemfirexd.tools.GfxdUtilLauncher.invoke(GfxdUtilLauncher.java:257)
node75.st:      at io.snappydata.tools.SnappyUtilLauncher.invoke(SnappyUtilLauncher.scala:91)
node75.st:      at io.snappydata.tools.SnappyUtilLauncher$.main(SnappyUtilLauncher.scala:137)
node75.st:      at io.snappydata.tools.SnappyUtilLauncher.main(SnappyUtilLauncher.scala)
node75.st: Error starting server process: A SnappyData Server is already running in directory "/snappydata/snappydata-0.9-data/server"
node75.st:  SnappyData Server pid: 30055 status: stopping
node76.st: java.lang.IllegalStateException: A SnappyData Server is already running in directory "/snappydata/snappydata-0.9-data/server"
node76.st:  SnappyData Server pid: 5027 status: stopping
node76.st:      at com.gemstone.gemfire.internal.cache.CacheServerLauncher.verifyAndClearStatus(CacheServerLauncher.java:673)
node76.st:      at com.gemstone.gemfire.internal.cache.CacheServerLauncher.start(CacheServerLauncher.java:652)
node76.st:      at com.pivotal.gemfirexd.tools.internal.GfxdServerLauncher.run(GfxdServerLauncher.java:796)
node76.st:      at io.snappydata.tools.ServerLauncher.run(GfxdLauncherOverrides.scala:38)
node76.st:      at io.snappydata.tools.ServerLauncher$.main(GfxdLauncherOverrides.scala:57)
node76.st:      at io.snappydata.tools.ServerLauncher.main(GfxdLauncherOverrides.scala)
node76.st:      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
node76.st:      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
node76.st:      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
node76.st:      at java.lang.reflect.Method.invoke(Method.java:498)
node76.st:      at com.gemstone.gemfire.internal.GemFireUtilLauncher.invoke(GemFireUtilLauncher.java:176)
node76.st:      at com.pivotal.gemfirexd.tools.GfxdUtilLauncher.invoke(GfxdUtilLauncher.java:257)
node76.st:      at io.snappydata.tools.SnappyUtilLauncher.invoke(SnappyUtilLauncher.scala:91)
node76.st:      at io.snappydata.tools.SnappyUtilLauncher$.main(SnappyUtilLauncher.scala:137)
node76.st:      at io.snappydata.tools.SnappyUtilLauncher.main(SnappyUtilLauncher.scala)
node76.st: Error starting server process: A SnappyData Server is already running in directory "/snappydata/snappydata-0.9-data/server"
node76.st:  SnappyData Server pid: 5027 status: stopping

When I stop all nodes:

./sbin/snappy-stop-all.sh 
node76.st: The SnappyData Leader on node76.st has stopped.
node76.st: Timeout waiting for SnappyData Server to shutdown on node76.st, status is: SnappyData Server pid: 5027 status: stopping
node75.st: Timeout waiting for SnappyData Server to shutdown on node75.st, status is: SnappyData Server pid: 30055 status: stopping
node74.st: The SnappyData Locator on node74.st has stopped.
node73.st: The SnappyData Locator on node73.st has stopped.

But on node75.st, ps -ef | grep snappy returns nothing. Are there any PID files I should delete manually? What should I do to get the data servers to start up? Thanks.

sumwale commented 7 years ago

Can you provide the full log of the server that failed? Also the lead log.

Since the status showed "stopping", it is taking a while to stop. Run snappy-stop-all again, which should then show "stopped", and after that you can start the servers.
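
Concretely, using the scripts already shown in this thread, that would be:

./sbin/snappy-stop-all.sh       # repeat until the servers report "stopped"
sbin/snappy-servers.sh start    # then restart the data servers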

ghost commented 7 years ago

I ran ./sbin/snappy-stop-all.sh 6 times. It's still the same.

./sbin/snappy-stop-all.sh 
node76.st: Timeout waiting for SnappyData Server to shutdown on node76.st, status is: SnappyData Server pid: 5027 status: stopping
node75.st: Timeout waiting for SnappyData Server to shutdown on node75.st, status is: SnappyData Server pid: 30055 status: stopping
node74.st: The specified working directory (/snappydata/snappydata-0.9-data/locator) on node74.st contains no status file
node73.st: The specified working directory (/snappydata/snappydata-0.9-data/locator) on node73.st contains no status file

BTW, I have uploaded the log files of the lead and the servers. Thanks.

snappyserver_node76.log.txt snappyleader.log.txt

sumwale commented 7 years ago

@niko2014 I think what happened is that the server was forced out of the distributed system because it was unresponsive -- see the server log:

17/07/05 15:00:53.398 CST CloserThread<tid=0x935> ERROR snappystore: Membership service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system.  Reason='did not respond to are-you-dead messages'
17/07/05 15:00:53.408 CST Executor task launch worker-75<tid=0x922> INFO snappystore: Region /SNAPPYSYS_INTERNAL/APP____TEST_APPEND_070502_COLUMN_STORE_ putAll: Key ColumnKey(columnIndex=1,partitionId=103,uuid=00000001-0000-00c6-8ae5-6d680d6a231f) and possibly others failed to put due to

After this, the server tried to automatically reconnect to the distributed system but was then told to stop. There might be a deadlock in such a scenario. It is safe to force-kill the servers at this point using the PIDs that were printed, 30055 and 5027.
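
For example, a minimal sketch using the PIDs reported above (run each on the node that owns the process):

kill -9 30055    # stuck server on node75.st
kill -9 5027     # stuck server on node76.st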

Coming to the main issue of the member becoming unresponsive: the most likely reason we have seen this happen in testing is long GC pauses. We have eliminated nearly all garbage from most of the store insert paths, but unfortunately the Parquet reader generates huge amounts of it. If the table is partitioned, this combines with the garbage from the EXCHANGE step and causes a lot of trouble. If you have lots of headroom in the heap it works out, but otherwise it can cause this issue.
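
If you want to confirm the GC theory first, one option is to enable GC logging on the servers. This is only a sketch: it assumes the server launcher passes -J-prefixed options through to the JVM (as in GemFireXD), and the log path is hypothetical:

server1  -J-XX:+PrintGCDetails -J-XX:+PrintGCDateStamps -J-Xloggc:/snappydata/gc-server.log ...

Long pauses (several seconds) around 15:00 in that log would line up with the "did not respond to are-you-dead messages" disconnect.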

There are two options to avoid this:

a) Use off-heap memory, which is recommended especially when the entire table data will not comfortably fit in memory. Use the "-memory-size" option in conf/servers, which causes all column tables to use off-heap storage (row tables do not have tested off-heap support yet). You will still need a decent heap-size for normal JVM functioning, and because some components like the Spark Parquet reader do not have off-heap support. Recommended is a heap-size of about 8g and a memory-size as large as you can afford. For very wide schemas (> 100 columns) the Parquet reader will need more room, so heap-size should be a minimum of 16g. For your case, something like:

server1  -heap-size=8g -memory-size=20g ...

Keep the total heap-size + memory-size at a maximum of 80-90% of the available physical RAM on the node, since going into swap is much worse than table eviction (for example, the heap-size=8g plus memory-size=20g above totals 28g, which would need a node with roughly 32g or more of physical RAM). Having some amount of swap (minimum 32g) is recommended in any case, to avoid the machine running out of physical RAM due to other processes, JVM overhead, etc. If a node starts running out of physical RAM, the Linux OOM killer will target the biggest process first (i.e. the SnappyData server process).

b) Increase the member-timeout to something like 30s by adding -member-timeout=30000 in all of conf/locators, conf/leads and conf/servers.

You may want to apply b) in any case if quick node-departure detection is not required for your use case; a sketch of the resulting conf files follows.
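
For this cluster, the conf entries might look something like the following. The hostnames and roles are from this thread, while the exact values and per-file layout are assumptions meant only to illustrate both suggestions together:

conf/locators:
node73.st  -member-timeout=30000
node74.st  -member-timeout=30000

conf/leads:
node76.st  -member-timeout=30000

conf/servers:
node75.st  -heap-size=8g -memory-size=20g -member-timeout=30000
node76.st  -heap-size=8g -memory-size=20g -member-timeout=30000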

sumwale commented 7 years ago

@niko2014 Did you manage to try with off-heap or an increased member-timeout? The newer releases increase the default member-timeout to 30s.