Optimize fault recovery for JVM OOM

hengyoush commented 1 year ago

Issue Description

Describe what happened (or what feature you want)

Fault recovery for JVM OOM experiments, especially heap OOM, currently has some problems. The only way to recover is to issue the destroy command, but after fault injection the JVM runs a large number of Full GCs (this also happens with wild-mode=false), and the sandbox can no longer receive and process commands properly. In addition, the sandbox's Jetty thread does not set an uncaughtExceptionHandler, so handling a command triggers an OOM and the thread exits. At that point the only way to recover is to restart the JVM.
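To illustrate the uncaughtExceptionHandler point, here is a minimal sketch in plain Java, not sandbox or Jetty code. The class name HandlerDemo and the method runAndCapture are made up for this example, and the OutOfMemoryError is thrown by hand rather than caused by a real heap exhaustion. It only shows that a handler lets the process observe the error that killed a worker thread; the handler does not revive the thread, so it would not fix the recovery problem on its own.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of chaosblade or jvm-sandbox.
class HandlerDemo {
    // Runs the task on a fresh thread and returns the Throwable that
    // terminated it, or null if the task finished normally.
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(task, "command-worker");
        // Without this handler the error is only printed to stderr by the
        // default handler and the thread exits.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join(); // wait until the worker has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }
}
```

Calling HandlerDemo.runAndCapture with a task that throws new OutOfMemoryError("simulated") returns that error to the caller, which could then log it or trigger cleanup.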

Describe what you expected to happen

The fault can be recovered automatically, without a restart.

So far I have come up with two possible optimizations:

1. Add a timeout action parameter to the JVM OOM scenario. In practice this means passing the blade command's timeout parameter through to the sandbox (chaosblade currently handles timeout specially and does not pass it to the sandbox).
2. Start an additional thread that releases memory periodically. This still depends on the destroy command, however, so it cannot guarantee recovery.

How to reproduce it (as minimally and precisely as possible)

Tell us your environment

Anything else we need to know?

hengyoush commented 1 year ago

If this optimization is wanted, I can do the work.

binbin0325 commented 1 year ago

For option 1, could you provide a more detailed design?

hengyoush commented 1 year ago

For option 1, could you provide a more detailed design?

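1. Add an action flag, timeout, to the JVM OOM scenario.
2. Remove the timeout filtering in the chaosblade command-line tool (this step lets timeout reach the Java agent).
3. When the chaos agent parses the HTTP request, put timeout only into the actionModel and not into the matcherModel, so that a timeout entry in the matcherModel cannot cause matching to fail (concretely: add "timeout" to assistantFlag in com.alibaba.chaosblade.exec.service.handler.ModelParser).
4. In JvmOomExecutor, when fault injection starts, start one more thread in addition to the injection thread. It sleeps for the timeout duration and then calls the stop method to end the fault injection.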
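The last step of the design could look roughly like the sketch below. It is not the real JvmOomExecutor: the class TimedOomInjector, its method names, and the chunkBytes and maxChunks parameters are all assumptions made for illustration. The sketch caps how much memory it retains so it is safe to run, whereas a real heap OOM experiment would keep allocating. A run counter makes a leftover timer from an earlier run a no-op, so a manual stop followed by a new start is not cut short by the old timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JvmOomExecutor; names and signatures are illustrative only.
class TimedOomInjector {
    private final List<byte[]> retained = new ArrayList<>(); // memory held by the experiment
    private boolean running;
    private long generation; // identifies the current run

    // Starts injection; timeoutMillis <= 0 means no automatic recovery.
    synchronized void start(long timeoutMillis, int chunkBytes, int maxChunks) {
        if (running) return;
        running = true;
        final long run = ++generation;

        Thread injector = new Thread(() -> fill(run, chunkBytes, maxChunks), "oom-injector");
        injector.setDaemon(true);
        injector.start();

        if (timeoutMillis > 0) {
            // Watchdog: sleep for the timeout, then stop this run only.
            Thread watchdog = new Thread(() -> {
                try {
                    Thread.sleep(timeoutMillis);
                    stopRun(run);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "oom-timeout-watchdog");
            watchdog.setDaemon(true);
            watchdog.start();
        }
    }

    // Injection loop: keeps allocating while its run is still active.
    private void fill(long run, int chunkBytes, int maxChunks) {
        try {
            while (addChunk(run, chunkBytes, maxChunks)) {
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns false once the given run has been stopped or superseded.
    private synchronized boolean addChunk(long run, int chunkBytes, int maxChunks) {
        if (!running || run != generation) return false;
        if (retained.size() < maxChunks) retained.add(new byte[chunkBytes]);
        return true;
    }

    // Manual stop, as the destroy command would trigger.
    synchronized void stop() {
        stopRun(generation);
    }

    private synchronized void stopRun(long run) {
        if (!running || run != generation) return; // stale timer or already stopped
        running = false;
        retained.clear(); // drop references so the GC can reclaim the memory
    }

    synchronized boolean isRunning() { return running; }

    synchronized int retainedChunks() { return retained.size(); }

    // Polls until the injection has stopped or maxWaitMillis elapses; returns true if stopped.
    boolean awaitStopped(long maxWaitMillis) {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (isRunning() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return !isRunning();
    }
}
```

For example, new TimedOomInjector().start(200, 1024, 20) begins retaining memory and releases it on its own after about 200 ms, with no destroy command involved. One caveat of this approach: the watchdog thread lives in the same JVM, so under a real heap OOM it may itself be slowed down by Full GC pauses.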

chaosblade-io / chaosblade