Intel-bigdata / SSM

Smart Storage Management for Big Data, a comprehensive hot/cold data optimized solution
Apache License 2.0

Recovery timeout occurred in dispatched state cmdlet #2215

Closed lipppppp closed 3 years ago

lipppppp commented 3 years ago

I shut down the active Smart Server while cmdlets were executing in HA mode. The rule is 'file:every 5s | path matches "/compress/*" | compress', and there are two files under the '/compress' path. The standby server was then activated, but the two cmdlets did not recover normally. Cmdlet 446, which was executed on the standby server, timed out. Cmdlet 445, which was executed on an agent, finished, but its state is still DISPATCHED. Both of these failed tasks show that the two files were compressed.

lipppppp commented 3 years ago

The same issue exists for EC and other functions. We hope cmdlets can be recovered normally when an HA failover happens, and that temperature information is not lost.

PHILO-HE commented 3 years ago

Please see PR #2217.

PHILO-HE commented 3 years ago

Recovery here means that, for an unfinished action loaded from the metastore, SSM tracks its status like a normal action and reports an unsuccessful status for any action that times out (that is, one that has sent no report for a long time). For many actions this causes no problem. In the rule case, even if such an action was actually executed successfully, the timeout report has no impact, because a new action will be created to try again. For some other actions it is also acceptable to just give a timeout report and let the user check the actual execution status. If SSM needs the actual status, for example to take over data temperature, we need to add support for inferring the action status from the file state or other information.
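To make the recovery behavior described above concrete, here is a minimal sketch in Java. It is not SSM's actual code: the class `RecoveredActionTracker` and all of its method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow. It only illustrates the idea of tracking recovered actions and reporting a timeout when no status report arrives for too long.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names do not come from the SSM code base.
class RecoveredActionTracker {
    private final long timeoutMs;
    // Action id -> time (ms) of the last status report seen for that action.
    private final Map<Long, Long> lastReportMs = new HashMap<>();

    RecoveredActionTracker(long timeoutMs) {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        this.timeoutMs = timeoutMs;
    }

    // Start tracking an unfinished action that was loaded from the metastore.
    void track(long actionId, long nowMs) {
        lastReportMs.put(actionId, nowMs);
    }

    // Handle a status report, exactly as for a normal (non-recovered) action.
    void onStatusReport(long actionId, long nowMs, boolean finished) {
        if (!lastReportMs.containsKey(actionId)) {
            return; // not a tracked action
        }
        if (finished) {
            lastReportMs.remove(actionId);
        } else {
            lastReportMs.put(actionId, nowMs);
        }
    }

    // Return the ids of actions with no report for longer than the timeout,
    // and stop tracking them. The caller would mark these as unsuccessful.
    List<Long> expire(long nowMs) {
        List<Long> timedOut = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = lastReportMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (nowMs - entry.getValue() > timeoutMs) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }
}
```

Note that in this sketch a timed-out action is only reported as unsuccessful; nothing checks whether the underlying file operation actually completed, which matches the limitation discussed above.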

lipppppp commented 3 years ago

Thanks for your support. This is a good solution. I think it would be even better if the timeout could be set as a configuration item.

PHILO-HE commented 3 years ago

Fixed by #2217. As for timeout configuration, you can consider making TIMEOUT_MULTIPLIER configurable on your side if needed.

We have a maximum status report interval (500 ms by default) that is configurable. Since the timeout depends on the report interval, we multiply this maximum report interval by TIMEOUT_MULTIPLIER to get the timeout value. From my perspective, the current code should work for most general cases.
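For anyone who wants to make the multiplier configurable locally, the calculation described above could look roughly like the following Java sketch. Everything here is an assumption for illustration: the class name, both property keys, and the default multiplier of 100 are placeholders and are not taken from SSM. Only the 500 ms default report interval comes from this thread.

```java
import java.util.Properties;

// Hypothetical illustration only; keys and defaults are not SSM's real ones.
class RecoveryTimeoutSketch {
    // Placeholder property names, invented for this example.
    static final String REPORT_INTERVAL_KEY = "example.status.report.max.interval.ms";
    static final String MULTIPLIER_KEY = "example.recovery.timeout.multiplier";

    // 500 ms is the default maximum report interval mentioned in this thread.
    static final long DEFAULT_REPORT_INTERVAL_MS = 500L;
    // Placeholder default; the real TIMEOUT_MULTIPLIER value is not stated here.
    static final int DEFAULT_MULTIPLIER = 100;

    // timeout = max report interval * multiplier, both read from configuration.
    static long timeoutMs(Properties conf) {
        long intervalMs = Long.parseLong(conf.getProperty(
            REPORT_INTERVAL_KEY, String.valueOf(DEFAULT_REPORT_INTERVAL_MS)));
        int multiplier = Integer.parseInt(conf.getProperty(
            MULTIPLIER_KEY, String.valueOf(DEFAULT_MULTIPLIER)));
        if (intervalMs <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException(
                "report interval and multiplier must be positive");
        }
        // multiplyExact throws ArithmeticException instead of overflowing silently.
        return Math.multiplyExact(intervalMs, (long) multiplier);
    }
}
```

With the default interval and the multiplier property set to "20", `timeoutMs` returns 500 * 20 = 10000 ms. Deriving the timeout from the report interval, rather than configuring it as an absolute value, keeps the two settings consistent when the interval is changed.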