akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0

stop the subsequent flush jobs if a task is accidentally terminated #123

Closed akiradeveloper closed 8 years ago

akiradeveloper commented 8 years ago

I am not sure whether this case really needs to be considered.

When a rambuf becomes full, a flush_job is created and queued onto flush_wq. This is a singlethread_workqueue, which is actually an ordered_workqueue; that is, as the documentation below describes, it executes tasks one at a time in the order they were queued.

```c
/**
 * alloc_ordered_workqueue - allocate an ordered workqueue
 * @fmt: printf format for the name of the workqueue
 * @flags: WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful)
 * @args...: args for @fmt
 *
 * Allocate an ordered workqueue.  An ordered workqueue executes at
 * most one work item at any given time in the queued order.  They are
 * implemented as unbound workqueues with @max_active of one.
 *
 * RETURNS:
 * Pointer to the allocated workqueue on success, %NULL on failure.
 */
#define alloc_ordered_workqueue(fmt, flags, args...)                    \
        alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)

#define create_singlethread_workqueue(name)                             \
        alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name)
```

But what if a task is terminated for some truly unknown reason? The behavior would be one of:

  1. The workqueue ignores the killed task and continues with the remaining tasks.
  2. The workqueue aborts itself.

If it is the first one, the caching device may become corrupted, because the killed task's side effects are missing.

I think we need at least an assertion at the beginning of flush_proc that flush_job->id == last_flushed_segment_id + 1 holds. Otherwise we should kill the worker.

akiradeveloper commented 8 years ago

If the catastrophe occurs,

akiradeveloper commented 8 years ago

The subsequent tasks should be killed, because we must not ack the deferred flush requests. Otherwise the application would believe that the missing data is persistent.

akiradeveloper commented 8 years ago

Retrying the flush job is too difficult. Since the probability of this case is close to zero, I don't want to spend much effort on it.

akiradeveloper commented 8 years ago

What if we keep only two rambuffers and always wait until the other rambuffer has been flushed before queuing a new job? This way, at most a single flush job is in the wq at any time, with practically no performance regression.

And if the currently executing job is terminated, requeue it?

akiradeveloper commented 8 years ago

I will add a BUG_ON assertion to flush_proc in 2.2.2

akiradeveloper commented 8 years ago

If this happens, it causes #111. I want to test whether this case actually happens in @onlyjob 's environment.

I have a fix for this issue in mind. But I don't know how this happens (memory is broken and accessing it kills the thread?)