Clarify behavioral guarantees for plugin api

rerrabolu commented 1 year ago

The description of the plugin API for external files is insufficient. It does not adequately describe how the API behaves when checkpointing and restoring a process tree that consists of multiple processes.

For example, consider a plugin binding that has multiple files per process - F1, F2, F3, etc. It is not clear as to what call sequence is guaranteed.

Scenario 1: dump_file(F1), dump_file(F2), dump_file(F3)
Scenario 2: dump_file(F2), dump_file(F3), dump_file(F1)
Scenario 3: dump_file(F3), dump_file(F1), dump_file(F2)

It is not clear from existing api description if there is a definite sequence/order in which these will be invoked. Without this baseline it is difficult to determine the list of artifacts that can be obtained and passed between the different calls - dump_file(F1), dump_file(F2), dump_file(F3), etc.

@note: Going forward dump_file(F1) will be referenced as F1() - similarly for others

It is also not clear how checkpointing and restore works when a process tree has more than one process. Will the handling be serialized or concurrent. For example is it legal for checkpoint calls P1::F1() and P2::F1() to be concurrent. @note: Ignore calls to gating api's unpause() and resume()

The documentation of the amdgpu plugin makes a brief reference to the context of dumping and restoring - LINK. However, this does not fully answer the questions raised above. Is it legal for the dumper process to call P1::F1() and P2::F2(), or any permutation of the six calls involved?

In the current scheme for amdgpu plugin the call to unpause() a checkpointed process occurs towards the end of the call to dump_file(), a plugin api. Not sure as to how is this supposed to work when the process being checkpointed has more than one file that should be dumped. Should not this be similar to how resume() works?

I think the plugin API description should clarify these aspects. This will enable implementations to build and cache artifacts that could be used in subsequent calls.

rst0git commented 1 year ago

It is not clear from existing api description if there is a definite sequence/order in which these will be invoked.

To checkpoint opened files, CRIU iterates through an array of file descriptors (FDs) in dump_task_files_seized() and calls dump_one_file() for each FD. The sequence in which these FDs are checkpointed is determined by the order they are added to the array in collect_fds(). In collect_fds() we use readdir() to collect the FDs from /proc/<PID>/fd.

According to the note in man 3 readdir: the order in which filenames are read by successive calls to readdir() depends on the filesystem implementation; it is unlikely that the names will be sorted in any fashion.
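
To illustrate the point about enumeration order, here is a small Python analogy (not CRIU code). `os.listdir()` likewise documents that entries come back in arbitrary order, so a caller that needs a stable order has to sort explicitly. The helper name `list_fds` and its directory argument are hypothetical.

```python
import os


def list_fds(fd_dir):
    """Return the numeric entries of fd_dir in ascending order.

    os.listdir() gives no ordering guarantee, so we sort explicitly
    instead of relying on whatever order the filesystem returns.
    """
    names = os.listdir(fd_dir)
    return sorted(int(name) for name in names if name.isdigit())
```

On Linux this could be called as `list_fds("/proc/self/fd")`. Without the explicit sort, the result may differ between filesystems or kernel versions.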

It is also not clear how checkpointing and restore works when a process tree has more than one process. Will the handling be serialized or concurrent.

The handling is serialized. CRIU iterates through each task using depth-first search. See pstree_item_next(), cr_dump_tasks() and restore_root_task().
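
As a rough illustration of a serialized depth-first walk over a process tree, here is a Python sketch. It is a model only: the `Task` class and `dump_order` function are hypothetical names, and the pre-order visiting order (parent before children) is an assumption of this sketch rather than something stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    pid: int
    children: list = field(default_factory=list)


def dump_order(root):
    """Yield tasks one at a time in pre-order depth-first order."""
    stack = [root]
    while stack:
        task = stack.pop()
        yield task
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(task.children))


tree = Task(1, [Task(2, [Task(4)]), Task(3)])
print([t.pid for t in dump_order(tree)])  # prints [1, 2, 4, 3]
```

Because the generator hands out one task at a time, no two tasks are ever processed concurrently in this model.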

In the current scheme for amdgpu plugin the call to unpause() a checkpointed process occurs towards the end of the call to dump_file(), a plugin api. Not sure as to how is this supposed to work when the process being checkpointed has more than one file that should be dumped. Should not this be similar to how resume() works?

I'm assuming that you are referring to the unpause_process() call at the end of amdgpu_plugin_dump_file(). This functionality was introduced with commit https://github.com/checkpoint-restore/criu/commit/55a5993bc73a6d2e9551f275c78e0907c5dff686 and perhaps @dayatsin-amd might be able to help?

rerrabolu commented 1 year ago

RErrabolu: First of all, let me thank you for taking the time to comment and for pointing out several aspects via links and function names.

It is not clear from existing api description if there is a definite sequence/order in which these will be invoked.

To checkpoint opened files, CRIU iterates through an array of file descriptors (FDs) in dump_task_files_seized() and calls dump_one_file() for each FD. The sequence in which these FDs are checkpointed is determined by the order they are added to the array in collect_fds(). In collect_fds() we use readdir() to collect the FDs from /proc/<PID>/fd.

RErrabolu: Thanks for describing the high-level call sequence that is involved in checkpointing. Per this statement making any assumptions regarding the order of calls is incorrect. A working implementation might break overnight should the order change.
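
One way a plugin could cope with an unspecified call order is to build shared artifacts lazily, on whichever call arrives first. The Python sketch below models that idea only. `DumpSession`, `topology`, and `dump_all` are hypothetical names and not part of the CRIU plugin API, and the "topology" artifact is a placeholder for whatever state the calls need to share.

```python
import itertools


class DumpSession:
    """Per-process state shared by all dump_file() calls (hypothetical)."""

    def __init__(self):
        self.builds = 0        # how many times the shared artifact was built
        self._topology = None  # placeholder for an expensive shared artifact

    def topology(self):
        # Built by whichever call arrives first; later calls reuse it.
        if self._topology is None:
            self.builds += 1
            self._topology = {"devices": ["F1", "F2", "F3"]}
        return self._topology

    def dump_file(self, name):
        topo = self.topology()
        return f"{name}:{len(topo['devices'])}"


def dump_all(order):
    """Dump files in the given order; return the images and the build count."""
    session = DumpSession()
    images = {name: session.dump_file(name) for name in order}
    return images, session.builds


# Every permutation of the three calls yields the same images.
results = [dump_all(p) for p in itertools.permutations(["F1", "F2", "F3"])]
print(all(r == results[0] for r in results))  # prints True
```

The design choice here is that no call assumes it is first or last, so Scenarios 1 to 3 from the original report all produce the same output.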

According to the note in man 3 readdir: the order in which filenames are read by successive calls to readdir() depends on the filesystem implementation; it is unlikely that the names will be sorted in any fashion.

It is also not clear how checkpointing and restore works when a process tree has more than one process. Will the handling be serialized or concurrent.

The handling is serialized. CRIU iterates through each task using depth-first search. See pstree_item_next(), cr_dump_tasks() and restore_root_task().

RErrabolu: Thanks for noting that the CRIU framework employs a scheme that is serialized and depth-first (DFS) when checkpointing a process tree.

RErrabolu: Per my experience restore seems to involve concurrency. Multiple calls, to restore device files, from CRIU framework are in-flight concurrently. It is not clear if restore traces its path from leaf nodes to the root.

In the current scheme for amdgpu plugin the call to unpause() a checkpointed process occurs towards the end of the call to dump_file(), a plugin api. Not sure as to how is this supposed to work when the process being checkpointed has more than one file that should be dumped. Should not this be similar to how resume() works?

I'm assuming that you are referring to the unpause_process() call at the end of amdgpu_plugin_dump_file(). This functionality was introduced with commit 55a5993 and perhaps @dayatsin-amd might be able to help?

RErrabolu: All in all I feel, in my eval, CRIU should allow plugins to specify a rule base which could guide the checkpoint and restore procedures.

rst0git commented 1 year ago

@rerrabolu checkpoint/restore of processes that use GPUs is a relatively new CRIU feature. I also encountered a couple of problems when evaluating it (https://github.com/checkpoint-restore/criu/issues/2248). If you want to add a feature, fix a bug, or implement missing functionality, feel free to do so! Patches are welcome!

avagin commented 1 year ago

RErrabolu: Per my experience restore seems to involve concurrency. Multiple calls, to restore device files, from CRIU framework are in-flight concurrently. It is not clear if restore traces its path from leaf nodes to the root.

You are right. On restore, all tasks are restored concurrently and they are synchronized between stages: https://github.com/checkpoint-restore/criu/blob/5de9040ee758f1fd1a2599b6f800013544c966b6/criu/include/restorer.h#L256
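
The "concurrent, but synchronized between stages" behavior can be modeled with a barrier. The Python sketch below is an analogy only: CRIU restores separate processes, whereas this uses threads, and `restore_tree` and `restore_task` are hypothetical names.

```python
import threading


def restore_tree(n_tasks, n_stages):
    """Run n_tasks workers through n_stages, meeting at a barrier each stage."""
    barrier = threading.Barrier(n_tasks)
    log, lock = [], threading.Lock()

    def restore_task(task_id):
        for stage in range(n_stages):
            with lock:
                log.append((stage, task_id))  # work done in this stage
            barrier.wait()  # nobody enters the next stage until all arrive

    threads = [
        threading.Thread(target=restore_task, args=(i,)) for i in range(n_tasks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Within a stage the order of tasks in the log is not deterministic, but every stage-0 entry precedes every stage-1 entry, which is the property a plugin could rely on in this model.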

File descriptors are restored in open_fd(): https://github.com/checkpoint-restore/criu/blob/5de9040ee758f1fd1a2599b6f800013544c966b6/criu/files.c#L1142

There you can see that a file restore callback can return 1 if the file can't be restored yet because it depends on other files. In such a case, CRIU tries to restore the other files and then does another round to restore the files skipped in the previous iteration: https://github.com/checkpoint-restore/criu/blob/5de9040ee758f1fd1a2599b6f800013544c966b6/criu/files.c#L1233
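
The retry scheme can be sketched as follows. This is a simplified Python model, not the code in files.c: `restore_files`, `try_restore`, and the `deps` table are hypothetical, and the "no progress" check that raises an error is an addition of this sketch.

```python
def restore_files(files, try_restore):
    """Restore files in rounds.

    try_restore(name, restored) returns 0 when the file was restored
    and 1 when it must be retried after other files.
    """
    restored = []
    pending = list(files)
    while pending:
        skipped = []
        for name in pending:
            if try_restore(name, restored) == 1:
                skipped.append(name)  # retry in the next round
            else:
                restored.append(name)
        if len(skipped) == len(pending):
            # No file made progress in this round: give up instead of looping.
            raise RuntimeError(f"unresolvable dependencies: {skipped}")
        pending = skipped
    return restored


# Hypothetical dependency table: F1 can only be restored after F2.
deps = {"F1": ["F2"], "F2": []}


def try_restore(name, restored):
    return 0 if all(d in restored for d in deps[name]) else 1


print(restore_files(["F1", "F2"], try_restore))  # prints ['F2', 'F1']
```

A plugin that returns 1 for a file whose prerequisites are not yet restored does not need to know the enumeration order in advance.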

We use cross-process mutexes and Unix sockets for the required synchronization between processes.

RErrabolu: All in all I feel, in my eval, CRIU should allow plugins to specify a rule base which could guide the checkpoint and restore procedures.

It is unclear what exactly you need here. Please describe in detail the problem you are working on, and we will help to handle it in CRIU. Alternatively, you can propose changes to the plugin interface, and we will discuss them to find the right solution.

dayatsin-amd commented 1 year ago

I'm assuming that you are referring to the unpause_process() call at the end of amdgpu_plugin_dump_file(). This functionality was introduced with commit 55a5993 and perhaps @dayatsin-amd might be able to help?

During dump/checkpoint, the first IOCTL that the plugin does into the amdgpu drivers (PROCESS_INFO) will cause the amdgpu drivers to pause all the queues associated with this process. This is done so that this process is effectively paused/frozen during the checkpoint process. Once the checkpoint is complete (or if the checkpoint fails), we need to do the UNPAUSE ioctl to resume the queues.

rerrabolu commented 1 year ago

In the current design, dumping the device state of a process is accomplished by the following event sequence:

for each proc in procList
    Event 1:
       Pause the process
          Invoke Ioctl on /dev/kfd
    Event 2:
       Dump the process state
          Invoke Ioctl on /dev/kfd
       Unpause the process
          Invoke Ioctl on /dev/kfd
    @note: The call to dump and unpause occurs as part of one call frame.

The operations to dump and unpause are opaque to CRIU framework. CRIU rightfully does not care as to what is being done in these operations.

The current design becomes limiting if the event that dumps needs to handle more than one device { /dev/kfd and /dev/dri/renderD }. This is illustrated by the following event sequence:

for each proc in procList
    Event 1:
       Pause the process
          Invoke Ioctl on /dev/kfd
    Event 2:
       for each device from deviceList
          Dump process state of each device
             Invoke Ioctl on /dev/kfd or /dev/dri/renderD<i>
       @note: Order of enumeration to dump is determined by CRIU
    Event 3:
       Unpause the process
          Invoke Ioctl on /dev/kfd
    @note: The calls to dump and unpause occur as part of different call frames.

How to encode the above event sequence to Checkpoint, in a generic manner, should be thought out.
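
A minimal Python sketch of the three-event sequence, written from the framework's point of view, is shown below. Everything in it is hypothetical: `checkpoint_process`, the `pause`/`dump`/`unpause` hooks, and `RecordingPlugin` are illustrative names, not existing CRIU plugin hooks. The point is only that unpause runs in its own call frame, after all devices are dumped or after a failure.

```python
def checkpoint_process(devices, plugin):
    """Pause once, dump every device, and always unpause afterwards."""
    plugin.pause()                # Event 1
    try:
        for device in devices:    # order is chosen by the framework
            plugin.dump(device)   # Event 2
    finally:
        plugin.unpause()          # Event 3, also runs if a dump fails


class RecordingPlugin:
    """Stand-in plugin that records the calls it receives."""

    def __init__(self, fail_on=None):
        self.calls = []
        self.fail_on = fail_on

    def pause(self):
        self.calls.append("pause")

    def dump(self, device):
        if device == self.fail_on:
            raise RuntimeError(f"dump failed for {device}")
        self.calls.append(f"dump {device}")

    def unpause(self):
        self.calls.append("unpause")


plugin = RecordingPlugin()
checkpoint_process(["renderD", "kfd"], plugin)
print(plugin.calls)  # prints ['pause', 'dump renderD', 'dump kfd', 'unpause']
```

With this shape, the dump hook no longer needs to know whether it is handling the last device of the process.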

@note: I am not sure if CRIU can handle C&R of a process that has state across two or more devices. For example, an application that captures data from a camera (/dev/) and uses a GPU (/dev/GPU) to process it. I do not think it is possible to pause/continue a process on all devices (CPU, GPU, camera, etc.) in an atomic manner.

Clarify behavioral guarantees for plugin api #2277