Open mokuki082 opened 6 years ago
I have listed the whole OCI runtime spec as a list structure. This is a reference for tracking our implementation progress; look up the official OCI runtime spec for documentation and data structures before implementing anything.
This is basically a copy of the OCI runtime spec, although I've taken out the non-Linux specifications as they are not necessary at the current stage. I wasn't entirely sure about the format or the use case for this documentation, so I have put in everything that I thought could be used for either implementing the runtime or generating test cases.
The OCI specification for standard containers defines:
The goal is to create a container that is portable, content-agnostic, infrastructure-agnostic, with self-describing dependencies.
A filesystem bundle MUST consist of:
- `config.json` file
- the directory referenced by `config.json`'s `root.path` property

Runtimes must support the following operations:
- `state <container-id>`
- `create <container-id> <path-to-bundle>`
  - All properties configured in `config.json` except for `process` MUST be applied.
  - `process.args` MUST NOT be applied until triggered by the `start` operation. The remaining `process` properties MAY be applied by this operation.
  - Runtimes MAY validate `config.json` against this spec, either generically or with respect to the local system capabilities, before creating the container.
- `start <container-id>`
- `kill <container-id> <signal>`
- `delete <container-id>`
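
The ordering constraints between these operations can be sketched as a small state machine. This is only an illustration, not how any real runtime is implemented: the status names `creating`, `created`, `running` and `stopped` are taken from the OCI runtime spec's state definition, while the `Container` and `LifecycleError` names, the internal `deleted` marker, and the assumption that every signal stops the process immediately are made up for the example.

```python
class LifecycleError(Exception):
    """Raised when an operation is not valid for the container's current status."""


class Container:
    """Hypothetical in-memory model of the create/start/kill/delete ordering."""

    def __init__(self, container_id, bundle_path):
        self.id = container_id
        self.bundle = bundle_path
        self.status = "creating"

    def _require(self, allowed, operation):
        # Reject operations that are called out of order.
        if self.status not in allowed:
            raise LifecycleError(
                f"cannot {operation} container {self.id!r} in status {self.status!r}"
            )

    def state(self):
        # Subset of the fields a real `state` operation reports.
        return {"id": self.id, "status": self.status, "bundle": self.bundle}

    def create(self):
        self._require(("creating",), "create")
        # A real runtime applies everything in config.json except process.args here.
        self.status = "created"

    def start(self):
        self._require(("created",), "start")
        # A real runtime executes process.args here.
        self.status = "running"

    def kill(self, signal):
        self._require(("created", "running"), "kill")
        # Simplifying assumption: every signal terminates the process at once.
        self.status = "stopped"

    def delete(self):
        self._require(("stopped",), "delete")
        # "deleted" is not a spec status, only a marker for this sketch.
        self.status = "deleted"
```

Calling `delete()` on a container that is still running raises `LifecycleError`, which is the kind of case the generated tests would want to cover.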
The configuration file contains metadata necessary to implement standard operations against the container. A detailed description (for all OSes) of each field can be found here.
- `root` (object, OPTIONAL) specifies the container's root fs.
  - `path` (string, REQUIRED) specifies the path to the root fs for the container.
    - `path` can be absolute or relative to the bundle path.
    - The value SHOULD be the conventional `rootfs`.
  - `readonly` (bool, OPTIONAL) if true then the root fs MUST be read-only inside the container, defaults to false.
- `mounts` (array of objects, OPTIONAL) specifies additional mounts beyond `root`.
  - Parameters are as documented in `mount(2)`.
  - `destination` (string, REQUIRED) path inside the container. MUST be an absolute path.
  - `source` (string, OPTIONAL) a device name; can also be a directory name or a dummy. Path values are either absolute or relative to the bundle.
  - `options` (array of strings, OPTIONAL) mount options of the filesystem to be used.
    - Supported options are listed in `mount(8)`.
  - The `mount` structure has the following fields:
    - `type` (string, OPTIONAL) the type of the fs to be mounted.
      - Supported types are listed in `/proc/filesystems`.
- `process` (object, OPTIONAL) specifies the container process. This property is REQUIRED when `start` is called.
  - `terminal` (bool, OPTIONAL) specifies whether a terminal is attached to the process, defaults to false.
  - `consoleSize` (object, OPTIONAL) specifies the console size in characters of the terminal.
    - Ignored if `terminal` is set to false.
    - `height` (uint, REQUIRED)
    - `width` (uint, REQUIRED)
  - `cwd` (string, REQUIRED) the working directory set for the executable. MUST be an absolute path.
  - `env` (array of strings, OPTIONAL) with the same semantics as IEEE Std 1003.1-2008 `environ`.
  - `args` (array of strings, REQUIRED) with similar semantics to IEEE Std 1003.1-2008 `execvp`'s argv.
    - At least one entry is REQUIRED, and that entry is used with the same semantics as `execvp`'s file.
  - `rlimits` (array of objects, OPTIONAL) allows setting resource limits for the process.
    - `type` (string, REQUIRED)
      - Valid values are defined in `getrlimit(2)`.
    - `rlim` refers to the status returned by the `getrlimit(3)` call.
    - `soft` (uint64, REQUIRED) the value of the limit enforced for the corresponding resource. `rlim.rlim_cur` MUST match this value.
    - `hard` (uint64, REQUIRED) the ceiling for the soft limit that could be set by an unprivileged process. `rlim.rlim_max` MUST match this value.
      - Only a privileged process (e.g. one with `CAP_SYS_RESOURCE`) can raise a hard limit.
  - `user` (object)
    - `uid` (int, REQUIRED) specifies the user ID in the container namespace.
    - `gid` (int, REQUIRED) specifies the group ID in the container namespace.
    - `additionalGids` (array of ints, OPTIONAL) additional group IDs in the container namespace.
  - `apparmorProfile` (string, OPTIONAL) name of the AppArmor profile for the process.
  - `capabilities` (object, OPTIONAL) as defined in `capabilities(7)`.
    - `effective` (array of strings, OPTIONAL)
    - `bounding` (array of strings, OPTIONAL)
    - `inheritable` (array of strings, OPTIONAL)
    - `permitted` (array of strings, OPTIONAL)
    - `ambient` (array of strings, OPTIONAL)
  - `noNewPrivileges` (bool, OPTIONAL)
  - `oomScoreAdj` (int, OPTIONAL)
    - If `oomScoreAdj` is set, the runtime MUST set `oom_score_adj` to the given value.
    - If it is not set, the runtime MUST NOT change the value of `oom_score_adj`.
  - `selinuxLabel` (string, OPTIONAL) specifies the SELinux label for the process.
- `hostname` (string, OPTIONAL) specifies the container's hostname as seen by processes running inside the container.

The following are under the `linux` (object, OPTIONAL) property.
- The following filesystems SHOULD be made available in each container's filesystem:
  - `/proc` (type proc)
  - `/sys` (type sysfs)
  - `/dev/pts` (type devpts)
  - `/dev/shm` (type tmpfs)
- `namespaces` (array of objects)
  - `type` (string, REQUIRED)
    - One of `pid`, `network`, `mount`, `ipc`, `uts`, `user`, `cgroup`.
  - `path` (string, OPTIONAL) namespace file path.
    - The runtime MUST generate an error if `path` is not associated with a namespace of type `type`.
    - If `path` is not specified, the runtime MUST create a new container namespace of type `type`.
- `uidMappings` (array of objects, OPTIONAL) describes the user namespace uid mappings from host to the container.
  - `containerID` (uint32, REQUIRED) the starting uid in the container.
  - `hostID` (uint32, REQUIRED) the starting uid on the host to be mapped.
  - `size` (uint32, REQUIRED) number of ids to be mapped.
- `gidMappings` (array of objects, OPTIONAL) describes the user namespace gid mappings from host to the container.
  - `containerID` (uint32, REQUIRED) the starting gid in the container.
  - `hostID` (uint32, REQUIRED) the starting gid on the host to be mapped.
  - `size` (uint32, REQUIRED) number of ids to be mapped.
- `devices`
  (array of objects, OPTIONAL) lists devices that MUST be available in the container. The runtime MAY supply them however it likes.
  - `type` (string, REQUIRED) one of `c`, `b`, `u` or `p` (see `mknod(1)`)
  - `path` (string, REQUIRED) full path to the device inside the container.
    - If a file already exists at `path` that does not match the requested device, the runtime MUST generate an error.
  - `major, minor` (int64, REQUIRED unless `type` is `p`) major/minor numbers of the device.
  - `fileMode` (uint32, OPTIONAL) file mode for the device.
  - `uid` (uint32, OPTIONAL) id of the device owner in the container ns.
  - `gid` (uint32, OPTIONAL) id of the device group in the container ns.
  - The same `type`, `major` and `minor` SHOULD NOT be used for multiple devices.
  - In addition, the runtime MUST also supply:
    - `/dev/null`, `/dev/zero`, `/dev/full`, `/dev/random`, `/dev/urandom`, `/dev/tty`
    - `/dev/console` if `terminal` is enabled in the config, set up by bind mounting the pseudoterminal slave to `/dev/console`.
    - `/dev/ptmx` (a bind-mount or symlink of the container's `/dev/pts/ptmx`)
- `cgroupsPath`
  (string, OPTIONAL) path to the cgroups.
  - Runtimes MAY consider certain `cgroupsPath` values to be invalid, and MUST generate an error if this is the case.
- `resources` configures cgroups. Do not specify `resources` unless limits have to be updated.
  - Runtimes MAY attach the container process to additional cgroup controllers beyond those necessary to fulfill the `resources` settings.
  - `devices` (array of objects, OPTIONAL) configures the device whitelist.
    - `allow` (boolean, REQUIRED)
    - `type` (string, OPTIONAL)
    - `major, minor` (int64, OPTIONAL)
    - `access` (string, OPTIONAL)
  - `memory` (object, OPTIONAL) cgroup subsystem `memory`
    - `limit` (int64, OPTIONAL)
    - `reservation` (int64, OPTIONAL)
    - `swap` (int64, OPTIONAL)
    - `kernel` (int64, OPTIONAL)
    - `kernelTCP` (int64, OPTIONAL)
    - `swappiness` (uint64, OPTIONAL)
    - `disableOOMKiller` (bool, OPTIONAL)
  - `cpu` (object, OPTIONAL) configures the `cpu` and `cpusets` subsystems.
    - `shares` (uint64, OPTIONAL)
    - `quota` (int64, OPTIONAL)
    - `period` (uint64, OPTIONAL)
    - `realtimeRuntime` (int64, OPTIONAL)
    - `realtimePeriod` (uint64, OPTIONAL)
    - `cpus` (string, OPTIONAL)
    - `mems` (string, OPTIONAL)
  - `blockIO` (object, OPTIONAL) configures the `blkio` subsystem.
    - `weight` (uint16, OPTIONAL)
    - `leafWeight` (uint16, OPTIONAL)
    - `weightDevice` (array of objects, OPTIONAL)
      - `major, minor` (int64, REQUIRED)
      - `weight` (uint16, OPTIONAL)
      - `leafWeight` (uint16, OPTIONAL)
      - At least one of `weight` or `leafWeight` MUST be given.
    - `throttleReadBpsDevice`, `throttleWriteBpsDevice` (array of objects, OPTIONAL)
      - `major, minor` (int64, REQUIRED)
      - `rate` (uint64, REQUIRED)
    - `throttleReadIOPSDevice`, `throttleWriteIOPSDevice` (array of objects, OPTIONAL)
      - `major, minor` (int64, REQUIRED)
      - `rate` (uint64, REQUIRED)
  - `hugepageLimits` (array of objects, OPTIONAL)
    - `pageSize` (string, REQUIRED)
    - `limit` (uint64, REQUIRED)
  - `network` (object, OPTIONAL) represents the `net_cls` and `net_prio` subsystems.
    - `name` (string, REQUIRED)
    - `priority` (uint32, REQUIRED)
  - `pids` (object, OPTIONAL) represents the `pids` subsystem.
    - `limit` (int64, REQUIRED)
  - `rdma` (object, OPTIONAL)
    - `hcaHandles` (uint32, OPTIONAL)
    - `hcaObjects` (uint32, OPTIONAL)
    - At least one of `hcaHandles` or `hcaObjects` MUST be selected.
- `intelRdt` (object, OPTIONAL) represents the Intel Resource Director Technology.
  - If `intelRdt` is set, the runtime MUST write the container id to the `<container-id>/tasks` file in a mounted `resctrl` pseudo-filesystem. If no `resctrl` filesystem is available, the runtime MUST generate an error.
  - `l3CacheSchema` (string, OPTIONAL)
    - If `l3CacheSchema` is set, the runtime MUST write the value to the `schemata` file in the `<container-id>` directory described under `intelRdt`.
    - Otherwise, the runtime MUST NOT write to `schemata` files in any `resctrl` pseudo-filesystems.
- `sysctl`
  (object, OPTIONAL) allows kernel parameters to be modified at runtime.
- `seccomp` (object, OPTIONAL)
  - `defaultAction` (string, REQUIRED)
  - `architectures` (array of strings, OPTIONAL)
  - `syscalls` (array of objects, OPTIONAL)
    - `names` (array of strings, REQUIRED)
    - `action` (string, REQUIRED)
    - `args` (array of objects, OPTIONAL)
      - `index` (uint, REQUIRED)
      - `value` (uint64, REQUIRED)
      - `valueTwo` (uint64, OPTIONAL)
      - `op` (string, REQUIRED)
- `rootfsPropagation` (string, OPTIONAL) sets the rootfs's mount propagation. The value is either "slave", "private", "shared" or "unbindable".
- `maskedPaths` (array of strings, OPTIONAL) will mask over the provided paths inside the container so that they cannot be read.
- `readonlyPaths` (array of strings, OPTIONAL) sets the provided paths as read-only inside the container.
- `mountLabel` (string, OPTIONAL) will set the SELinux context for the mounts in the container.
- `hooks` (object, OPTIONAL) MAY contain any of the following properties:
  - `prestart` (array of objects, OPTIONAL)
    - MUST be called after the `start` operation is called but before the user-specified program command is executed.
    - `path` (string, REQUIRED)
    - `args` (array of strings, OPTIONAL) with the same semantics as IEEE Std 1003.1-2008 execv's argv.
    - `env` (array of strings, OPTIONAL) with the same semantics as IEEE Std 1003.1-2008's environ.
    - `timeout` (int, OPTIONAL) is the number of seconds before aborting the hook. If set, timeout MUST be greater than zero.
  - `poststart` (array of objects, OPTIONAL) Entries in the array have the same schema.
    - MUST be called after the user-specified process is executed but before the `start` operation returns.
  - `poststop` (array of objects, OPTIONAL) Entries in the array have the same schema.
    - MUST be called after the container is deleted but before the `delete` operation returns.
- `annotations` (object, OPTIONAL) contains arbitrary metadata for the container.
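
Since runtimes MAY validate `config.json` before creating the container, a few of the MUST/REQUIRED rules listed above could be turned into test cases along these lines. This is a minimal sketch that checks only a handful of fields (`root.path`, `mounts[].destination`, `process.cwd`, `process.args` and hook `path`/`timeout`); the function name `validate_config` and the error wording are invented for the example and are not part of the spec or of runc.

```python
import posixpath


def validate_config(config):
    """Return a list of rule violations found in a parsed config.json dict."""
    errors = []

    # root is OPTIONAL, but root.path is REQUIRED once root is present.
    root = config.get("root")
    if root is not None and not isinstance(root.get("path"), str):
        errors.append("root.path is REQUIRED and must be a string")

    # Every mount destination MUST be an absolute path inside the container.
    for i, mount in enumerate(config.get("mounts", [])):
        destination = mount.get("destination")
        if not isinstance(destination, str) or not posixpath.isabs(destination):
            errors.append(f"mounts[{i}].destination MUST be an absolute path")

    # process is OPTIONAL at create time; check cwd and args when it is given.
    process = config.get("process")
    if process is not None:
        cwd = process.get("cwd")
        if not isinstance(cwd, str) or not posixpath.isabs(cwd):
            errors.append("process.cwd MUST be an absolute path")
        if not process.get("args"):
            errors.append("process.args is REQUIRED and needs at least one entry")

    # All three hook arrays share the same entry schema.
    hooks = config.get("hooks", {})
    for stage in ("prestart", "poststart", "poststop"):
        for i, hook in enumerate(hooks.get(stage, [])):
            if not isinstance(hook.get("path"), str):
                errors.append(f"hooks.{stage}[{i}].path is REQUIRED")
            timeout = hook.get("timeout")
            if timeout is not None and timeout <= 0:
                errors.append(f"hooks.{stage}[{i}].timeout MUST be greater than zero")

    return errors
```

An empty list means none of the checked rules were violated; it says nothing about the fields this sketch does not look at.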
Some of the file descriptors MAY be redirected to `/dev/null` even though they are open.

- [ ] When creating the container, runtimes MUST create the following symlinks if the source file exists after processing mounts:

| Source | Destination |
|---|---|
| /proc/self/fd | /dev/fd |
| /proc/self/fd/0 | /dev/stdin |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |
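
The symlink rule in the table could be implemented roughly as below. This is a sketch under some assumptions: the `create_dev_symlinks` name is made up, the existence check is done from outside the container against the unpacked `rootfs` directory (a real runtime would more likely do this from inside the container's mount namespace after processing mounts), and an already existing destination is simply left alone.

```python
import os

# (source, destination) pairs from the table, as seen from inside the container.
DEV_SYMLINKS = [
    ("/proc/self/fd", "/dev/fd"),
    ("/proc/self/fd/0", "/dev/stdin"),
    ("/proc/self/fd/1", "/dev/stdout"),
    ("/proc/self/fd/2", "/dev/stderr"),
]


def create_dev_symlinks(rootfs):
    """Create the default /dev symlinks under rootfs; return the destinations created."""
    created = []
    for source, destination in DEV_SYMLINKS:
        source_in_rootfs = os.path.join(rootfs, source.lstrip("/"))
        link_path = os.path.join(rootfs, destination.lstrip("/"))
        # Only create the link if the source exists and the destination is free.
        if not os.path.lexists(source_in_rootfs) or os.path.lexists(link_path):
            continue
        os.makedirs(os.path.dirname(link_path), exist_ok=True)
        # The link target is the in-container path, not the host path.
        os.symlink(source, link_path)
        created.append(destination)
    return created
```

Keeping the pairs in a table-like constant makes it easy to generate one test case per row.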
A filesystem bundle is the "container image artifact" that we are specifying in our artifact specification, right?
I'll review this after the R&D and hiring work is done. In the meantime, which ones do you think we need to implement? I'm assuming the filesystem bundle first, whereas a lot of the other runtime specs are already handled by runc. But for certain management requirements of containers/automatons in Emergence (which we need to spec out), we'll need to use the above specs to gather information about the containers.
Remember that, due to QoS constraints, we'll eventually derive resource requirements. So I need a list of which resources can be constrained by the container runtime, which can be adjusted dynamically, and which require redeployment. I have some notes about this already that I can send to you.
A filesystem bundle should be the result from unpacking the artifact.
I have some concerns regarding some of the bundle contents such as mount points and resource constraints. These can be affected by matters outside of the Artifact declaration (for example from StateSpec and the orchestrator).
We should list all OCI container runtime specifications, and what we need to implement to be in line with the standard.