chanzuckerberg / miniwdl

Workflow Description Language developer tools & local runner

Expose statistics about additional resource usage from containers #672

Open adamnovak opened 8 months ago

adamnovak commented 8 months ago

When the TaskContainer runs a command not as a child process of the Python process (for example, if the command ends up inside a Docker container), it would be useful if the TaskContainer also kept track of the CPU/memory/IO usage of the command.

It looks like the Docker implementation would need to get the container ID out of the Swarm Service, and then stream and aggregate container stats using https://docker-py.readthedocs.io/en/stable/containers.html?highlight=stats#docker.models.containers.Container.stats
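For reference, a minimal sketch of that approach with docker-py (the service name here is hypothetical, and in miniwdl the swarm service handle would come from the TaskContainer itself; the stats fields shown are the usual ones the daemon reports, but their exact layout can vary by host):

```python
import docker

client = docker.from_env()

# Hypothetical lookup by name; the real code would already hold the service object.
service = client.services.get("miniwdl-task-service")
tasks = service.tasks(filters={"desired-state": "running"})
container_id = tasks[0]["Status"]["ContainerStatus"]["ContainerID"]
container = client.containers.get(container_id)

# stream=True with decode=True yields one decoded stats sample roughly per second
for sample in container.stats(stream=True, decode=True):
    cpu_ns = sample["cpu_stats"]["cpu_usage"]["total_usage"]
    mem_bytes = sample["memory_stats"].get("usage", 0)
    # blkio layout differs between cgroup v1 and v2 hosts and may be empty
    blkio = sample.get("blkio_stats", {}).get("io_service_bytes_recursive") or []
    io_bytes = sum(entry["value"] for entry in blkio)
    print(f"cpu={cpu_ns}ns mem={mem_bytes}B blkio={io_bytes}B")
```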

For Singularity, I think the contained process would still show up as a child of the Python process, so there wouldn't be any additional resource usage to account for. But MiniWDL might still want to do its own stat collection for its child processes: it wouldn't make sense to collect stats only for Docker and report nothing to the user otherwise, and if MiniWDL is reporting to the user it would want to cover more than just Docker.
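A rough sketch of what child-process collection could look like, assuming psutil is available (the aggregation and field names are illustrative, not anything miniwdl currently does):

```python
import psutil

def child_process_usage():
    """Aggregate CPU time, RSS, and (where available) I/O counters
    across all live descendants of the current process."""
    totals = {"cpu_seconds": 0.0, "rss_bytes": 0, "read_bytes": 0, "write_bytes": 0}
    for child in psutil.Process().children(recursive=True):
        try:
            with child.oneshot():
                cpu = child.cpu_times()
                totals["cpu_seconds"] += cpu.user + cpu.system
                totals["rss_bytes"] += child.memory_info().rss
                io = child.io_counters()  # not available on all platforms
                totals["read_bytes"] += io.read_bytes
                totals["write_bytes"] += io.write_bytes
        except (psutil.NoSuchProcess, psutil.AccessDenied, AttributeError):
            continue  # process exited or counters unavailable; skip it
    return totals
```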

Toil would like to have access to these sorts of statistics to report back to the user at the end of the workflow. At UCSC we're especially struggling with high IO loads on our cluster's shared filesystem, so being able to monitor IO via the Docker daemon (which collects statistics on it) would be very useful.

mlin commented 8 months ago

@adamnovak I think the cleanest way to do this is using the cgroups resource counters that are usually available inside each container. There is an old example of this; unfortunately, the ongoing transition from cgroups v1 to v2 means it isn't universally compatible. I need to spend some time with GPT-4 to make it detect the available cgroups version and read the counters accordingly.
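For illustration, a best-effort sketch of detecting the cgroup version from inside the container and reading CPU/memory counters (this assumes the container's own cgroup is mounted at /sys/fs/cgroup, which is typical but not guaranteed on every host):

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def read_cgroup_usage():
    """Read CPU and memory usage for the current cgroup, handling both the
    unified (v2) and legacy (v1) hierarchies."""
    if (CGROUP_ROOT / "cgroup.controllers").exists():
        # cgroup v2: usage_usec in cpu.stat, memory.current in bytes
        cpu_usec = 0
        for line in (CGROUP_ROOT / "cpu.stat").read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "usage_usec":
                cpu_usec = int(value)
        mem_bytes = int((CGROUP_ROOT / "memory.current").read_text())
        return {"cpu_seconds": cpu_usec / 1e6, "memory_bytes": mem_bytes}
    else:
        # cgroup v1: per-controller hierarchies, counters in nanoseconds/bytes
        cpu_ns = int((CGROUP_ROOT / "cpuacct" / "cpuacct.usage").read_text())
        mem_bytes = int((CGROUP_ROOT / "memory" / "memory.usage_in_bytes").read_text())
        return {"cpu_seconds": cpu_ns / 1e9, "memory_bytes": mem_bytes}
```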

adamnovak commented 8 months ago

I guess that approach of monitoring from within the container would work. I think the Docker daemon does some polling of its own to tabulate I/O statistics that the kernel doesn't track directly for the cgroup, but monitoring from inside the container would still be able to collect the other stats.

mlin commented 8 months ago

cgroups v2 does have counters for the block I/O devices! My hope with this approach is that it should work on any docker host (including AWS etc.) that doesn't go out of its way to block the cgroups counters. Maybe it will even work for Singularity although I'm fuzzier on that.
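As a sketch of what reading those counters could look like under cgroup v2 (again assuming the unified hierarchy is mounted at /sys/fs/cgroup inside the container; io.stat only covers block devices the kernel attributes I/O to):

```python
from pathlib import Path

def read_cgroup_v2_io():
    """Sum read/write byte counters across block devices from cgroup v2 io.stat."""
    totals = {"rbytes": 0, "wbytes": 0}
    io_stat = Path("/sys/fs/cgroup/io.stat")
    if not io_stat.exists():
        return totals
    for line in io_stat.read_text().splitlines():
        # each line: "<major>:<minor> rbytes=... wbytes=... rios=... wios=..."
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key in totals:
                totals[key] += int(value)
    return totals
```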

mlin commented 7 months ago

@adamnovak Mixed update on this,

I updated https://github.com/chanzuckerberg/miniwdl/tree/main/examples/plugin_log_task_usage to work with cgroup v1 and v2. It only logs CPU and memory usage, but it works both locally and on AWS Batch, and probably on other platforms too.

I did some work to track disk I/O too, but unfortunately backed it out because I found the cgroup I/O counters don't include network filesystems, which was your main interest and a big one for me too. So I don't have a great answer right now... my concern with querying the local docker daemon is how to generalize it to other settings.