broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

PAPI v2: Inside docker /cromwell_root seems to be mounted on non-existent device #4388

Open TedBrookings opened 5 years ago

TedBrookings commented 5 years ago

Backend: I'm testing out PAPI v2 by running on the Cromwell 34 and 36 methods servers.

Problem: inside the Docker container, /cromwell_root appears to be mounted on /dev/disk/by-id/google-local-disk (per df -h, /proc/mounts, and /etc/mtab), but that device does not exist; in fact there is no /dev/disk directory at all.

Background: this task requests a persistent HDD and runs inside a Docker container. The problem does not exist on Cromwell 30 (with the JES backend). /cromwell_root is almost certainly actually mounted at /dev/sdb: that device exists, does not appear to be used anywhere, has the appropriate size (as checked in /sys/block/sdb/size), and is typically what's listed as the filesystem on Cromwell 30.

I know it's weird to even care about that, so to explain, my cromwell monitoring script looks at the block device corresponding to /cromwell_root in order to measure disk IO, which can potentially be a source of problems for some of the SV algorithms we're trying to debug/string together.
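For context, the kind of measurement a monitoring script does can be sketched as follows (a minimal illustration, not the actual monitoring script; the function name and the stat-file argument are mine): the kernel exposes cumulative IO counters in /sys/block/<dev>/stat, where field 3 is sectors read and field 7 is sectors written, and a "sector" in this interface is always 512 bytes regardless of the device's logical block size.

```shell
# Minimal sketch (not the actual monitoring script): report cumulative
# bytes read/written for a block device, given its sysfs stat file.
# Per the kernel's Documentation/block/stat.txt, field 3 is sectors read
# and field 7 is sectors written; a sector here is a fixed 512 bytes.
diskIOBytes() {
    local stat_file=$1   # e.g. /sys/block/sdb/stat
    awk '{printf "read_bytes=%d written_bytes=%d\n", $3 * 512, $7 * 512}' \
        "$stat_file"
}

# Usage (on a machine where sdb exists):
#   diskIOBytes /sys/block/sdb/stat
```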

The .wdl file:

workflow GetSystemInfo {
    call get_system_info_docker
}

task get_system_info_docker {
    command <<<
        echo "**** df -h"
        df -h

        echo
        echo "**** /"
        ls -l /

        echo
        echo "**** /mnt"
        ls -l /mnt

        echo
        echo "**** /dev"
        ls -l /dev

        if [ -d /dev/disk ]; then
            echo
            echo "**** /dev/disk"
            ls /dev/disk
        fi

        echo
        echo "**** /proc/mounts"
        cat /proc/mounts

        echo
        echo "**** /etc/mtab"
        cat /etc/mtab

        echo
        echo "**** /sys/block"
        find -L /sys/block -maxdepth 2

        echo
        echo "**** /sys/block/sdb/size (converted to integer GB)"
        echo "$(($(cat /sys/block/sdb/size) * 512 / 2**30))"

        echo
        echo "**** /sys/devices"
        find -L /sys/devices -maxdepth 3
    >>>

    runtime {
        docker: "talkowski/delly"
        memory: "1.7 GB"
        cpu: "1"
        disks: "local-disk 250 HDD"
        preemptible: 3
    }
}

Snips of relevant output from cromwell 36 (edited for brevity):

**** df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/disk/by-id/google-local-disk
                        245.1G     60.0M    245.0G   0% /cromwell_root
**** /dev
total 0
lrwxrwxrwx    1 root     root            11 Nov 14 21:16 core -> /proc/kcore
lrwxrwxrwx    1 root     root            13 Nov 14 21:16 fd -> /proc/self/fd
crw-rw-rw-    1 root     root        1,   7 Nov 14 21:16 full
drwxrwxrwt    2 root     root            40 Nov 14 21:16 mqueue
crw-rw-rw-    1 root     root        1,   3 Nov 14 21:16 null
lrwxrwxrwx    1 root     root             8 Nov 14 21:16 ptmx -> pts/ptmx
drwxr-xr-x    2 root     root             0 Nov 14 21:16 pts
crw-rw-rw-    1 root     root        1,   8 Nov 14 21:16 random
drwxrwxrwt    2 root     root            40 Nov 14 21:16 shm
lrwxrwxrwx    1 root     root            15 Nov 14 21:16 stderr -> /proc/self/fd/2
lrwxrwxrwx    1 root     root            15 Nov 14 21:16 stdin -> /proc/self/fd/0
lrwxrwxrwx    1 root     root            15 Nov 14 21:16 stdout -> /proc/self/fd/1
crw-rw-rw-    1 root     root        5,   0 Nov 14 21:16 tty
crw-rw-rw-    1 root     root        1,   9 Nov 14 21:16 urandom
crw-rw-rw-    1 root     root        1,   5 Nov 14 21:16 zero

**** /proc/mounts
/dev/disk/by-id/google-local-disk /cromwell_root ext4 rw,relatime,data=ordered 0 0

**** /etc/mtab
/dev/disk/by-id/google-local-disk /cromwell_root ext4 rw,relatime,data=ordered 0 0

**** /sys/block/sdb/size (converted to integer GB)
250

Whereas on Cromwell 30:

**** df -h
Filesystem                Size      Used Available Use% Mounted on
/dev/sdb                246.0G     59.1M    233.4G   0% /cromwell_root
Horneth commented 5 years ago

I forwarded your question to the PAPI team, and here is their response:

This detail is not something that should be counted on in a containerized environment. That said: the /dev/disk/by-id/* system is simply a convenient alias. The underlying block storage doesn't change (eg, /dev/disk/by-id/google-local-disk is a symlink to a block device, in this case, /dev/sdb). So they should be able to continue monitoring if they want, it will just be harder to recover the mapping.
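To illustrate the mapping PAPI describes, here is a small sketch (the helper name is mine): /dev/disk/by-id/* entries are plain symlinks maintained by udev, so canonicalizing the link recovers the underlying device. Note this only works on the host VM; inside the container /dev/disk is absent, which is exactly the problem reported above.

```shell
# Resolve a /dev/disk/by-id/* alias to the underlying block device by
# canonicalizing the symlink. Works on the host VM, where udev maintains
# the by-id tree; inside the container /dev/disk does not exist.
resolveDevice() {
    readlink -f "$1"
}

# On the host VM, hypothetically:
#   resolveDevice /dev/disk/by-id/google-local-disk   # -> /dev/sdb
```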

dinvlad commented 5 years ago

I've run into the same issue, trying to measure disk I/O from monitoring_image. Have you found a solution?

TedBrookings commented 5 years ago

Not really; I have a work-around (if /sys/block/sdb/ is a directory and /dev/sdb is not listed in mtab, use /sys/block/sdb/):

function findBlockDevice() {
    MOUNT_POINT=$1
    FILESYSTEM=$(grep -E "$MOUNT_POINT\s" /proc/self/mounts \
                | awk '{print $1}')
    DEVICE_NAME=$(basename "$FILESYSTEM")
    FS_IN_BLOCK=$(find -L /sys/block/ -mindepth 2 -maxdepth 2 -type d \
                       -name "$DEVICE_NAME")
    if [ -n "$FS_IN_BLOCK" ]; then
        # found path to the filesystem in the block devices. get the
        # block device as the parent dir
        dirname "$FS_IN_BLOCK"
    elif [ -d "/sys/block/$DEVICE_NAME" ]; then
        # the device is itself a block device
        echo "/sys/block/$DEVICE_NAME"
    else
        # couldn't find, possibly mounted by mapper.
        # look for block device that is just the name of the symlinked
        # original file. if not found, echo empty string (no device found)
        BLOCK_DEVICE=$(ls -l "$FILESYSTEM" 2>/dev/null \
                        | cut -d'>' -f2 \
                        | xargs basename 2>/dev/null \
                        || echo)
        if [[ -z "$BLOCK_DEVICE" ]]; then
            1>&2 echo "Unable to find block device for filesystem $FILESYSTEM."
            if [[ -d /sys/block/sdb ]] && ! grep -qE "^/dev/sdb" /etc/mtab; then
                1>&2 echo "Guessing present but unused sdb is the correct block device."
                echo "/sys/block/sdb"
            else
                1>&2 echo "Disk IO will not be monitored."
            fi
        else
            # the symlink resolved: report the corresponding block device
            echo "/sys/block/$BLOCK_DEVICE"
        fi
    fi
}

I am not sure if this is a google VM problem, a docker problem, or a problem with how cromwell specifies volumes to docker; but I took their response to be "we don't care and won't fix it". Fortunately for me the work-around nearly always works for cromwell jobs.

dinvlad commented 5 years ago

@TedBrookings thanks, I was going to just use sdb/c/... (in the disk order for the task). When does it not work?

TedBrookings commented 5 years ago

I'm using this in a fully automated setting, as part of this script: https://github.com/broadinstitute/dsp-scripts/blob/master/cromwell/methods/cromwell_monitoring_script.sh

So for me it won't work if the combination of Google VM / Docker / Cromwell results in the disk not being mounted on /dev/sdb for any reason. This could happen if the user requests disks to be mounted in a specific place (https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/), if the user requests more than one disk but would prefer the second disk be monitored, or if Cromwell starts using /dev/sdb for some other resource and the disks get pushed to /dev/sdc.
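To make those failure modes concrete, here is an illustrative runtime block (the mount point and sizes are hypothetical, following the disks syntax in the Cromwell runtime-attributes docs linked above) in which the monitored data disk is no longer guaranteed to be /dev/sdb:

```wdl
runtime {
    docker: "talkowski/delly"
    memory: "1.7 GB"
    cpu: "1"
    # a custom mount point plus a second disk; with this configuration
    # the task's data disk may no longer land on /dev/sdb
    disks: "/mnt/my_mount 250 HDD, local-disk 50 SSD"
    preemptible: 3
}
```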

I guess it would be more precise to say, "I'm not aware of this happening, but lots of people use this script and I have no idea if any of them would complain to me if it didn't work".

dinvlad commented 5 years ago

Interesting! I may have found another - deterministic - way, based on how it's done in gopsutil:

  1. Find the st_dev device attribute for the mount point in /proc/self/mountinfo file, which is "the most authoritative source to check your mounts" [1], and is always present in modern Linux kernels [2].

    Per [3], st_dev

    Identifies the device containing the file. The st_ino and st_dev, taken together, uniquely identify the file. The st_dev value is not necessarily consistent across reboots or system crashes, however.

    The format of mountinfo, according to [2]:

    3.5 /proc/<pid>/mountinfo - Information about mounts
    --------------------------------------------------------
    
    This file contains lines of the form:
    
    36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
    (1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)
    
    (1) mount ID:  unique identifier of the mount (may be reused after umount)
    (2) parent ID:  ID of parent (or of self for the top of the mount tree)
    (3) major:minor:  value of st_dev for files on filesystem
    (4) root:  root of the mount within the filesystem
    (5) mount point:  mount point relative to the process's root
    (6) mount options:  per mount options
    (7) optional fields:  zero or more fields of the form "tag[:value]"
    (8) separator:  marks the end of the optional fields
    (9) filesystem type:  name of filesystem of the form "type[.subtype]"
    (10) mount source:  filesystem specific information or "none"
    (11) super options:  per super block options

    So for example, inside my task

    grep cromwell_root /proc/self/mountinfo
    
    904 885 8:16 / /cromwell_root rw,relatime master:325 - ext4 /dev/disk/by-id/google-local-disk rw

    8:16 here is st_dev, with

    (3) major:minor: value of st_dev for files on filesystem

  2. Now we look up major minor in /proc/diskstats [4]:

    The /proc/diskstats file displays the I/O statistics
    of block devices. Each line contains the following 14
    fields:
    
    1 - major number
    2 - minor number
    3 - device name

    So e.g.

    awk '$1 == 8 && $2 == 16 {print $3}' /proc/diskstats
    
    sdb

The same approach works for non-/cromwell_root mounts as well. Obviously, we can fully automate this lookup in a script.
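The two-step lookup described above can indeed be scripted; here is a sketch (the function name findDiskDevice and the optional file arguments are mine, added so the logic can be exercised outside of a live /proc):

```shell
# Sketch: map a mount point to its block-device name by joining
# /proc/self/mountinfo (field 3 = major:minor, field 5 = mount point)
# with /proc/diskstats (fields 1-3 = major, minor, device name).
findDiskDevice() {
    local mount_point=$1
    local mountinfo=${2:-/proc/self/mountinfo}
    local diskstats=${3:-/proc/diskstats}
    local majmin major minor

    # step 1: st_dev (major:minor) for the mount point
    majmin=$(awk -v mp="$mount_point" '$5 == mp {print $3; exit}' "$mountinfo")
    if [ -z "$majmin" ]; then
        echo "no mount found for $mount_point" >&2
        return 1
    fi
    major=${majmin%%:*}
    minor=${majmin##*:}

    # step 2: device name with that major/minor pair
    awk -v maj="$major" -v min="$minor" \
        '$1 == maj && $2 == min {print $3; exit}' "$diskstats"
}

# e.g. inside a PAPI v2 task:
#   findDiskDevice /cromwell_root   # -> sdb
```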

[1] https://serverfault.com/a/581180/296112

[2] https://www.kernel.org/doc/Documentation/filesystems/proc.txt

[3] https://www.gnu.org/software/libc/manual/html_node/Attribute-Meanings.html

[4] https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats