docker-archive / for-aws


Anomalous size of root file system on C5/M5 instances #148

Closed kinghuang closed 5 years ago

kinghuang commented 6 years ago

Expected behaviour

Docker for AWS uses Moby Linux AMIs. The Moby Linux 18.03 AMI (ami-ccf668b4) added ENA and NVMe support, making it usable with current generation EC2 instances (#128). However, there appears to be a problem with how the AMI handles the backing EBS volume on current generation instances. The size of the root file system seen in the instance is dramatically smaller than the size of the underlying EBS volume.

As a baseline, here is what's shown for a t2.micro instance with a 20 GB EBS volume.

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.7G    262.0M     18.4G   1% /

Actual behavior

A c5.large instance with a 48 GB EBS volume shows 1.8 GB.

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                   1.8G      1.4G    467.3M  75% /

An m5.large instance with a 20 GB EBS volume shows a size of 3.7 GB.

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                   3.7G    302.5M      3.5G   8% /

For some unknown reason, the full capacity of the root EBS volume isn't available. As a result, Docker nodes very quickly run out of disk space.

Information

The current Docker for AWS template doesn't have entries for c5 and m5 instance types (#146). They can be added manually. For the purposes of this issue, it's more convenient to just directly launch EC2 instances with the Moby Linux 18.03 AMI and select c5/m5 instance types.

Steps to reproduce the behavior

From the EC2 Management Console:

  1. Click Launch Instance.
  2. Under Community AMIs, search for and select Moby Linux 18.03.0-ce-aws1 stable (ami-ccf668b4).
  3. Choose an m5 or c5 instance type.
  4. Configure instance details as appropriate.
  5. For the root volume, specify a reasonable size like 20 GB. I've been using GP2 as the volume type.
  6. Finish configuring the instance and launch it.
  7. After the instance launches, connect to it via SSH (i.e., into the shell-aws container).
  8. Run df -h /. Observe the size of the root filesystem.
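
For anyone who wants to script the check in step 8, here is a minimal sketch. The function name fs_size_ok and the 90% threshold are my own assumptions, not part of Docker for AWS. It reads `df -kP` output on stdin and compares the reported size against the volume size you requested.

```shell
# fs_size_ok EXPECTED_GIB
# Hypothetical helper: reads `df -kP` output on stdin and succeeds when the
# reported filesystem size is at least 90% of EXPECTED_GIB.
# The 90% threshold is an arbitrary allowance for filesystem overhead.
# Exit status: 0 = size looks right, 1 = too small, 2 = no data row found.
fs_size_ok() {
    awk -v expected_gib="$1" '
        NR == 2 {
            found = 1
            size_kib = $2                               # column 2 is the size in 1K blocks
            min_kib = expected_gib * 1024 * 1024 * 0.9  # 90% of the expected size, in KiB
            status = (size_kib >= min_kib) ? 0 : 1
            exit status
        }
        END { if (!found) exit 2 }
    '
}
```

Running `df -kP / | fs_size_ok 20` should succeed on an instance like the t2.micro baseline and fail on an instance like the m5.large that reports 3.7 GB.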
kinghuang commented 6 years ago

Here are a couple more tests: c4.large and r4.large with 20 GB volumes. Both seem normal.

c4.large

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.7G    262.0M     18.4G   1% /

r4.large

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.7G    262.0M     18.4G   1% /
kinghuang commented 6 years ago

System log from an m5.large instance showing 3.7 GB for a 20 GB volume.

https://gist.github.com/kinghuang/9842ee461a6ebe9b66e47c0e1f3a6eb1

~ $ df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                   3.7G    302.5M      3.5G   8% /
kinghuang commented 6 years ago

This seems to be the problematic bit that differs on m5/c5 instances.

* Configuring host block device ... * ERROR: automount failed to start

https://gist.github.com/kinghuang/9842ee461a6ebe9b66e47c0e1f3a6eb1#file-system-log-L392

FrenchBen commented 6 years ago

Thanks for the clear steps to replicate this, @kinghuang. Another good way to see this is to install lsblk via apk --update add util-linux. The output on a c5/m5 machine will be:

/ # lsblk
/ #

Any other instance will be:

/ # lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvdb    202:16   0  100G  0 disk
└─xvdb1 202:17   0  100G  0 part /var
/ #

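Turns out Amazon exposes the disk at a completely different location on these instances (EBS volumes show up as NVMe devices such as /dev/nvme0n1 rather than /dev/xvd*), which means that we fail to see it, and thus fail to mount it 😔 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html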

I'll update our support for NVMe in order to get this resolved.
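
To illustrate the naming difference, here is a rough sketch of a lookup that accepts both schemes. This is not the actual fix in the AMI. The function name list_ebs_disks is hypothetical, and the directory argument exists only so it can be tried against a fake tree. Note that on instance types with local NVMe storage, instance store disks would match the same pattern, so a real implementation would need to tell them apart.

```shell
# list_ebs_disks [SYS_BLOCK_DIR]
# Hypothetical helper: prints the names of whole-disk block devices that
# follow either the Xen naming scheme (xvdb) or the NVMe scheme (nvme0n1).
# SYS_BLOCK_DIR defaults to /sys/block.
list_ebs_disks() {
    sys_block="${1:-/sys/block}"
    for path in "$sys_block"/*; do
        [ -e "$path" ] || continue      # skip the literal pattern if nothing matched
        name="${path##*/}"              # strip the directory part
        case "$name" in
            xvd[a-z]*)         echo "$name" ;;  # older instance types, e.g. xvdb
            nvme[0-9]*n[0-9]*) echo "$name" ;;  # c5/m5 and similar, e.g. nvme0n1
        esac
    done
}
```

A script that only looks for xvd* names would print nothing on a c5/m5 instance, which matches the empty lsblk output above.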

HosseinAgha commented 6 years ago

@FrenchBen Is this fixed?
Can we deploy a stack using m5/c5 instances without this issue now?
Sorry for double-checking, but since the code for the docker-for-aws template is not in this repository, we cannot tell when you make changes.
I strongly suggest that you put the latest CloudFormation template file in this repository.

kinghuang commented 6 years ago

I briefly tried the new Moby Linux 18.03.0-ce-aws2 image (ami-3260064a in us-west-2) a few days ago and it seems to be fixed. Haven't tested it extensively yet, though.

jwitko commented 6 years ago

I tried 18.03.1-ce-aws1 (outside of Docker for AWS) yesterday on c5.2xlarge instances, and it does not work. The EBS volume is created and attached to the EC2 instance, but the Docker service never comes up and the volume fails to mount. Here are the relevant error messages from the Docker service:

Aug 13 16:48:25 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:48:25Z\" level=error msg=\"could not fetch volume: Volume Not Found\" name=CloudStorTest-1-vol operation=get " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:25 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:48:25Z\" level=info msg=\"Volume does not exist. Create fresh EBS\" name=CloudStorTest-1-vol operation=createEBS options=map[backing:relocatable ebstype:gp2 size:25] " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:48:45Z\" level=info msg=\"Volume creation in new AZ succeeded: {" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" AvailabilityZone: \"us-east-1b\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" CreateTime: 2018-08-13 20:48:25.366 +0000 UTC," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Encrypted: false," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Iops: 100," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Size: 25," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" SnapshotId: \"\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" State: \"available\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Tags: [{" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Key: \"StackID\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Value: \"d41d8cd98f00b204e9800998ecf8427e\"" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" },{" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Key: \"CloudstorVolumeName\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" Value: \"CloudStorTest-1-vol\"" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" }]," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" VolumeId: \"vol-05abf79a8e6e6b4ea\"," plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg=" VolumeType: \"gp2\"" plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:48:45 dockerd[9408]:  level=error msg="}\" name=CloudStorTest-1-vol operation=createNewEBS options=map[backing:relocatable ebstype:gp2 size:25] " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:50:50 dockerd[9408]:  level=error msg="00a6b0526771de63fe2ee4fabe9141c11b465f1b3f62d0838a3e8ddc6e7a8c56 cleanup: failed to delete container from containerd: no such container"
Aug 13 16:52:51 dockerd[9408]:  level=error msg="00a6b0526771de63fe2ee4fabe9141c11b465f1b3f62d0838a3e8ddc6e7a8c56 cleanup: failed to delete container from containerd: no such container"
Aug 13 16:54:52 dockerd[9408]:  level=error msg="00a6b0526771de63fe2ee4fabe9141c11b465f1b3f62d0838a3e8ddc6e7a8c56 cleanup: failed to delete container from containerd: no such container"
Aug 13 16:56:53 dockerd[9408]:  level=error msg="00a6b0526771de63fe2ee4fabe9141c11b465f1b3f62d0838a3e8ddc6e7a8c56 cleanup: failed to delete container from containerd: no such container"
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"Failed to attach volume: Volume never attached to Instance\" name=CloudStorTest-1-vol operation=mountEBS " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"error mounting volume: Volume never attached to Instance\" name=CloudStorTest-1-vol operation=mount " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"failed to probe volume FS: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mountEBS " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"error mounting volume: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mount " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"failed to probe volume FS: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mountEBS " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"error mounting volume: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mount " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"failed to probe volume FS: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mountEBS " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9
Aug 13 16:58:51 dockerd[9408]:  level=error msg="time=\"2018-08-13T20:58:51Z\" level=error msg=\"error mounting volume: failed to open device to probe ext4: open /dev/xvdf: no such file or directory\" name=CloudStorTest-1-vol operation=mount " plugin=d281abf36c42252e979b4eccd5f3aa2c29175e89adf78ac930492d45f1a3efc9

The errors at the end may be side effects of me eventually pressing Ctrl+C to get out of the docker service create command when it failed.

kinghuang commented 5 years ago

@jwitko That's a different problem. This issue is about the root block device for the host itself, not EBS volumes created by Cloudstor. This specific issue is fixed. However, Cloudstor needs a similar update to handle the different device paths used by current generation instances.
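
As a sketch of what such a lookup could look like (this is not Cloudstor's code): on NVMe-based instances the EBS volume ID is reported as the device's serial number, without the hyphen. The function name find_nvme_for_volume is hypothetical, and the sysfs path /sys/block/nvme*/device/serial is an assumption about the kernel's layout.

```shell
# find_nvme_for_volume VOLUME_ID [SYS_BLOCK_DIR]
# Hypothetical helper: prints the /dev path of the NVMe disk whose serial
# number matches the given EBS volume ID. Returns 1 if none matches.
# SYS_BLOCK_DIR defaults to /sys/block.
find_nvme_for_volume() {
    # The serial carries the volume ID without the hyphen (vol-0123 -> vol0123).
    wanted=$(printf '%s\n' "$1" | sed 's/-//g')
    sys_block="${2:-/sys/block}"
    for serial_file in "$sys_block"/nvme*/device/serial; do
        [ -r "$serial_file" ] || continue
        # Strip padding and the trailing newline before comparing.
        serial=$(tr -d ' \n' < "$serial_file")
        if [ "$serial" = "$wanted" ]; then
            dev_dir="${serial_file%/device/serial}"
            echo "/dev/${dev_dir##*/}"
            return 0
        fi
    done
    return 1
}
```

With something like this, a plugin could resolve a volume such as vol-05abf79a8e6e6b4ea to its actual device node instead of assuming a fixed path like /dev/xvdf.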

I'm going to close this one. The Cloudstor EBS problem is covered by #157.

Shyamk17 commented 5 years ago

@kinghuang

Hi

I recently changed my AWS instance types to r5.4xlarge and m5.2xlarge. After modifying the instance types, I see the / file system reported as about 98% full. When I checked the / file system, I didn't find any files consuming that much space. The NVMe and ENA modules are installed and loaded. Can you please help me identify and fix the issue? Thanks

-Shyam