aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[Fargate] [request]: ECS Fargate (Spot?) no longer allows for file system changes inside container #1474

Open ATLJLawrie opened 3 years ago

ATLJLawrie commented 3 years ago

Tell us about your request
Has the behavior of Fargate 1.4 changed regarding the ephemeral Docker filesystem?

Which service(s) is this request for?
Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Prior to ~Aug. 5, 2021 we were able to ECS exec into our application containers and install packages for troubleshooting or convenience, similar to the installation of vim described here - https://aws.amazon.com/blogs/containers/connecting-to-an-interactive-shell-on-your-containers-running-in-aws-fargate-using-aws-copilot/

We noticed this change in behavior along with failures in our application, as it needed to access /tmp and other locations within the file system to write ephemeral files. To restore write access to those locations we have added explicit Dockerfile "VOLUME" definitions, but we don't have a great solution to extend that to the entire container filesystem. We aren't aware of any changes on our side, and we even witnessed tasks that had been running continuously since before August 5th where the ability to make changes within their file system just suddenly stopped.

Are you currently working around this issue?
Defining Docker volumes for folders that are changed.

Additional context
Primary PID runs as root; forcing "user: root" in the Task Definition made no impact. readonlyRootFilesystem was null; setting it to false made no difference. Region: us-east-1.
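
For anyone hitting the same symptom, a minimal sketch of the Dockerfile-level workaround described above. The base image, paths, and tag are hypothetical; VOLUME simply asks the runtime to mount a writable, task-local volume at each listed path, so writes there keep working even if the rest of the root filesystem does not.

# Sketch only: declare every location the application writes to at runtime
cat > Dockerfile <<'EOF'
FROM debian:bullseye-slim
VOLUME ["/tmp", "/var/run", "/app/cache"]
CMD ["sleep", "infinity"]
EOF

docker build -t myapp:volumes-workaround .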

mreferre commented 3 years ago

@ATLJLawrie You mentioned some level of file system layout mismatch (specifically /tmp)? Does that mean you used to see a /tmp folder and you no longer see it (forcing you to create a Docker volume and mount it there)? What are the "other locations" you are referring to? Can you offer potential steps to reproduce the problem you are alluding to?

ATLJLawrie commented 3 years ago

@mreferre Essentially what we are seeing is that existing folders within a running Fargate task suddenly no longer allow files to be created in them. Previously we could ECS exec into a running container, create files anywhere in the filesystem, and run utilities, e.g. apt-get update && apt-get install vim. On those same tasks/containers that have been continuously running we now see behavior as if they were configured with readonlyRootFilesystem set to true. Whether the flag is undefined (shows null in the JSON view of the Task Definition in the console) or explicitly set to false, it still behaves this way.

# whoami
root
# cd /tmp
# ls -al
total 8
drwxrwxrwt 1 root root 4096 Aug 11 16:42 .
drwxr-xr-x 1 root root 4096 Aug 11 19:08 ..
# touch testfile
touch: cannot touch 'testfile': No such file or directory
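
One quick way to confirm what the task definition actually contains and what the kernel reports from inside the container (the family/revision and container names below are placeholders, not from this issue):

# Check the flag as registered (JMESPath query against the ECS API)
aws ecs describe-task-definition \
  --task-definition my-app:42 \
  --query 'taskDefinition.containerDefinitions[*].[name,readonlyRootFilesystem]' \
  --output table

# From a shell inside the task (e.g. via ECS Exec): is / mounted ro or rw?
grep ' / ' /proc/mounts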

This happened across multiple AWS accounts and required us to redeploy applications that need to write various temp files, with those folder locations explicitly defined either as Task Definition volumes or as VOLUME entries in the Dockerfile.
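
For reference, the task-definition-level variant of the same workaround looks roughly like the fragment below (names are hypothetical; on Fargate a volume with no host configuration is backed by the task's ephemeral storage):

# Hypothetical fragment to merge into a full task definition JSON
cat > taskdef-fragment.json <<'EOF'
{
  "volumes": [
    { "name": "tmp" }
  ],
  "containerDefinitions": [
    {
      "name": "app",
      "mountPoints": [
        { "sourceVolume": "tmp", "containerPath": "/tmp", "readOnly": false }
      ]
    }
  ]
}
EOF
# then: aws ecs register-task-definition --cli-input-json file://full-taskdef.json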

mreferre commented 3 years ago

One more clarification. You have seen this behavior on new tasks you launched after a certain date (not on existing long running tasks that morphed their behavior). Correct?

mreferre commented 3 years ago

Also, consider that this is a repo used to track roadmap requests and not technical issues / bugs. I am happy to help as far as I can but you probably should engage support for tracking this. Are you in a position to open an incident?

ATLJLawrie commented 3 years ago

@mreferre Totally understood re: support; we can and will escalate through those channels. I'm mainly raising it here to see if others have encountered this in the last few weeks, or whether there was some change to normal behavior. For us it occurs on existing containers (running for several weeks) as well as new ones. The existing ones are where it makes the least sense: we had applications that had been writing to the filesystem, and then it just stopped being permitted. All of them were nowhere near the 20GB limit. Effectively, without defining VOLUME or volume mappings we can't make any changes in any Fargate container, regardless of AWS account, so any ability to install tools for troubleshooting is broken. The only other example I found was a year-old Reddit thread https://www.reddit.com/r/aws/comments/hc7ohb/permission_denied_when_writing_to_tmp_on_fargate/ that had no follow-up and isn't exactly the same.
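
A quick way to rule out the 20GB ephemeral storage limit from inside a running task, assuming ECS Exec is enabled on it (cluster, task, and container names below are placeholders):

# Open a shell in the running task
aws ecs execute-command \
  --cluster my-cluster \
  --task 0123456789abcdef0 \
  --container app \
  --interactive \
  --command "/bin/sh"

# Inside the container: free space on the root filesystem, then a write test
df -h /
touch /tmp/write-test && echo "write ok" || echo "write failed"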

komatom commented 2 years ago

@ATLJLawrie I am having similar issues. Sometimes when I push an image to ECR and then start it on ECS (Fargate) I just can't write to the ephemeral storage; for example, startup services can't write their PID files and the container fails to start. If I redo the deploy it sometimes behaves normally and can write to the instance storage (root, not an EFS mount for example). Have you found out why this is happening?
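
When it only happens intermittently at startup, one way to capture evidence is a small entrypoint wrapper that probes write access before starting the real process. This is only a sketch; the paths are arbitrary examples, not anything specific to this report.

#!/bin/sh
# Log which directories are writable, then hand off to the real command
for dir in /tmp /var/run /run; do
  if touch "$dir/.write-probe" 2>/dev/null; then
    echo "writable: $dir"; rm -f "$dir/.write-probe"
  else
    echo "NOT writable: $dir"
  fi
done
exec "$@"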

ATLJLawrie commented 2 years ago

@komatom No, unfortunately no luck. We had to use VOLUME definitions for every location in the file system where our containerized applications make changes or, as you mention, write a PID file. It's very unfortunate, because we try to keep our images lean and only install tools in the ECS Exec shell when they are needed ad hoc (vim, curl, etc.), but because of these issues that no longer works. We debated trying Fargate 1.3, but we need the networking capabilities of 1.4.

My speculative correlation is that it seems to happen when I have a task definition with multiple long-running containers that share a volume, despite the ephemeral volume being marked ReadOnly: false in every MountPoint definition. I say "long-running" because we use the sidecar pattern frequently: containers where a sidecar injects some files into the shared volume and then exits still seem to permit creating files within the running container. However, additional troubleshooting showed this made no actual difference. At this point the only pattern I can see is that my Alpine-based containers are fine but my Debian-based ones don't play nice. However, that exact same Docker image has no issues making changes to the running filesystem when run locally (Docker Desktop for Mac).

So, TL;DR: we still have the issue, still no proper fix, and we're using VOLUME definitions for every single place our app needs to make changes.
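
The local comparison mentioned above can be scripted along these lines (image name is a placeholder); the second run uses Docker's --read-only flag purely to show what the Fargate symptom would look like if the root filesystem really were read-only:

docker run --rm my-image:latest sh -c 'touch /tmp/probe && echo writable'
docker run --rm --read-only my-image:latest sh -c 'touch /tmp/probe || echo "not writable (read-only root)"'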

komatom commented 2 years ago

@ATLJLawrie thank you for the extended reply. It is exactly the same for me; I have contacted AWS technical support - this is super odd. I also use Debian-based images, although it is not happening all the time, which makes it worse. When you use VOLUME definitions, do you just mark the folders as such, or do you also attach an EFS share to those volumes outside of the Dockerfile VOLUME definitions?

Sodki commented 2 years ago

I have the same issue. I've noticed that writes only fail in directories at least three levels deep. For example, writing to /home/user works, but writing to /home/user/app does not.
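
A quick, throwaway check of that depth observation from inside an affected container (the path is just an example):

# Try a write at each directory depth and report where it starts failing
d=/home/user
for sub in "" app app/deeper; do
  target="$d/$sub"
  mkdir -p "$target" 2>/dev/null
  touch "$target/probe" 2>/dev/null && echo "ok: $target" || echo "fail: $target"
done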

ATLJLawrie commented 2 years ago

@komatom I only mark the VOLUME in my Dockerfile. Then, per AWS documentation (and Docker / containerd behavior in general), the host running the container allocates a local temporary volume and binds it into the container at that path.
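
From inside a running task you can see that behaviour in the mount table: paths declared with VOLUME show up as separate mounts rather than as part of the overlay root. A sketch of the check (run via ECS Exec or any shell in the container):

# Overlay root vs. separate volume mounts, and capacity of the /tmp mount
grep -E 'overlay| /tmp ' /proc/mounts
df -h /tmp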

komatom commented 2 years ago

@Sodki it happens for all kinds of directories for me. It is somehow related to the way the image is extracted from ECR onto the ephemeral storage, or, in short, to the overlay file system, but only on the first load of the instance.