aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

CDK needs garbage collection for assets in the cdk.out directory #2869

Open nathanpeck opened 5 years ago

nathanpeck commented 5 years ago

Each time I run a CDK deploy I get a new asset in the assets directory, and they seem to accumulate forever. Each asset folder is around 100 MB for me, so this quickly adds up to many GB of data. Here is a screenshot of it accumulating assets again after the last time I cleaned it out manually.

[Screenshot: Screen Shot 2019-06-13 at 2 26 47 PM]

Ideally I would like a CDK configuration that would cause CDK to automatically garbage collect older asset files it no longer needs so I don't have to do it manually.

RomainMuller commented 5 years ago

Duplicated by #3749

SomayaB commented 5 years ago

Seems to be related to #1332

SomayaB commented 5 years ago

Hi @nathanpeck, thanks for submitting a feature request! This seems like a reasonable and helpful ask. We will look into this and someone will update this issue when there is movement.

eladb commented 4 years ago

I think that if users do cdk deploy we should actually emit the cdk.out directory under /tmp instead of the project directory. When users deploy, cdk.out is just an intermediate artifact rather than a build artifact.

P.S. it should be something like /tmp/cdk.out.xxxx where xxxx is the hash of the project path (in order to allow multiple projects to co-exist on the same machine).
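
To make the naming scheme concrete, here is a minimal sketch of deriving such a directory name. The helper name `tmpOutdirFor`, the choice of SHA-256, and the 8-character hash length are assumptions for illustration; none of this is part of the CDK.

```typescript
import { createHash } from "crypto";
import * as os from "os";
import * as path from "path";

// Hypothetical helper (not a CDK API): maps a project path to a stable
// directory under the system temp dir, named cdk.out.<hash of the path>.
function tmpOutdirFor(projectPath: string, hashLength = 8): string {
  // Resolve first so that "." and the absolute path hash to the same value.
  const absolute = path.resolve(projectPath);
  const hash = createHash("sha256")
    .update(absolute)
    .digest("hex")
    .slice(0, hashLength);
  return path.join(os.tmpdir(), `cdk.out.${hash}`);
}

// Two different projects get two different directories; the same project
// always gets the same one.
console.log(tmpOutdirFor(process.cwd()));
```

The result could be passed as the `outdir` of an App (for example `new App({ outdir: tmpOutdirFor(process.cwd()) })`), so that several projects can coexist on the same machine.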

nathanpeck commented 4 years ago

@eladb I do worry that would reduce the visibility of the folder. Particularly in cases where I have multiple projects and for some reason my stacks aren't generating as expected I would hate to have to figure out which of the outputs inside my tmp folder is the right one.

While it is tempting to piggyback on the existing tmp cleanup behavior, I don't think it would be good for users of CDK, because it would end up being a hidden cache behavior that would be harder to clear when needed.

eladb commented 4 years ago

If you do cdk synth, output will still go to ./cdk.out, which will give you visibility into exactly what's going to be used during deployment.

I am not sure I understand why you think intermediate (temporary) build artifacts are not a good use case for /tmp. Isn't that what /tmp is all about?

nathanpeck commented 4 years ago

@eladb I don't think of the build artifacts as temporary.

For example, if I compile with GCC I would expect my C++ files to turn into object files in a local path, not in the /tmp folder.

Or if I compile TypeScript I expect the resulting JavaScript to end up in the local directory, not in /tmp.

From that perspective I see CDK to CloudFormation / assets as just another type of transformation, where I expect the resulting product to be local, not remotely cached.

I'm not strictly opinionated on this, but it just feels somewhat strange to me if cdk.out is located in a different folder outside of my project.

plumdog commented 4 years ago

I found this issue from a different direction - I have some tests for my CDK code, and each time I run them it builds a new asset directory and puts it in /tmp, a new one for each test case. The assets for me happened to be 100s of MB, and soon my /tmp device was full.

I think I would expect that - by default - assets for test runs were deleted after the test run had completed, regardless of where they are stored.
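
Until something like that exists, a test suite can approximate it by hand. The sketch below assumes a hypothetical helper, `withTempOutdir`, that creates a fresh directory, passes it to the test body, and removes it afterwards. It uses only Node's `fs`, `os` and `path` modules; whether every staged asset honours a custom `outdir` in a given CDK version is not confirmed in this thread.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical test helper (not a CDK API): creates a fresh output directory,
// hands it to the callback, and removes it afterwards even if the callback throws.
function withTempOutdir<T>(fn: (outdir: string) => T): T {
  const outdir = fs.mkdtempSync(path.join(os.tmpdir(), "cdk-test-"));
  try {
    return fn(outdir);
  } finally {
    fs.rmSync(outdir, { recursive: true, force: true });
  }
}

// Stand-in for a test body. In a real test the callback would build the app,
// e.g. with `new App({ outdir })`, and run its assertions.
const usedDir = withTempOutdir((outdir) => {
  fs.writeFileSync(path.join(outdir, "asset.txt"), "placeholder");
  return outdir;
});
console.log(fs.existsSync(usedDir)); // false
```

The helper is synchronous: if the callback returns a Promise, the directory is removed before that Promise settles, so an async test would need an `async` variant that awaits the callback inside the `try`.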

0xdevalias commented 4 years ago

In the interim, is it OK to just manually clear out anything in this folder (or even the whole folder)? I've left it building up for now as I wasn't sure if the files were required somewhere down the line, for cdk diff support, etc.

skinny85 commented 4 years ago

No, there should be no danger in removing cdk.out locally - it will be re-created next time CDK is executed.

cloud-context commented 4 years ago

Is it possible to change where these cdk.outxxxxx folders are created when running unit tests?

Our current plan is to have a process that cleans up the /tmp folder after the tests are run. The problem is that this is on our build agent, which doesn't have a huge /tmp directory and potentially has multiple builds running at once.

eladb commented 4 years ago

Is it possible to change where these cdk.outxxxxx folders are created when running unit tests?

You should be able to specify the output directory when you create an App:

const app = new App({ outdir: '/tmp/foo' });
const stack = new MyTestStack(app, 'test');
// ...

cloud-context commented 4 years ago

I tried that setting for the app but it only seems to work for a synth command. When I run the CDK unit tests, there are multiple cdk.out directories created in the /tmp folder. I would like to change this directory if possible.

cjjenkinson commented 4 years ago

Shorter-term solution with bash:

find . -name 'asset.*.zip' -print0 | xargs -0 rm

I run this at the end of deployments.

cynicaljoy commented 3 years ago

I think that if users do cdk deploy we should actually emit the cdk.out directory under /tmp instead of the project directory. When users deploy, cdk.out is just an intermediate artifact rather than a build artifact.

P.S. it should be something like /tmp/cdk.out.xxxx where xxxx is the hash of the project path (in order to allow multiple projects to co-exist on the same machine).

I agree that cdk.out should be moved to /tmp. I'd vote that the folder path be more verbose though, e.g. /tmp/aws-cdk/{projectHash}/cdk.out/ - we use the /tmp directory for a variety of things and selfishly I don't want dozens of items in the root of /tmp for CDK alone.

In order to provide easy access to the cdk.out directory, you could either:

lprhodes commented 3 years ago

I have another use case for more control over the asset directories.

I'm using CDK with SAM CLI and I'm trying to use tsc-watch to re-run cdk synth after detecting changes to TypeScript files. Because a new asset directory is created each time, SAM needs to be restarted.

The workaround I'm about to implement is to get the existing asset directory name, delete it, then rename the new asset directory to the old one after cdk synth. There's the possibility that SAM will keep a pointer to the original directory which is moved to trash but we shall see!

lprhodes commented 3 years ago

Nice, it works: https://gist.github.com/lprhodes/89e4436df3d73ac26cf5e89a6fc8ec0a

jbvsmo commented 3 years ago

Please don't move cdk.out to /tmp, as people who never reboot will have that thing blowing up as well. Also it is not safe when deploying multiple projects, since the erase solution above would remove anything inside /tmp.

I had literally over 100 asset.XXXXX directories each weighing 85MB, and since those have tons of small files it took a few minutes to delete the 9GB of data.

Why isn't all that just deleted right after deploy (or before deploy, so we keep the last one)? If I wanted to keep the data, I could explicitly ask for it.

leantorres73 commented 3 years ago

I think cdk synth should clean the folder and create it again

acomagu commented 3 years ago

How about automatic cleanup based on the creation date?

For example, configure cdk.json like:

{
  "app": "bin/synth",
  "autoCleanOutdatedAssetsBefore": "3days" // The assets created before 3 days are automatically deleted(on running `cdk synth` or etc.)
}

mjsztainbok commented 2 years ago

This is problematic with CDK tests, as every test run creates a new directory in /tmp, and when writing tests it fills up the hard disk quite quickly.

jtnz commented 2 years ago

I've run out of space (aka memory on Linux) in /tmp many times because of the /tmp/cdk.out* dirs.

Never had a problem around cdk.out in the project root, but I haven't been doing much cdk synth locally (we use pipelines).

dougperkes commented 2 years ago

+1 to finding a solution for this. I just had to clean up ~70GB of files from my cdk.out directory in my project.

lazinessdevs commented 2 years ago

Why not just delete the cdk.out folder before each synth or deploy?

skinny85 commented 2 years ago

Why not just delete the cdk.out folder before each synth or deploy?

Because all Assets would have to be re-staged on every synth that way (the ZIP files re-zipped, etc.), making it even slower than it is now.

mrgrain commented 2 years ago

I've run out of space (aka memory on Linux) in /tmp many times because of the /tmp/cdk.out* dirs.

I'm surprised by this. Is there no OS level garbage collection for /tmp in your distribution?

jankatins commented 2 years ago

I'm surprised by this. Is there no OS level garbage collection for /tmp in your distribution?

/tmp is a ramdisk (at least on my Linux systems), so it is gone after a restart/logout. But if you restart only once in a blue moon, running out of space will happen...

mrgrain commented 2 years ago

I'm surprised by this. Is there no OS level garbage collection for /tmp in your distribution?

/tmp is a ramdisk (at least on my Linux systems), so it is gone after a restart/logout. But if you restart only once in a blue moon, running out of space will happen...

Thanks for clarifying this. 👍🏻

ryanwilliams83 commented 2 years ago

I'm using C# and the DockerImageFunction construct and I just stumbled across 45GB of assets in cdk.out.

My Program.cs now has the following:

    public static void Main(string[] args)
    {
        if (Directory.Exists(@"cdk.out"))
        {
            Console.Error.WriteLine(@"Erasing cdk.out/");
            Directory.Delete(@"cdk.out", true);
            Console.Error.WriteLine(@"Erased cdk.out/");

            Console.Error.WriteLine(@"Creating cdk.out/");
            Directory.CreateDirectory(@"cdk.out");
            Console.Error.WriteLine(@"Created cdk.out/");
        }

        var app = new App();
        ...

nathanpeck commented 2 years ago

Be warned that if you delete your cdk.out folder every time, it will make CDK much slower, because CDK will not be able to reuse previously prepared assets and will have to prepare them from scratch each time. Ideally you have some process that only cleans up asset files older than a specific cutoff date, or that runs once the size gets over a threshold. That way your day-to-day usage of CDK will stay fast and you'll stop accumulating GB of data.
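
A cutoff-based cleanup along those lines could look like the sketch below. The function name `pruneOldAssets` is hypothetical and this is not a CDK feature. It assumes that asset entries in the output directory are named with an `asset.` prefix, as in the comments above, and it uses modification time as a rough proxy for "no longer needed".

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical cleanup helper (not a CDK API). Removes entries in `outdir`
// whose names start with "asset." and whose modification time is more than
// `maxAgeMs` milliseconds before `now`. Returns the names it removed.
function pruneOldAssets(
  outdir: string,
  maxAgeMs: number,
  now: number = Date.now()
): string[] {
  const removed: string[] = [];
  if (!fs.existsSync(outdir)) return removed;

  for (const name of fs.readdirSync(outdir)) {
    // Leave templates, manifests and anything else that is not an asset alone.
    if (!name.startsWith("asset.")) continue;

    const full = path.join(outdir, name);
    const ageMs = now - fs.statSync(full).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.rmSync(full, { recursive: true, force: true });
      removed.push(name);
    }
  }
  return removed;
}

// Example call (commented out so that loading this file deletes nothing):
// pruneOldAssets("cdk.out", 3 * 24 * 60 * 60 * 1000);
```

Modification time does not say whether the current app still references an asset. Going by the comments above, removing one that is still in use should only mean it has to be prepared again on the next synth.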

wz2b commented 1 year ago

I'm not sure of the issues hierarchy here, but everyone should probably be aware of a parallel discussion going on in https://github.com/aws/aws-cdk-rfcs/issues/64 (opened in 2018).

I feel like clearing out cdk.out had better be an okay thing to do, because I build from multiple development locations, so they aren't going to be in sync depending on whether I'm working from home or my office.

Deleting things out of the staging bucket is a little scarier to me. Issues related to scaling and rollback have been raised, but I am not enough of an expert to know whether or not those are legitimate concerns.

I think it should be okay to clear out the staging bucket after you successfully deploy, but I'm not confident enough to try it on a production project. The biggest item in the staging bucket looks like it might be part of the cdk itself (maybe put there by cdk bootstrap?)

I think all this means two things:

dmeehan1968 commented 1 year ago

I work on my project in a Dropbox folder, and regularly use xattr -w com.dropbox.ignored 1 node_modules to prevent that directory being synced to Dropbox. I do the same with cdk.out, so any process that deletes the folder also removes the extended attribute and can lead to the files syncing to Dropbox without me realising (until I run out of Dropbox space).

The ability to move the artefacts to a directory outside the current working directory/tree (and outside of Dropbox) is ideal, and I can always create a soft link for convenience from the cwd which isn't synced.

Perpetual growth of the cdk.out directory is, IMHO, just lazy design. I appreciate that there are intermediate assets that might add extra cost to repeated synth/deploy cycles and these should be documented.

integralla commented 1 year ago

I'll add one more suggestion to the pile...

I'd like the CDK Toolkit to provide a clean command that would serve as a standardized way to clean up the local resources that are created by running other toolkit commands such as synth.

With a clean command in place, developers can add a process to an appropriate phase of their build life cycle, based on their specific project needs. For example, with a JVM project using Apache Maven, the exec-maven-plugin could be used to execute the command (I do something similar today with a shell script).

Of course, the templates provided for use with the init command could also provide a sensible default.

j-murata commented 1 year ago

My CDK project is an npm package, and I utilize npm pre scripts to remove the cdk.out directory before executing the cdk command.

package.json

```json
{
  "scripts": {
    "cdk": "cdk",
    "precdk": "shx rm -rf cdk.out"
  }
}
```

> _I use [shx](https://github.com/shelljs/shx) to make it work cross-platform._

Then run npm scripts as follows:

$ npm run cdk -- diff
$ npm run cdk -- deploy

If the environment in which the cdk command is executed is limited, the easiest solution may be to define a shell alias for cdk.

I hope this is of some help.

huantbui commented 1 year ago

I have another use case for more control over the asset directories.

I'm using CDK with SAM CLI and I'm trying to use tsc-watch to re-run cdk synth after detecting changes to TypeScript files. Because a new asset directory is created each time, SAM needs to be restarted.

The workaround I'm about to implement is to get the existing asset directory name, delete it, then rename the new asset directory to the old one after cdk synth. There's the possibility that SAM will keep a pointer to the original directory which is moved to trash but we shall see!

@lprhodes I figured out a solution for this cdk.out/asset.* hash folder. Since aws-cdk exposes NodejsFunctionProps.bundling.commandHooks, you can have it generate a utility sh script that rebuilds the Lambda code without re-running the CDK every time, which is time consuming.

sample code:

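afterBundling(inputDir: string, outputDir: string): string[] {
  // `join` comes from Node's "path" module; `fileName` is the handler's .ts
  // entry file name and is assumed to be defined in the enclosing scope.
  const outFile = join(outputDir, "index.js");
  const scriptPath = join(inputDir, "..", ".scripts");
  const shFile = fileName.replace(".ts", ".sh");
  return [
    `mkdir -p ${scriptPath}`,
    // Write a one-line script that runs esbuild in watch mode, targeting outFile.
    `echo esbuild ${inputDir}/${fileName} --outfile=${outFile} --watch --bundle --target=node18 --platform=node > ${scriptPath}/${shFile}`,
  ];
},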

And then in my package.json > scripts, I have "watch:lambda": "sh .scripts/<file_name>.sh"

When you run that script, esbuild runs in watch mode, recompiles your changes and writes the output to the cdk.out/asset.* folder path (thanks to the commandHooks outputDir).

Hope that helps! I was able to code my lambdas in TypeScript and re-run the lambda without spending so much time on cdk re-runs.

Resources: