bridgecrewio / checkov

Prevent cloud misconfigurations and find vulnerabilities during build-time in infrastructure as code, container images and open source packages with Checkov by Bridgecrew.
https://www.checkov.io/
Apache License 2.0
7.1k stars 1.12k forks

Very slow terraform scan on medium sized repository #1010

Closed andir closed 3 years ago

andir commented 3 years ago

Describe the bug

I've just tried checkov on a medium-sized repository that contains ~70 modules and around 20 Terraform workspaces. Running checkov against the entire repository takes about 1 hour and 50 minutes on a current-gen desktop CPU. In contrast, something like tfsec only takes on the order of ~20s for the same code. My understanding is that both tools are primarily busy evaluating the logic of the Terraform code (substituting variables, instantiating modules, ...) and that applying each rule set is the cheaper part of the execution. Correct me if I am wrong.

cloc says the following about my Terraform folder:

```
$ cloc terraform/

github.com/AlDanial/cloc v 1.88  T=2.25 s (416.3 files/s, 35823.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
HCL                            709          12897           5047          50407
…
-------------------------------------------------------------------------------
SUM:                           935          14249           5268          60935
-------------------------------------------------------------------------------
```

Running checkov against individual modules takes somewhere between ~5 and 30s, which is totally fine given their complexity.

Having done some light profiling of the application it appears as if the (costly?) parsing of each HCL block in the Terraform files is done over and over again for each instance of a module.

I slapped an `@lru_cache(maxsize=1024)` decorator onto the `parse_var_blocks` function, which resulted in a 13% reduction of execution time. Making the cache unbounded had more influence on the performance, but not to the point where the runtime is anywhere near what I would feel comfortable with in CI.

Going any further from here seems to require deeper knowledge of the code base.

```patch
commit a4606b31d29a1f5d4f392032ebc6e75f0d7ae0a2
Author: Andreas Rammhold
Date:   Fri Mar 19 17:09:20 2021 +0100

    Cache calls to find_var_blocks

    While using checkov on a larger terraform repository the runtime on all
    the files was about 1 hour and 50 minutes. As that large repository is
    not a good test case I trimmed it down to a smaller part of the repo
    that only took about 12.736s ± 0.153s before this change and with this
    change is at 11.196s ± 0.195s, which is a nice ~13% speedup in runtime.

diff --git a/checkov/terraform/parser_utils.py b/checkov/terraform/parser_utils.py
index b672c24e..50be5e27 100644
--- a/checkov/terraform/parser_utils.py
+++ b/checkov/terraform/parser_utils.py
@@ -4,6 +4,8 @@ from dataclasses import dataclass
 from enum import Enum, IntEnum
 from typing import Any, Dict, List, Optional
 
+from functools import lru_cache
+
 import hcl2
 
@@ -48,6 +50,7 @@ class ParserMode(Enum):
         return str(self.value)
 
 
+@lru_cache(maxsize=1024)
 def find_var_blocks(value: str) -> List[VarBlockMatch]:
     """
     Find and return all the var blocks within a given string. Order is important and may contain portions of
```

Profile results after applying the above caching patch:

Profile results before applying the above caching patch:

Looking at the output of `strace -e openat bin/checkov ....` I can see that checkov is reading some files as often as four times within one run.
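A quick way to quantify those repeated reads is to tally the file paths in strace's `openat` output; the sample lines below stand in for a real `strace -f -e trace=openat` run, and the file names are purely illustrative:

```shell
# Count how often each .tf file appears in strace 'openat' output.
# In practice you would pipe the real strace log in instead of printf.
printf 'openat(AT_FDCWD, "main.tf", O_RDONLY)\nopenat(AT_FDCWD, "main.tf", O_RDONLY)\nopenat(AT_FDCWD, "vars.tf", O_RDONLY)\n' \
  | grep -o '"[^"]*\.tf"' | sort | uniq -c | sort -rn
```

Files opened more than once show up at the top of the list with their open counts.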

To Reproduce

Steps to reproduce the behavior:

  1. Checkout a larger terraform repository
  2. run checkov against the directory
  3. wait

Expected behavior

I would expect a runtime on the order of seconds or minutes, not tens of minutes or almost two hours.


schosterbarak commented 3 years ago

@andir thank you for reporting this. does it show the same performance when using the parameter `--framework terraform` ?

andir commented 3 years ago

> @andir thank you for reporting this. does it show the same performance when using the parameter `--framework terraform` ?

Yeah, I must have forgotten that detail. It behaves the same from what I can tell.

andir commented 3 years ago

I did some hacking on the code base on the parts that were highlighted in the last profiling runs. One of the bigger changes was getting rid of jmespath in the Terraform parser as it was consuming unreasonable amounts of time for the tasks it was given.
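The flavor of that kind of change (the data and key names below are invented for illustration, not the actual checkov code) is replacing a string-path query, which has to be re-parsed and re-evaluated on every call, with a direct dictionary traversal:

```python
from typing import Any, Dict, Optional

# Toy stand-in for a parsed HCL structure; keys are illustrative only.
resource_config: Dict[str, Any] = {
    "resource": {"aws_s3_bucket": {"logs": {"acl": "private"}}}
}

# Before (conceptually): jmespath.search("resource.aws_s3_bucket.logs.acl", resource_config)
# After: walk the dicts directly, with no query-string parsing per call.
def get_nested(data: Dict[str, Any], *keys: str) -> Optional[Any]:
    """Walk nested dicts, returning None if any key is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current

acl = get_nested(resource_config, "resource", "aws_s3_bucket", "logs", "acl")
```

The direct version trades the expressiveness of a query language for a plain loop over known keys, which is exactly the trade-off that pays off in a hot path.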

My current branch can be found here: https://github.com/bridgecrewio/checkov/compare/master...andir:random-hacking

The commit messages should detail a bit of why I made each change and/or what the improvement was. This isn't really in a state where I'd like it to be merged into checkov, as I haven't verified that there was no functional change.

One of the takeaways from this journey is that checkov should probably be more strict about the types it passes along and use more type annotations. Having to sprinkle isinstance everywhere isn't just slow (because of the sheer number of checks being done) but also bad style, as the code always contains one or more additional logical branches depending on the runtime type. The burden of passing in the right types should be on the caller, not the callee.
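As a made-up illustration of that point (neither function is real checkov code): the defensive style pays for runtime type dispatch on every call and carries one branch per accepted type, while the annotated style states the contract in the signature and keeps a single code path:

```python
from typing import Any, List

# Defensive style: the callee re-checks its input at runtime, so every
# call carries isinstance overhead and extra logical branches.
def normalize_tags_defensive(tags: Any) -> List[str]:
    if isinstance(tags, str):
        return [tags]
    if isinstance(tags, list):
        return [str(t) for t in tags]
    return []

# Annotated style: the signature pushes the obligation onto the caller,
# and the body has exactly one path.
def normalize_tags(tags: List[str]) -> List[str]:
    return list(tags)
```

With the annotated version, a static checker such as mypy can flag a caller passing the wrong type before the code ever runs.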

andir commented 3 years ago

Update: Using the above linked branch and running it against the same TF repository I mentioned in the original post the runtime is now down to 30m from the 1h and 50min I observed previously. This still requires a ~5-10x improvement before it is remotely acceptable for our CI runners but I am confident that we might be able to get there.

schosterbarak commented 3 years ago

@robeden wdyt on the jmespath changes?

robeden commented 3 years ago

@andir - Thanks for the report and apologies for the issues. I knew the current parser implementation isn't very efficient, but that's a good order of magnitude worse than I would have expected.

Avoidance of jmespath and usage of the lru_cache makes sense, but we can probably do better.

The problem is going to stem from the "looping parser" approach. The ultimate fix is to process as a graph, and fortunately that work is ongoing. That should make performance as efficient as possible. For the current implementation, I think we should be able to cut out some whole branches of work.

@andir, can I assume that in your repository you have modules loading other modules? We're probably hitting some M*N loop scenarios that we might be able to cut out.
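A toy model of that M*N blow-up (the module names, workspace counts, and parse function below are invented for illustration): if every module instance re-parses its source file, the parse count multiplies with the number of instances, whereas caching by path keeps it linear in the number of distinct files:

```python
from functools import lru_cache

PARSE_CALLS = {"count": 0}

def expensive_parse(path: str) -> str:
    """Stand-in for the costly HCL parse of one module source file."""
    PARSE_CALLS["count"] += 1
    return f"ast:{path}"

@lru_cache(maxsize=None)
def cached_parse(path: str) -> str:
    return expensive_parse(path)

# Three workspaces each instantiate the same two modules: 6 instances total.
instances = [(f"ws{i}", mod) for i in range(3) for mod in ("vpc.tf", "db.tf")]

for _workspace, module_path in instances:
    cached_parse(module_path)

# Without the cache all 6 instances would re-parse; with it, only the
# 2 distinct files are ever parsed.
```

The same effect compounds when modules instantiate other modules, which is why nesting makes the uncached version so much worse than linear.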

andir commented 3 years ago

> @andir - Thanks for the report and apologies for the issues. I knew the current parser implementation isn't very efficient, but that's a good order of magnitude worse than I would have expected.

> Avoidance of jmespath and usage of the lru_cache makes sense, but we can probably do better.

:+1:

> The problem is going to stem from the "looping parser" approach. The ultimate fix is to process as a graph, and fortunately that work is ongoing. That should make performance as efficient as possible. For the current implementation, I think we should be able to cut out some whole branches of work.

Yeah, the graph library would be my next stop for seeing what can be optimized. It didn't look particularly efficient in execution. It seems to be doing a lot of type checking yet again (and probably other things that aren't great either).

> @andir, can I assume that in your repository you have modules loading other modules? We're probably hitting some M*N loop scenarios that we might be able to cut out.

Yes, we have a lot of internal modules that are basically just instantiating other modules to deduplicate most of our logic.

Feel free to loop me in on any ideas you have towards improving the code. I am happy to spend some of my (evening) time on this :-)

andir commented 3 years ago

I just pushed a few more changes to the mentioned branch. The changes I made were reducing the duplicate dpath lookups and replacing one of them with a simple dictionary get() call. I am now down to 23 min of execution time. We could probably make this faster by ditching dpath entirely. It doesn't seem like the graph is the primary problem but rather the way the lookups are made.
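The shape of that optimization, in an invented example (the `graph_attrs` structure and its keys are not real checkov data): repeated path-string lookups into the same subtree each re-walk the structure from the root, so the shared prefix gets hoisted into a local once and the leaves become plain `get()` calls:

```python
# Illustrative only: structure and keys are made up for the example.
graph_attrs = {"config": {"versioning": {"enabled": True}, "acl": "private"}}

# Before (conceptually): two separate dpath-style queries, each walking
# from the root and parsing a path string:
#   dpath.util.get(graph_attrs, "config/versioning/enabled")
#   dpath.util.get(graph_attrs, "config/acl")
# After: hoist the shared subtree once, then use plain dict get() calls.
config = graph_attrs.get("config", {})
versioning_enabled = config.get("versioning", {}).get("enabled", False)
acl = config.get("acl")
```

Each eliminated query saves both the path parsing and the redundant traversal of the shared prefix, which adds up quickly in a hot loop.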

robeden commented 3 years ago

I tested the lru_cache on find_var_blocks and I'm seeing unit test failures when that's added. Still trying to figure out if they're real problems or not...

andir commented 3 years ago

> I tested the lru_cache on find_var_blocks and I'm seeing unit test failures when that's added. Still trying to figure out if they're real problems or not...

I could imagine there being failures in tests if the returned objects are being mutated somewhere. That would likely lead to the cache returning "dirty" results instead of what you would expect.
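That failure mode is easy to reproduce in isolation (the `tokenize` function is a toy stand-in, not the real `find_var_blocks`): `lru_cache` returns the same object on every hit, so if one caller mutates the returned list, every later caller observes the mutation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize(value: str) -> list:
    """Toy stand-in for a cached parse function returning a list."""
    return value.split(".")

first = tokenize("var.foo")
first.append("MUTATED")        # some caller scribbles on the cached result

second = tokenize("var.foo")   # cache hit: the very same, now-"dirty" object
# second == ["var", "foo", "MUTATED"]
```

The usual mitigations are returning an immutable type (e.g. a tuple) from the cached function or having callers copy before mutating.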

robeden commented 3 years ago

> I could imagine there being failures in tests if the returned objects are being mutated somewhere. That would likely lead to the cache returning "dirty" results instead of what you would expect.

I don't think that should be the case... but it's possible. Sorry, haven't had time to dig into it more yet.

schosterbarak commented 3 years ago

@andir @robeden please hold off on new commits for this. @nimrodkor's team is leading a big performance change; the PR should be opened soon. I hope we will see different results after those changes. We did see a very big improvement on our own repos.

schosterbarak commented 3 years ago

here it is: https://github.com/bridgecrewio/checkov/pull/1023/files

robeden commented 3 years ago

@andir - Nimrod reports run time reduction of about 60%, but I expect it to be much higher in your case. The iterative approach has a loop on variables and then a loop on modules (possibly at multiple levels) so I expect you’re seeing an “N * M” (at least) scenario.

andir commented 3 years ago

Thanks for the work! I'll give this a try in a bit. It is a massive PR that must have been cooking for a while! Will have a look at it.

andir commented 3 years ago

I just finished running this against the same repository as mentioned initially. The entire process now only takes about 4 minutes instead of the initial 1h 50min! I did not apply any patches to the code.

Not sure if it is new, but I didn't manage to run this code without network access (anymore?). I had to set an invalid HTTP proxy to get it going offline:

```shell
$ docker run -e https_proxy=http://localhost --network none --rm -v /my-code:/my-code -it checkov:local --no-guide -d /my-code
```

schosterbarak commented 3 years ago

hi @andir can you share the results with latest checkov version? did it improve the duration?

andir commented 3 years ago

> hi @andir can you share the results with latest checkov version? did it improve the duration?

I just ran the latest version from the master branch (99707fd10a5052b798314342f012ef4e139f92eb) against my repo.

Testing just our modules subdir is now taking only 19 seconds. Running it against all the workspaces at the same time takes about 2 minutes and 22 seconds.

A huge improvement from the initial results! Thanks for your work on this! This is already much closer to what I would consider production ready :-)

schosterbarak commented 3 years ago

great @andir - i'm closing the issue for now. If you'd like to make further improvements, we would accept a PR.