cartography-cncf / cartography

Cartography is a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database.
https://cartography-cncf.github.io/cartography/
Apache License 2.0

Analysis and cleanup jobs are inconsistently placed #820

Open achantavy opened 2 years ago

achantavy commented 2 years ago

Description:


By default, the final cartography sync stage runs all analysis jobs located in the cartography/data/jobs/analysis folder. I noticed that there are cases where we call run_analysis_job() out of band: when syncing iaminstanceprofiles, when analyzing lambda-to-ecr relationships, in S3 acls, everything here, and probably others.

The problem with these one-off calls is that, by default, the same analysis jobs all run a second time when we reach the final sync stage. This is wasted work and adds time to the sync, especially on a large graph.

To summarize,

  1. We need to run analysis on the entire graph. We currently do this in the default final sync stage where all jobs in cartography/data/jobs/analysis are run.
  2. We need to run analysis on a segment of the graph, for example a single AWS account or single GCP project. We currently do this with one-off calls to run_analysis_job().
  3. We need to avoid running any of these jobs twice. This could be accomplished by splitting cartography/data/jobs/analysis into separate folders, or other ideas.
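
One way to picture requirement 3 is a small tracker that records which jobs were already run out of band, so the final stage can skip them. This is only a sketch under stated assumptions: `AnalysisJobTracker`, `run_scoped`, `run_remaining`, the `run_job` callable, and the job file names are all hypothetical and are not part of cartography. In a real sync, `run_job` would wrap whatever actually executes a job file.

```python
from typing import Callable, Iterable, List, Set, Tuple


class AnalysisJobTracker:
    """Hypothetical helper: remembers which analysis jobs ran during this sync."""

    def __init__(self, run_job: Callable[[str, str], None]) -> None:
        # run_job(job_name, scope) stands in for the real job executor.
        self._run_job = run_job
        self._ran_scoped: Set[str] = set()

    def run_scoped(self, job_name: str, scope: str) -> None:
        """Run a job on one segment of the graph (e.g. one account)."""
        self._run_job(job_name, scope)
        self._ran_scoped.add(job_name)

    def run_remaining(self, all_jobs: Iterable[str]) -> List[str]:
        """Final stage: run graph-wide only the jobs no earlier stage ran."""
        ran_now = []
        for job_name in all_jobs:
            if job_name in self._ran_scoped:
                continue  # already handled out of band, skip the second run
            self._run_job(job_name, "*")
            ran_now.append(job_name)
        return ran_now


executed: List[Tuple[str, str]] = []
tracker = AnalysisJobTracker(lambda name, scope: executed.append((name, scope)))

# Module stages run one job per account (names are made up for illustration).
tracker.run_scoped("scoped_job_a.json", "account-1")
tracker.run_scoped("scoped_job_a.json", "account-2")

# The final stage receives the full job list and skips what already ran.
ran_in_final_stage = tracker.run_remaining(
    ["scoped_job_a.json", "global_job_b.json"]
)
```

The tracker deduplicates by job name only. Skipping a job in the final stage is safe only if the scoped runs covered every segment that job applies to, which is an assumption here. Splitting the jobs into separate folders, as suggested above, would make that distinction explicit instead of tracking it at run time.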

Cartography release version or commit hash: 0.56.0

ramonpetgrave64 commented 2 years ago

From @ryan-lane https://github.com/lyft/cartography/pull/826#issuecomment-1109419468

Could we separate out the analysis jobs that are intended to run along with the code, from the ones that are intended to connect nodes between modules? It may be good to put analysis and cleanup files for modules directly into their modules.

To be honest, I wasn't aware that the folder was run at the end of every sync, and for anyone who changes the value of that option, that's not how it runs.