This could potentially be improved by running parallel processes, but such an improvement is not currently a priority.
As a workaround, you can split your inventory file into multiple files and run CFT scorecard separately on each.
Implementing this would require working across a few files, but the main logic is here: https://github.com/GoogleCloudPlatform/cloud-foundation-toolkit/blob/master/cli/scorecard/violations.go#L103
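For example, if the inventory is a newline-delimited CAI export (one asset per line), a small helper along these lines could do the splitting — a sketch only, not part of the CLI:

package main

import (
	"bufio"
	"fmt"
	"os"
)

// splitInventory writes every `linesPerChunk` lines of a newline-delimited
// CAI export into its own file, so each chunk can be fed to a separate
// `cft scorecard` run.
func splitInventory(path string, linesPerChunk int) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	scanner := bufio.NewScanner(in)
	scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // individual asset lines can be large
	var out *os.File
	for i := 0; scanner.Scan(); i++ {
		if i%linesPerChunk == 0 {
			if out != nil {
				out.Close()
			}
			out, err = os.Create(fmt.Sprintf("%s.part%d.json", path, i/linesPerChunk))
			if err != nil {
				return err
			}
		}
		if _, err := fmt.Fprintln(out, scanner.Text()); err != nil {
			return err
		}
	}
	if out != nil {
		out.Close()
	}
	return scanner.Err()
}

func main() {
	if err := splitInventory("inventory.json", 10000); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}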
Did you guys profile this application earlier?
Hello again Morgante,
I have profiled the application, and it looks like the majority of the time is spent as follows. It should be easier for you to pinpoint where the issue is, and I can then attempt a fix:
Hi @morgante
As a first step I am trying to use a newline-delimited JSON parser, and it looks like I am hitting a bug. Are you able to advise on the following code?
func getDataFromReaderNew(reader io.Reader) ([]*validator.Asset, error) {
	var pbAssets []*validator.Asset
	decoder := json.NewDecoder(reader)
	for {
		pbAsset := &validator.Asset{}
		if err := decoder.Decode(pbAsset); err == io.EOF {
			break // done decoding file
		} else if err != nil {
			return nil, errors.Wrap(err, "decoding asset JSON")
		}
		// Some exported org policies carry malformed UpdateTime values
		// (e.g. "1900-01-01T01:00:006Z") that fail RFC 3339 parsing during
		// constraint evaluation, so drop them before re-marshaling.
		if orgPolicy := pbAsset.GetOrgPolicy(); orgPolicy != nil {
			for _, v := range orgPolicy {
				v.UpdateTime = nil
			}
		}
		// Round-trip through jsonpb so proto-specific fields (and unknown
		// fields) are handled correctly.
		jsn, err := json.Marshal(pbAsset)
		if err != nil {
			return nil, errors.Wrap(err, "marshaling to json")
		}
		umar := &jsonpb.Unmarshaler{AllowUnknownFields: true}
		pbAsset1 := &validator.Asset{}
		if err := umar.Unmarshal(bytes.NewReader(jsn), pbAsset1); err != nil {
			return nil, errors.Wrap(err, "unmarshaling to proto")
		}
		if err := cvasset.SanitizeAncestryPath(pbAsset1); err != nil {
			return nil, errors.Wrapf(err, "fetching ancestry path for %s", pbAsset1)
		}
		pbAssets = append(pbAssets, pbAsset1)
	}
	return pbAssets, nil
}
The error is:
GCP target Constraint Framework review call failed: validation.gcp.forsetisecurity.org: __modset_templates["validation.gcp.forsetisecurity.org"]["GCPIAMRestrictServiceAccountKeyAgeConstraintV1"]_idx_0:39: eval_builtin_error: time.parse_rfc3339_ns: parsing time "1900-01-01T01:00:006Z" as "2006-01-02T15:04:05Z07:00": cannot parse "6Z" as "Z07:00"
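(For context, the error string comes from Go's time.Parse, so the failure can be reproduced with the standard time package alone. The malformed seconds field "006" leaves a stray "6" that the parser then tries to match against the zone designator. A minimal sketch:)

package main

import (
	"fmt"
	"time"
)

func main() {
	// After the two-digit seconds "00" is consumed, the leftover "6" is
	// matched against the zone designator "Z07:00" and parsing fails.
	_, err := time.Parse(time.RFC3339, "1900-01-01T01:00:006Z")
	fmt.Println(err)
	// parsing time "1900-01-01T01:00:006Z" as "2006-01-02T15:04:05Z07:00":
	// cannot parse "6Z" as "Z07:00"
}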
Hey @hussainak,
Thanks for working on this! You actually shouldn't need to do anything special in the getDataFromReader function — the default scanner already splits by lines today.
Instead, I would concentrate on parallelizing what happens after an asset has been scanned in. Today, the code builds an array of all assets. You probably want to look at sending assets into a channel after they are scanned, where parallel routines can then check them for violations.
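Something along these lines — a minimal sketch of the channel-based fan-out, with hypothetical Asset/Violation stand-ins rather than the actual scorecard types:

package main

import (
	"fmt"
	"sync"
)

// Asset and Violation stand in for validator.Asset and RichViolation.
type Asset struct{ Name string }
type Violation struct{ Resource string }

// reviewAll drains the assets channel with a bounded number of workers and
// collects every violation under a mutex, so the append is race-free.
func reviewAll(assets <-chan *Asset, review func(*Asset) []*Violation, workers int) []*Violation {
	var (
		mu  sync.Mutex
		out []*Violation
		wg  sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for asset := range assets {
				vs := review(asset)
				mu.Lock()
				out = append(out, vs...)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return out
}

func main() {
	assets := make(chan *Asset)
	go func() { // producer: decode assets and feed them to the workers
		defer close(assets)
		for i := 0; i < 10; i++ {
			assets <- &Asset{Name: fmt.Sprintf("asset-%d", i)}
		}
	}()
	review := func(a *Asset) []*Violation { return []*Violation{{Resource: a.Name}} }
	fmt.Println(len(reviewAll(assets, review, 4))) // 10
}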
@morgante I have looked at what's consuming the most time, and it looks like it is the second part of the getViolations() method, which builds the RichViolations. I changed it to use ParallelValidator, but it performs even worse, and ParallelValidator also doesn't allow attaching the asset to a RichViolation. The changes I made and tested are below:
//score.go

// ScoringConfig holds settings for generating a score.
type ScoringConfig struct {
	categories  map[string]*constraintCategory   // available constraint categories
	constraints map[string]*constraintViolations // a map of constraints violated and their violations
	validator   *gcv.ParallelValidator           // the validator instance used for scoring
}

// NewScoringConfigFromValidator creates a scoring engine with a given validator.
func NewScoringConfigFromValidator(v *gcv.ParallelValidator) *ScoringConfig {
	config := &ScoringConfig{}
	config.validator = v
	return config
}

// NewScoringConfig creates a scoring engine for the given policy library.
func NewScoringConfig(ctx context.Context, policyPath string) (*ScoringConfig, error) {
	flag.Parse()
	stopChannel := make(chan struct{}, 200)
	//defer close(stopChannel)
	cv, err := gcv.NewValidator([]string{filepath.Join(policyPath, "policies")},
		filepath.Join(policyPath, "lib"))
	if err != nil {
		return nil, err
	}
	v := gcv.NewParallelValidator(stopChannel, cv)
	config := NewScoringConfigFromValidator(v)
	return config, nil
}
//violations.go

// getViolations finds all Config Validator violations for a given Inventory.
func getViolations(inventory *InventoryConfig, config *ScoringConfig) ([]*RichViolation, error) {
	var err error
	var pbAssets []*validator.Asset
	start := time.Now()
	fmt.Println("Fetching data: ", start)
	if inventory.bucketName != "" {
		pbAssets, err = getDataFromBucket(inventory.bucketName)
		if err != nil {
			return nil, errors.Wrap(err, "Fetching inventory from Bucket")
		}
	} else if inventory.dirPath != "" {
		pbAssets, err = getDataFromFile(inventory.dirPath)
		if err != nil {
			return nil, errors.Wrap(err, "Fetching inventory from local directory")
		}
	} else if inventory.readFromStdin {
		pbAssets, err = getDataFromStdin()
		if err != nil {
			return nil, errors.Wrap(err, "Reading from stdin")
		}
	}
	duration := time.Since(start)
	fmt.Println("Fetched data in: ", duration)
	start = time.Now()
	fmt.Println("Reviewing Violations: ", start)
	richViolations := make([]*RichViolation, 0)
	violations, err := config.validator.Review(context.Background(), &validator.ReviewRequest{
		Assets: pbAssets,
	})
	if err != nil {
		return nil, errors.Wrap(err, "reviewing assets")
	}
	duration = time.Since(start)
	fmt.Println("Reviewed Violations in: ", duration)
	fmt.Println(violations)
	// ParallelValidator's Review response does not carry the source asset,
	// so the RichViolation construction below cannot attach it:
	//for _, violation := range violations.Violations {
	//	richViolation := RichViolation{*violation, "", violation.Resource, violation.Message, violation.Metadata, nil}
	//	richViolations = append(richViolations, &richViolation)
	//}
	return richViolations, nil
}
Running normally, the validator processes the 400MB file in 37 minutes on my macOS machine, whereas the ParallelValidator takes longer. Some numbers:
//With stopChannel := make(chan struct{})
Generating CFT scorecard
Fetching data: 2021-07-19 21:51:38.843216 +1000 AEST m=+11.227749735
Fetched data in: 2m8.447783655s
Reviewing Violations: 2021-07-19 21:53:47.289137 +1000 AEST m=+139.675554602
Reviewed Violations in: 57m20.78327486s
//With stopChannel := make(chan struct{}, 20)
Generating CFT scorecard
Fetching data: 2021-07-20 11:48:29.249778 +1000 AEST m=+0.637908807
Fetched data in: 1m5.281816316s
Reviewing Violations: 2021-07-20 11:49:34.531355 +1000 AEST m=+65.919745659
Reviewed Violations in: 1h6m12.25064129s
Hi @morgante
Here's an update: I made a separate function to process violations in parallel and ran it over a 100MB CAI file. It gets into some sort of race condition on the 400MB file (I used sync.WaitGroup()), and I think the number of goroutines that are spun up needs to be bounded, so I went with a published third-party workerpool module. Let me know how you would like to proceed. Here are the results with and without parallel routines:
Generating CFT scorecard - getViolations()
Fetching data: 2021-07-21 16:11:10.303949 +1000 AEST m=+1.313751808
Fetched data in: 22.685511317s
Reviewing Violations: 2021-07-21 16:11:32.989771 +1000 AEST m=+23.999296513
Reviewed Violations in: 6m31.30912049s
Time taken to write results: 752.258045ms
Generating CFT scorecard - getViolationsParallel()
Fetching data: 2021-07-21 17:31:16.386379 +1000 AEST m=+1.312408356
Fetched data in: 21.755561449s
Reviewing Violations: 2021-07-21 17:31:38.141739 +1000 AEST m=+23.068002501
Reviewed Violations in: 1m47.960509658s
Time taken to write results: 1.256456571s
Parallel code snippet
func getViolationsParallel(inventory *InventoryConfig, config *ScoringConfig) ([]*RichViolation, error) {
	var err error
	var pbAssets []*validator.Asset
	start := time.Now()
	fmt.Println("Fetching data: ", start)
	if inventory.bucketName != "" {
		pbAssets, err = getDataFromBucket(inventory.bucketName)
		if err != nil {
			return nil, errors.Wrap(err, "Fetching inventory from Bucket")
		}
	} else if inventory.dirPath != "" {
		pbAssets, err = getDataFromFile(inventory.dirPath)
		if err != nil {
			return nil, errors.Wrap(err, "Fetching inventory from local directory")
		}
	} else if inventory.readFromStdin {
		pbAssets, err = getDataFromStdin()
		if err != nil {
			return nil, errors.Wrap(err, "Reading from stdin")
		}
	}
	duration := time.Since(start)
	fmt.Println("Fetched data in: ", duration)
	start = time.Now()
	fmt.Println("Reviewing Violations: ", start)
	richViolations := make([]*RichViolation, 0)
	// The workers run concurrently, so the shared slice and the first
	// recorded error must be guarded by a mutex to avoid a data race.
	var mu sync.Mutex
	var reviewErr error
	wp := workerpool.New(5)
	for _, asset := range pbAssets {
		asset := asset // capture the loop variable for the closure
		wp.Submit(func() {
			violations, err := config.validator.ReviewAsset(context.Background(), asset)
			if err != nil {
				mu.Lock()
				if reviewErr == nil {
					reviewErr = errors.Wrapf(err, "reviewing asset %s", asset)
				}
				mu.Unlock()
				return
			}
			mu.Lock()
			for _, violation := range violations {
				richViolation := RichViolation{*violation, "", violation.Resource, violation.Message, violation.Metadata, asset}
				richViolations = append(richViolations, &richViolation)
			}
			mu.Unlock()
		})
	}
	wp.StopWait()
	if reviewErr != nil {
		return nil, reviewErr
	}
	duration = time.Since(start)
	fmt.Println("Reviewed Violations in: ", duration)
	return richViolations, nil
}
For the 400MB inventory file, it's a very good improvement:
Generating CFT scorecard
Fetching data: 2021-07-21 17:39:16.939089 +1000 AEST m=+1.138443251
Fetched data in: 1m22.735131396s
Reviewing Violations: 2021-07-21 17:40:39.673351 +1000 AEST m=+83.873593801
Reviewed Violations in: 13m8.961828329s
Time taken to write results: 7.194955508s
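For what it's worth, the goroutine bounding that the workerpool module provides could also be done with the standard library alone — a minimal sketch using a buffered channel as a semaphore (hypothetical names, not the scorecard code):

package main

import (
	"fmt"
	"sync"
)

// forEachBounded runs fn over items with at most `workers` goroutines in
// flight; the buffered channel acts as a counting semaphore.
func forEachBounded(items []int, workers int, fn func(int)) {
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	for _, item := range items {
		item := item
		wg.Add(1)
		sem <- struct{}{} // blocks once `workers` goroutines are running
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			fn(item)
		}()
	}
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	total := 0
	forEachBounded([]int{1, 2, 3, 4, 5}, 2, func(n int) {
		mu.Lock()
		total += n
		mu.Unlock()
	})
	fmt.Println(total) // 15
}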
This does look like a big improvement, thanks for working on it! Assuming you're using this workerpool library, I think that is fine.
The next step would be for you to submit a pull request where we can discuss the code details.
Could I be given permission to push my branch please?
remote: Permission to GoogleCloudPlatform/cloud-foundation-toolkit.git denied to hussainak.
fatal: unable to access 'https://github.com/GoogleCloudPlatform/cloud-foundation-toolkit.git/': The requested URL returned error: 403
You should make a fork: https://gist.github.com/Chaser324/ce0505fbed06b947d962
Hi @morgante
I have created a pull request for this here: https://github.com/GoogleCloudPlatform/cloud-foundation-toolkit/pull/964
Just a quick note that I have added a numWorkers() method based on the number of CPUs available to Go. However, I found that limiting it to 4 workers gives better performance, so I have capped it there. Performance results below:
4 workers 400MB
time ./cft scorecard --policy-path /Users/hak/Desktop/Development/GCP/gcp-policy-library/policy-library --dir-path /Users/hak/Desktop/Development/GCP/cft-inventory --output-path . --output-format csv --concurrency
Generating CFT scorecard
Fetching data: 2021-07-23 13:56:02.797504 +1000 AEST m=+1.121431290
Fetched data in: 1m21.072152958s
Reviewing Violations: 2021-07-23 13:57:23.968598 +1000 AEST m=+82.193605434
Reviewed Violations in: 11m46.900796727s
Time taken to write results: 6.429203758s
./cft scorecard --policy-path --dir-path --output-path . --output-format cs 3055.22s user 746.32s system 449% cpu 14:05.97 total
12 workers 400MB
time ./cft scorecard --policy-path /Users/hak/Desktop/Development/GCP/gcp-policy-library/policy-library --dir-path /Users/hak/Desktop/Development/GCP/cft-inventory --output-path . --output-format csv --concurrency
Generating CFT scorecard
Fetching data: 2021-07-23 13:26:54.641019 +1000 AEST m=+1.221005861
Fetched data in: 1m25.321140218s
Reviewing Violations: 2021-07-23 13:28:19.962972 +1000 AEST m=+86.542165822
Reviewed Violations in: 22m43.458904001s
Time taken to write results: 7.503422589s
./cft scorecard --policy-path --dir-path --output-path . --output-format cs 4934.00s user 3166.56s system 529% cpu 25:29.87 total
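For reference, a minimal sketch of what the numWorkers() cap described above could look like (a hypothetical reconstruction; the actual PR may differ):

import "runtime"

// numWorkers picks the size of the review worker pool. runtime.NumCPU()
// reports the CPUs available to the Go runtime, but in testing more than
// 4 workers degraded throughput, so the count is capped at 4.
func numWorkers() int {
	if n := runtime.NumCPU(); n < 4 {
		return n
	}
	return 4
}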
Hi @morgante I have pushed the changes again. Two things I'd point out:
1. You should probably have a new command under cft to query the number of CPUs and procs available on a system.
2. We could also accept -concurrency 0 to auto-detect the preferred concurrency, as in the sketch below.
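A minimal sketch of that auto-detect behavior (hypothetical wiring, reusing the numWorkers() sketch above):

// resolveConcurrency treats -concurrency 0 as "auto-detect" and falls
// back to the CPU-capped default instead of starting zero workers.
func resolveConcurrency(flagValue int) int {
	if flagValue == 0 {
		return numWorkers()
	}
	return flagValue
}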
We have seen a massive (7x) improvement at our company. This ticket can be closed now.
Hi there,
We are running cft scorecard against a large resource inventory file, and it takes almost an hour to run against the policy library and output the scorecard. Can this be improved? The resource inventory file is about 500MB.