GoogleCloudPlatform / security-response-automation

Take automated actions against threats and vulnerabilities.
Apache License 2.0
208 stars 52 forks source link

Add additional context to errors #151

Closed daniel-cit closed 4 years ago

daniel-cit commented 4 years ago

Changes:

daniel-cit commented 4 years ago

Before the change

 "Function error: googleapi: Error 400: Invalid request: Invalid request since instance is not running., invalid

After the change

 "Function error: failed to execute update password automation with ProjectID: "decent-ellipse-262323", InstanceName: "test", Host: "%!"(MISSING) , UserName: "root" and password: ******: failed while performing update root password for MySQL instance "test" on project "decent-ellipse-262323": failed to update user password instance: "test" on project "decent-ellipse-262323": googleapi: Error 400: Invalid request: Invalid request since instance is not running., invalid

New log examples:

"2019/12/28 01:03:05 failed to publish action "remove_public_ip" for rule "public_ip_address": "failed to publish to \"threat-findings-remove-public-ip\" for action \"remove_public_ip\": rpc error: code = PermissionDenied desc = User not authorized to perform this action."" 
"Function error: failed to execute remove public IP automation with values: {ProjectID:decent-ellipse-262323 InstanceZone:us-central1-a InstanceID:4671494672883189653 DryRun:false}: failed to remove public ip: failed to get instance: "googleapi: Error 403: Required 'compute.instances.get' permission for 'projects/decent-ellipse-262323/zones/us-central1-a/instances/4671494672883189653', forbidden"

I think that it is important to log the values if we got an error, it is not possible to find out which finding was processed without this info. It can be changed to log the values individually, %q %q %q instead of using just the %+v.

Some cleaning can be done on the inner call to not log the same values again but I think that the outermost one in exec.go is necessary @tomscript

tomscript commented 4 years ago

Before the change

 "Function error: googleapi: Error 400: Invalid request: Invalid request since instance is not running., invalid

After the change

 "Function error: failed to execute update password automation with ProjectID: "decent-ellipse-262323", InstanceName: "test", Host: "%!"(MISSING) , UserName: "root" and password: ******: failed while performing update root password for MySQL instance "test" on project "decent-ellipse-262323": failed to update user password instance: "test" on project "decent-ellipse-262323": googleapi: Error 400: Invalid request: Invalid request since instance is not running., invalid

New log examples:

"2019/12/28 01:03:05 failed to publish action "remove_public_ip" for rule "public_ip_address": "failed to publish to \"threat-findings-remove-public-ip\" for action \"remove_public_ip\": rpc error: code = PermissionDenied desc = User not authorized to perform this action."" 

"Function error: failed to execute remove public IP automation with values: {ProjectID:decent-ellipse-262323 InstanceZone:us-central1-a InstanceID:4671494672883189653 DryRun:false}: failed to remove public ip: failed to get instance: "googleapi: Error 403: Required 'compute.instances.get' permission for 'projects/decent-ellipse-262323/zones/us-central1-a/instances/4671494672883189653', forbidden"

ya i see there's more information here but this is not very readable right? here's an example of createsnapshot that fails. you can see here we already unnecessarily wrap a few things here:

image

failed creating snapshot: "forensic-snapshots-bad-ip-instance-5": failed waiting: first error: fail:

see how many times we wrap? basically every time we say "fail" so is that providing help to the user? does it really help that we list the snapshot? no. we should just report the error by return err which shows:"RESOURCE_ALREADY_EXISTS" goroutine 466 [running]:... why do we wrap with "failed waiting" or "fail"

the context for the error is right above. notice how i only see logs for that specific execution. this is the result of clicking on the filter in the other screenshot. no need for the error to contain anything really except the error.



I think that it is important to log the `values` if we got an error, it is not possible to find out which finding was processed without this info. It can be changed to log the values individually, %q %q %q instead of using just the %+v.

it is not possible to find out which finding was processed without this info

this is possible. when these errors are displayed in stackdriver there is a uuid you can click on so you see all the logs / errors from that run.

image

Some cleaning can be done on the inner call to not log the same values again but I think that the outermost one in exec.go is necessary

this is something to consider. before we do so we should take a step back and understand what is it we're trying to solve. i dont think users of SRA will benefit from a lot of this but rather a developer working on the framework.

in either case we should remember that the user and developer will see the entire log for the specific execution (as shown in above screenshot) this means you don't need extremely verbose errors because you'll see something like this:

- attempting to do thing foo
- thing failed

the thing failed error doesn't need to be overly complicated, tell the project, tell the automation name because all ofthat is in the same log

@tomscript

daniel-cit commented 4 years ago

@tomscript Is there anything added in this PR that you would want to keep, so it could be moved to a separated PR?

tomscript commented 4 years ago

probably for now the router automation folder grant and roles/brower grant

daniel-cit commented 4 years ago

To be redone(maybe) after further alignment.