Improve Communication of Permission Errors

cloudquery / cloudquery

The open source high performance ELT framework powered by Apache Arrow

https://cloudquery.io

Mozilla Public License 2.0

5.87k stars 513 forks source link

Improve Communication of Permission Errors #1385

Closed bbernays closed 2 years ago

bbernays commented 2 years ago

in cloudquery/cq-provider-aws#955 user reported that they were missing specific resources even though they were using the resource = ["*"]. All of the logs and outputs clearly indicated that there were no errors and everything fetched successfully. In order to understand what was going on user had to rerun fetch with the -v argument.

CloudQuery needs to improve communication to user when there is a permission error. The communication needs to tell the user how to dig into the issue while not overly alarming them as in many environments there will be restrictions that don't enable the user to grab all resources which is perfectly valid.

dancrumb commented 2 years ago

As said user from cloudquery/cq-provider-aws#955 I have some roughly shaped thoughts.

At the top, I'd say a little documentation about permission quirks will help:

Explicitly documenting that ReadOnlyAccess should suffice
Noting a few permission errors that tend not to be a problem
Maybe provide some sort of permission problem summary. This could be via a flag or perhaps a script that parses a verbose log and summarizes
My biggest operational concern was that resources were missing with no indication that there might be a problem to investigate. An overt indication of that would help.
Perhaps an ability to ignore certain permission problems in the config would allow louder errors wherever a real problem occurs

Just some rough-clay thoughts to get the ball rolling

bbernays commented 2 years ago

Thank you very much for this! We really appreciate the feedback!

bbernays commented 2 years ago

We have identified a regression that led to errors getting misclassified. When you run a fetch and have permissions issues you see the following warnings:

Provider fetch complete.

Fetch Diagnostics:

Resource: ec2.instances Type: Access Severity: Ignore
        Summary: table "aws_ec2_instances" resolver ignored error: EC2: DescribeInstances - You are not authorized to perform this operation.
        Detail: You are not authorized to perform this operation. Check your IAM policies, and ensure that you are using the correct access keys. [Repeated:17]

Provider aws fetch summary: ✓ Total Resources fetched: 0    ⚠️ Warnings: 0   ❌ Errors: 0

roneli commented 2 years ago

Should be fixed in cloudquery/cloudquery#1384

shimonp21 commented 2 years ago

In the code, in almost all tables, we call IgnoreCommonErrors, which ignores ACCESS DENIED errors.

https://github.com/cloudquery/cq-provider-aws/blob/fcd04d58b862c4c1a1c4bdb0299b9f90a5a01ffb/client/helpers.go#L187

shimonp21 commented 2 years ago

When looking at this issue, we need to remember to consider the interaction and differences between IgnoreError and ErrorClassifier.

yevgenypats commented 2 years ago

This should be fixed in v2 with consistent and structured logging. We report all errors from API in warn level with the original errors. We are not doing extra summary because this all depends on the user use-case and interest so it is best to create rules, filter with jq / Loki to look for whatever is interesting for each use-case and plugin (i.e permission errors, tables, etc...)