JIT HTTP 403 only for a certain group

Mantook11 commented 9 months ago

Hey, we are facing a very weird issue where JIT returns a 403 only for the members of a certain group. The service account and its roles have been double-checked, and we repeated the deployment twice, following the JIT documentation each time.

The group in question grants read access to every project in the organization, along with some other roles.

What could be causing this issue? It's very confusing that having more permissions results in a 403.

Here is the error: Loading projects failed: Listing available projects failed, see logs for details (HTTP 403: error)

jpassing commented 9 months ago

Could you have a look at the App Engine/Cloud Run logs? There should be a log message Listing available projects failed with some additional information about the root cause. Maybe this additional information contains a hint on why access is being denied.

I suppose one explanation might be that there's an IAM Deny policy in place somewhere.

Mantook11 commented 9 months ago

"Listing available projects failed: Read timed out" is the only message I can see. When I expand it, I see that the call is to "api.listEligibleRoles". But this is not the only API call that gives this error: "Listing project roles failed: Read timed out" is the message when we search for a specific project. We also get a 403 when we try to approve somebody's elevation request.

I also checked the deny policies at the organization level and at the folder level where the projects are, and there seem to be no IAM deny policies.

danjamesmay commented 7 months ago

I'm seeing this too. In my jit deployed project I'm seeing:

{
  "textPayload": "Listing project roles failed: Read timed out",
  "insertId": "",
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "jitaccess",
      "location": "europe-west1",
      "service_name": "jitaccess",
      "revision_name": "jitaccess-xx",
      "project_id": "xx_jit_project"
    }
  },
  "timestamp": "2023-11-27T16:46:12.457574Z",
  "severity": "ERROR",
  "labels": {
    "project": "xx_scoped",
    "event": "api.listEligibleRoles",
    "instanceId": "",
    "error": "SocketTimeoutException",
    "user": "authenticated_user",
    "user_id": "accounts.google.com:115917957702200051582",
    "device_access_levels": "",
    "device_id": "unknown"
  },
  "logName": "projects/jit-access/logs/run.googleapis.com%2Fstdout",
  "trace": "",
  "receiveTimestamp": "2023-11-27T16:46:12.466089919Z"
}

In the scoped project I'm seeing the following 4 times with more or less the same timestamp, like it's spamming without waiting long enough for the api to return info.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 1,
      "message": "Cancelled by client"
    },
    "authenticationInfo": {
      "principalEmail": "jit-access@xx_runtime_sa.iam.gserviceaccount.com",
      "serviceAccountDelegationInfo": [
        {
          "firstPartyPrincipal": {
            "principalEmail": "service-xx_scoped@serverless-robot-prod.iam.gserviceaccount.com"
          }
        }
      ]
    },
    "requestMetadata": {
      "callerIp": "private",
      "requestAttributes": {
        "time": "2023-11-27T16:46:22.700385Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "cloudasset.googleapis.com",
    "methodName": "google.cloud.asset.v1.AssetService.SearchAllIamPolicies",
    "authorizationInfo": [
      {
        "resource": "projectsxx_scopeds",
        "permission": "cloudasset.assets.searchAllIamPolicies",
        "granted": true,
        "resourceAttributes": {
          "service": "cloudresourcemanager.googleapis.com",
          "name": "projects/xx_scoped",
          "type": "cloudresourcemanager.googleapis.com/Project"
        }
      }
    ],
    "resourceName": "projects/xx_scoped",
    "request": {
      "@type": "type.googleapis.com/google.cloud.asset.v1.SearchAllIamPoliciesRequest"
    }
  },
  "insertId": "bxxy2ed886e",
  "resource": {
    "type": "audited_resource",
    "labels": {
      "project_id": "xx_scoped",
      "method": "google.cloud.asset.v1.AssetService.SearchAllIamPolicies",
      "service": "cloudasset.googleapis.com"
    }
  },
  "timestamp": "2023-11-27T16:46:22.411269Z",
  "severity": "ERROR",
  "logName": "projects/xx_scoped/logs/cloudaudit.googleapis.com%2Fdata_access",
  "receiveTimestamp": "2023-11-27T16:46:23.694104957Z"
}

Looks like it's taking too long and the app is timing out... is there a way to increase the timeout to test this?

EDIT: Tried again and looks like the SearchAllIamPoliciesRequest API call returns around 10 seconds after Listing project roles failed: Read timed out is logged by the app.
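A gap of roughly that size can also be read off the two log entries quoted above. The following is a minimal sketch that subtracts the application log's `timestamp` from the audit log's `timestamp`, assuming both entries belong to the same request (the thread does not confirm this). The class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class LogGap {
    /** Returns the time elapsed between two RFC 3339 timestamps. */
    public static Duration gap(String earlier, String later) {
        return Duration.between(Instant.parse(earlier), Instant.parse(later));
    }

    public static void main(String[] args) {
        // "timestamp" of the Cloud Run entry ("Read timed out") and of the
        // audit log entry ("Cancelled by client"), copied from the logs above.
        Duration d = gap("2023-11-27T16:46:12.457574Z", "2023-11-27T16:46:22.411269Z");
        System.out.println(d.toMillis() + " ms"); // prints "9953 ms"
    }
}
```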

I can see there's meant to be a 30 second timeout on the api (https://github.com/GoogleCloudPlatform/jit-access/blame/1965e26cd7d6183f8b9d6fe59fdee333db3d71d4/sources/src/main/java/com/google/solutions/jitaccess/core/adapters/AssetInventoryAdapter.java#L104), so not sure it's waiting the full 30 seconds.

...

EDIT 2: more digging...

I can see the UI is calling https://xx_jit_project.dunnhumby.cloud/api/projects/xx_scoped_project_id/roles under the hood, and this seems to be timing out consistently at 20s.

EDIT 3:

So I deployed a custom image with that ANALYZE_IAM_POLICY_TIMEOUT_SECS timeout set to 2 minutes and still saw the 20 second timeout in the UI... Stumped now, it must be a symptom of another misconfiguration.

jpassing commented 7 months ago

> So I deployed a custom image with that ANALYZE_IAM_POLICY_TIMEOUT_SECS timeout set to 2 minutes and still saw the 20 second timeout in the UI... Stumped now, it must be a symptom of another misconfiguration.

That's an interesting observation. Do you use App Engine or Cloud Run? And could you share the server's response from the Response tab? The response should give an indication on whether it's the application that's aborting the request, the load balancer, or the client-side JS.

Mantook11 commented 7 months ago

We actually managed to solve the issue by creating a new adapter that extends the credential HTTP adapter and sets the timeout of the HTTP request to a higher value. Not a nice way to do things, but for now it gets the job done.

danjamesmay commented 7 months ago

> > So I deployed a custom image with that ANALYZE_IAM_POLICY_TIMEOUT_SECS timeout set to 2 minutes and still saw the 20 second timeout in the UI... Stumped now, it must be a symptom of another misconfiguration.
>
> That's an interesting observation. Do you use App Engine or Cloud Run? And could you share the server's response from the Response tab? The response should give an indication on whether it's the application that's aborting the request, the load balancer, or the client-side JS.

I'm using Cloud Run for this deployment. These are the response headers:

Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Cache-Control: no-cache
Content-Length: 64
Content-Type: application/json
Date: Wed, 29 Nov 2023 17:09:48 GMT
Server: Google Frontend
Via: 1.1 google
X-Cloud-Trace-Context: abb9a4e802c7e883cd3c602e8b62181c;o=1

I initially deployed this app completely manually through the UI and everything worked fine. I then deployed everything via Terraform using Google's APIs, and everything works except this strange timeout.

I did notice when the RESOURCE_SCOPE is set incorrectly (i.e. is set to a resource where the runtime SA doesn't have roles/securityAdmin and roles/assetViewer), the api does seem to respond within a couple seconds max saying the runtime account doesn't have access to check the scope. So I feel like the app is able to talk to the API as it's getting a proper response. It's the Read timed out error message that's bugging me.

I feel like if there was a debug mode to print more verbose info on where this error originates in the underlying google api library that this app is using, we could get a better idea.

This line seems to be where the error is coming from for project listing... and for the role listing

The error is the same for both api calls (the project listing and role listing), so talking about these interchangeably now.

I'd love to contribute some optional verbose logging but will likely just hack in some extra log lines to try and get some more info.

EDIT: I've also thought about simulating the API call the app is doing under the hood, to see how long it's actually taking. That said, both my manually deployed (working) solution and my TF-deployed (broken) solution are seemingly performing the same API call, as they're both scoped to the same testing project and I'm the user, so there shouldn't be a difference here.

It sounds like @Mantook11 is talking about a different timeout than the one I am, possibly in the underlying API library? Would you mind pointing me in the direction of where this can be configured?

jpassing commented 7 months ago

@Mantook11, I guess you're right... the client library uses a default read timeout of 20 seconds, which would match the 20-second timeout that @danjamesmay has been observing.
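To illustrate the mechanism in isolation, here is a minimal sketch that uses only the JDK, not the Google client library or the JIT Access code. A local server stands in for a slow backend API; a client whose read timeout is shorter than the server's latency gets a `SocketTimeoutException` (the JDK's message is typically "Read timed out"), while the same request with a longer read timeout succeeds. The class name, the `/roles` path, and the delays are made up for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class ReadTimeoutDemo {
    /** Starts a local server that waits delayMillis before answering (a stand-in for a slow API). */
    public static HttpServer startSlowServer(long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/roles", exchange -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "[]".getBytes(StandardCharsets.UTF_8);
            try {
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            } catch (IOException ignored) {
                // The client may already have given up and closed the connection.
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    /** Fetches /roles with the given read timeout; returns the body or a "timeout: ..." marker. */
    public static String fetch(int port, int readTimeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
            URI.create("http://127.0.0.1:" + port + "/roles").toURL().openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(readTimeoutMillis); // the setting that matters here
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (SocketTimeoutException e) {
            return "timeout: " + e.getMessage();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startSlowServer(500);
        int port = server.getAddress().getPort();
        try {
            System.out.println(fetch(port, 100));  // read timeout shorter than the delay
            System.out.println(fetch(port, 5000)); // read timeout longer than the delay
        } finally {
            server.stop(0);
        }
    }
}
```

Note that the server keeps working after the client has timed out, which would be consistent with the backend call completing (or being logged as cancelled) some time after the application has already reported the error.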

I'll prepare a fix to make this timeout configurable.

jpassing commented 7 months ago

@danjamesmay, the master branch now contains a fix that lets you configure the timeouts for HTTP requests made to backend Google APIs.

Could you try adding something like the following to your service YAML?

        - name: BACKEND_READ_TIMEOUT
          value: '90'
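For readers wondering how a setting like this is typically consumed, the sketch below parses such an environment variable into a `Duration`. It is not the project's actual implementation: the class and method names are hypothetical, the unit is assumed to be seconds (inferred from the value `'90'`), and the fallback of 20 seconds simply mirrors the default read timeout mentioned earlier in the thread.

```java
import java.time.Duration;
import java.util.Map;

public class TimeoutSetting {
    // Assumption: fall back to the 20-second default discussed in this thread.
    static final Duration DEFAULT_READ_TIMEOUT = Duration.ofSeconds(20);

    /** Reads BACKEND_READ_TIMEOUT (assumed to be whole seconds) from the given environment. */
    public static Duration readTimeout(Map<String, String> env) {
        String raw = env.get("BACKEND_READ_TIMEOUT");
        if (raw == null || raw.isBlank()) {
            return DEFAULT_READ_TIMEOUT;
        }
        try {
            long seconds = Long.parseLong(raw.trim());
            // Ignore zero or negative values instead of disabling the timeout.
            return seconds > 0 ? Duration.ofSeconds(seconds) : DEFAULT_READ_TIMEOUT;
        } catch (NumberFormatException e) {
            return DEFAULT_READ_TIMEOUT;
        }
    }

    public static void main(String[] args) {
        System.out.println(readTimeout(System.getenv()).toSeconds() + " s");
    }
}
```

With the YAML above, `readTimeout(Map.of("BACKEND_READ_TIMEOUT", "90"))` would yield a 90-second timeout, comfortably above the 37 seconds reported later in the thread.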

danjamesmay commented 7 months ago

@jpassing thanks for the quick turnaround adding this, it's very much appreciated!

Both project and role listing is now successful, seems to be 37s for both in my case. Hopefully that's something the underlying api will improve upon as time goes on, but for now this is definitely a great improvement.

jpassing commented 7 months ago

> Both project and role listing is now successful,

Great, thanks for confirming. I suppose we can close this issue then.

danjamesmay commented 7 months ago

Yes, though I think we should possibly open a new issue around the latency problem. There must be a combination of factors that results in excessively high latency for these API calls, and it would be good to find and document what they are so users can avoid long waiting times.

EDIT: Just seen https://github.com/GoogleCloudPlatform/jit-access/issues/180 so all good!

GoogleCloudPlatform / jit-access

JIT HTTP 403 only for a certain group #177