EngineerBetter / concourse-up

Deprecated - use Control Tower instead
https://github.com/EngineerBetter/control-tower
Apache License 2.0

RDS filled up #35

Closed: archgrove closed this issue 6 years ago

archgrove commented 6 years ago

We use concourse-up to manage our Concourse. Our usage is, I believe, fairly mundane: a dozen or so pipelines, with a reasonable number of credentials managed by the bundled credhub. We have three teams (including main), authenticated via GitHub OAuth.

Our setup failed this morning. The symptoms were credhub-cli rejecting logins with “bad credentials”, and git resource checks failing. The git resource was complaining about Postgres disk space usage; alas, I did not keep the exact error.

Checking RDS, I found that the Postgres disk had indeed filled up - all 10 GB. I resized it to restore service, then tunnelled in and found the following database usage:

     name      |           owner           |   size    
---------------+---------------------------+-----------
 rdsadmin      | rdsadmin                  | No Access
 credhub       | adminby6djcbv1rdm3k63n7j7 | 8945 MB
 concourse_atc | adminby6djcbv1rdm3k63n7j7 | 181 MB
 bosh          | adminby6djcbv1rdm3k63n7j7 | 12 MB
 uaa           | adminby6djcbv1rdm3k63n7j7 | 8935 kB
 template1     | adminby6djcbv1rdm3k63n7j7 | 7343 kB
 template0     | rdsadmin                  | 7233 kB
 postgres      | adminby6djcbv1rdm3k63n7j7 | 7233 kB
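
For reference, a query along these lines (a reconstruction, not necessarily the exact query used) produces that per-database breakdown, reporting 'No Access' for databases the login role cannot connect to:

SELECT d.datname AS name,
       pg_get_userbyid(d.datdba) AS owner,
       CASE WHEN has_database_privilege(d.datname, 'CONNECT')
            THEN pg_size_pretty(pg_database_size(d.datname))
            ELSE 'No Access'
       END AS size
  FROM pg_database d
 ORDER BY d.datname;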

The relation sizes in the credhub database were:

               relation                |  size   
---------------------------------------+---------
 public.request_audit_record           | 5085 MB
 public.event_audit_record             | 2465 MB
 public.event_audit_record_pkey        | 693 MB
 public.request_audit_record_pkey      | 691 MB
 public.auth_failure_audit_record      | 832 kB
 pg_toast.pg_toast_2618                | 376 kB
 pg_toast.pg_toast_2619                | 72 kB
 public.encrypted_value                | 72 kB
…truncated…
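
For anyone reproducing this, a query along the following lines gives the same kind of per-relation view inside the credhub database (a sketch, not necessarily the exact query used):

SELECT n.nspname || '.' || c.relname AS relation,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
 ORDER BY pg_relation_size(c.oid) DESC
 LIMIT 20;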

I’m not a credhub expert. Things I guess might be useful in diagnosing this:

  1. select count (distinct uaa_url) from request_audit_record gives 1; the record is https://an_ip:8443/oauth/token
  2. select count(*) from request_audit_record; gives 17735751
  3. A random selection of the rows in request_audit_record gives entries similar to:


18c85002-8fae-4d7e-9aa4-bad4610f9e43 | 127.0.0.1 | 1516213545469 | /api/v1/data | 127.0.0.1 | 1516210813 | 1516214413 | https://an_ip:8443/oauth/token | | | | credhub.write,credhub.read | client_credentials | atc_to_credhub | GET | 200 | path=<73 characters redacted> | uaa

  4. select count(*) from event_audit_record gives 17740039
  5. select operation, count(*) from event_audit_record group by operation; gives
     operation     |  count  
-------------------+---------
 credential_update |     101
 credential_delete |      23
 credential_find   | 8872644
 acl_update        |     255
 credential_access | 8872480
  6. Records in event_audit_record have the form:

b2b1aa8a-10e1-4777-b742-07df841918fb | 7d6d796e-d391-496f-90bd-253ed2cc55c0 | 1516111765973 | credential_update | <redacted 73 characters of credential path> | uaa-user:94d61c71-12e4-42ce-9d59-03292aa2c382 | t

Evidently, something about our setup is causing an unexpectedly large number of credhub uses (perhaps the constant git polling?). I will leave the tables intact for a few days in case they are useful for further diagnostics, but will have to truncate them sooner rather than later.
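
For completeness, the eventual truncation would be a single statement along these lines (listing both tables in one TRUNCATE so that any foreign keys between them, if credhub defines any, don't block it); it obviously discards the audit history, which is exactly why I'm holding off for now:

TRUNCATE TABLE event_audit_record, request_audit_record;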

Let me know what I can do to help!

CC @jpluscplusm

archgrove commented 6 years ago

@jpluscplusm has pointed me at https://bosh.io/jobs/credhub?source=github.com/pivotal-cf/credhub-release&version=1.7.2#p=credhub.log_level, which seems likely to be useful.

danyoung commented 6 years ago

Hi Adam, thanks for raising this issue. We've also seen one of our own deployments suffer a Postgres storage issue and we're looking into it. There are also some other issues relating to 3.9.0 that are causing some headaches. Please stay tuned!

archgrove commented 6 years ago

Thanks @danyoung ! I've done some more credhub-level spelunking, and it looks like the audit logs are non-negotiable (they can't be turned off, or turned down). So in some ways, this feels like a credhub "issue".

In terms of concourse-up, a post-insert TRIGGER to GC old audit entries might be viable? Or if we're feeling feisty, an actual cron job for cleanup?
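
Either way, the core of the cleanup would be a periodic delete of old rows. A rough sketch, where created_at_ms is a stand-in for whichever epoch-millisecond timestamp column credhub actually uses (the third field in the sample rows above), and the 30-day window is an arbitrary choice:

-- delete event rows first, in case they reference request rows
DELETE FROM event_audit_record
 WHERE created_at_ms < (extract(epoch FROM now() - interval '30 days') * 1000)::bigint;

DELETE FROM request_audit_record
 WHERE created_at_ms < (extract(epoch FROM now() - interval '30 days') * 1000)::bigint;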

danyoung commented 6 years ago

@archgrove we have a bug to address this issue in concourse-up: https://www.pivotaltracker.com/story/show/155592413

danyoung commented 6 years ago

@archgrove Please try the latest release for a fix to this issue: https://github.com/EngineerBetter/concourse-up/releases/tag/0.8.3