crtsh / certwatch_db

Database schema
https://crt.sh/
GNU General Public License v3.0
194 stars 36 forks source link

High amount of SQL server timeouts #84

Closed RufusJWB closed 2 years ago

RufusJWB commented 2 years ago

Since mid of January I get more server timeouts than before when executing SQL statements on crt.sh. Did you re-configure something? Even simple statements like

SELECT DISTINCT
    ca_id,
    url,
    post_dump dump
FROM
    public.ocsp_responder
WHERE
    position(convert_to ('200 OK', 'UTF8') in post_dump) > 0

run into time outs.

RufusJWB commented 2 years ago

this is also reflected in errors on the web page: image

robstradling commented 2 years ago

@RufusJWB The replication (aka "recovery") to one of the replica databases that powers crt.sh:5432 had fallen several days behind. Last night I took this replica out of the pool, and since then (without the "burden" of user queries) it's been catching up. I'd estimate another 24hrs or so to go. I'm guessing that this is the cause of the "conflict with recovery" mentioned on that web page. We'll know soon!

RufusJWB commented 2 years ago

Thank you for clarifying! I am wondering if crt.sh database is pushed beyond the boundaries of postgreqsl. Did you ever think about migrating to a better scaling RDBMS? We have started to work with CockroachDB and had surprisingly good experiences with it. I am mentioning CockroachDB because they claim to be compatible with postgrsql.

robstradling commented 2 years ago

@RufusJWB I haven't looked at CockroachDB, but a quick read of https://devathon.com/blog/cockroachdb-vs-mysql-vs-postgresql-vs-mongodb-vs-cassandra/ suggests to me that PostgreSQL still has some advantages for crt.sh's use cases.

robstradling commented 2 years ago

@RufusJWB It wasn't the struggling replica that was causing the https://crt.sh/ocsp-responders load errors. I've just located the issue...

At some point, presumably quite recently, GTS has started putting a unique OCSP URL in each leaf certificate; and since https://crt.sh/ocsp-responders loops through every single one in its attempt to render the page, the sheer volume of data is causing it to time out.

certwatch@certwatch=> select count(*), ca.id, ca.name from ocsp_responder, ca where ca_id=ca.id group by ca.id order by count(*) desc limit 10;

┌─────────┬────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  count  │   id   │                                                            name                                                            │
├─────────┼────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1567677 │ 180754 │ C=US, O=Google Trust Services LLC, CN=GTS CA 1D4                                                                           │
│   39478 │ 180753 │ C=US, O=Google Trust Services LLC, CN=GTS CA 1P5                                                                           │
│     338 │ 180750 │ C=US, O=Google Trust Services LLC, CN=GTS CA 2A1                                                                           │
│     172 │     -1 │ Issuer Not Found                                                                                                           │
│      22 │   1597 │ CN=Apple IST CA 2 - G1, OU=Certification Authority, O=Apple Inc., C=US                                                     │
│      13 │   1613 │ CN=Apple Server Authentication CA, OU=Certification Authority, O=Apple Inc., C=US                                          │
│      12 │   1155 │ C=US, O=Apple Inc., OU=Apple Worldwide Developer Relations, CN=Apple Worldwide Developer Relations Certification Authority │
│       6 │      1 │ C=SE, O=AddTrust AB, OU=AddTrust External TTP Network, CN=AddTrust External CA Root                                        │
│       6 │  13797 │ CN=Developer ID Certification Authority, OU=Apple Certification Authority, O=Apple Inc., C=US                              │
│       6 │     36 │ C=BE, O=GlobalSign nv-sa, OU=Root CA, CN=GlobalSign Root CA                                                                │
└─────────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
robstradling commented 2 years ago

Hopefully @rmhrisk will set me straight if I've got this wrong...

It appears that these GTS CAs also operate a shared OCSP Responder endpoint in addition to the sharded responder URLs that are embedded in the certs. e.g., http://ocsp.pki.goog/s/gts1p5/ZzxHLs7BAJA (sharded), http://ocsp.pki.goog/gts1p5 (shared)

I propose to make crt.sh record/use the "shared" endpoints and ignore/discard the "sharded" ones.

RufusJWB commented 2 years ago

That's amazing! I'll later check if they have dedicated OCSP signer certificates per sharded endpoint.

robstradling commented 2 years ago

I haven't checked that, but from previous conversations I'm pretty sure GTS is direct-signing all OCSP responses in order to minimize response sizes.

robstradling commented 2 years ago

I propose to make crt.sh record/use the "shared" endpoints and ignore/discard the "sharded" ones.

Done. http://crt.sh/ocsp-responders now loads. :-)