CaliDog / certstream-server

Certificate Transparency Log aggregation, parsing, and streaming service written in Elixir
https://certstream.calidog.io
MIT License

Degradation of certificate stream after one week #8

Closed: artgl closed this issue 4 years ago

artgl commented 5 years ago

Just after certstream-server is started, the client retrieves up to 300 certs/sec. After one week of continuous operation, the client shows zero updates. This is what I see in the latest server logs:

`16:54:13.822 [info] Worker #PID<0.253.0> with url ct.googleapis.com/logs/xenon2022/ found 1 certificates [6978 -> 6979].
17:13:06.601 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3460 -> 3461].
17:13:07.499 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 11 certificates [67275 -> 67286].
17:13:07.736 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3703 -> 3704].
18:13:01.839 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 10 certificates [67286 -> 67296].
18:54:09.569 [info] Worker #PID<0.253.0> with url ct.googleapis.com/logs/xenon2022/ found 1 certificates [6979 -> 6980].
19:12:56.189 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 10 certificates [67296 -> 67306].
19:13:09.950 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3461 -> 3462].
20:12:59.092 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3462 -> 3463].
20:13:02.009 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3704 -> 3705].
20:13:05.944 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 10 certificates [67306 -> 67316].
20:54:05.769 [info] Worker #PID<0.253.0> with url ct.googleapis.com/logs/xenon2022/ found 2 certificates [6980 -> 6982].
21:13:00.290 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 11 certificates [67316 -> 67327].
22:13:02.392 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3463 -> 3464].
22:13:03.354 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3705 -> 3706].
22:13:09.709 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 11 certificates [67327 -> 67338].
22:54:01.630 [info] Worker #PID<0.253.0> with url ct.googleapis.com/logs/xenon2022/ found 1 certificates [6982 -> 6983].
23:12:56.454 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3706 -> 3707].
23:12:56.563 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3464 -> 3465].
23:13:04.060 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 10 certificates [67338 -> 67348].
00:12:58.315 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 11 certificates [67348 -> 67359].
00:13:04.800 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3707 -> 3708].`

It seems that most of the server threads that extract domains from the separate sources are dead, and only 4 threads are still functional.
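The count of four can be checked against the log excerpt itself. The following is a minimal Python sketch, not part of certstream-server, that parses the `Worker ... found N certificates [a -> b]` lines and totals the certificates per log URL. The regular expression and the `summarize` helper are assumptions made for illustration, based only on the line format quoted above. The sample contains the first five entries of the excerpt.

```python
import re
from collections import Counter

# Pattern for the "[info] Worker ... found N certificates [a -> b]" lines quoted above.
# It is not anchored to line starts, so it also works when entries run together.
LINE_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[info\] Worker #PID<(?P<pid>[\d.]+)> "
    r"with url (?P<url>\S+) found (?P<count>\d+) certificates "
    r"\[(?P<start>\d+) -> (?P<end>\d+)\]"
)


def summarize(log_text):
    """Return a Counter mapping each log URL to the certificates it reported."""
    per_url = Counter()
    for match in LINE_RE.finditer(log_text):
        per_url[match["url"]] += int(match["count"])
    return per_url


SAMPLE = """\
16:54:13.822 [info] Worker #PID<0.253.0> with url ct.googleapis.com/logs/xenon2022/ found 1 certificates [6978 -> 6979].
17:13:06.601 [info] Worker #PID<0.310.0> with url nessie2022.ct.digicert.com/log/ found 1 certificates [3460 -> 3461].
17:13:07.499 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 11 certificates [67275 -> 67286].
17:13:07.736 [info] Worker #PID<0.293.0> with url yeti2022.ct.digicert.com/log/ found 1 certificates [3703 -> 3704].
18:13:01.839 [info] Worker #PID<0.305.0> with url nessie2020.ct.digicert.com/log/ found 10 certificates [67286 -> 67296].
"""

summary = summarize(SAMPLE)
print(len(summary), "active log URLs,", sum(summary.values()), "certificates")
```

Running the same function over a full day of server output, and comparing the number of distinct URLs with the number of configured logs, would show how many workers have stopped reporting.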

This bug reproduces both on a remote machine with an old CentOS distribution and on my work machine running Ubuntu 18.04. Erlang version on both machines: `[{release,"Erlang/OTP","21","10.2.3", [{kernel,"6.2","/usr/lib/erlang/lib/kernel-6.2"}, {stdlib,"3.7","/usr/lib/erlang/lib/stdlib-3.7"}, {sasl,"3.3","/usr/lib/erlang/lib/sasl-3.3"}], permanent}]`.

lukyer commented 5 years ago

Same here. Even just consuming the "official" CaliDog websocket seems to show this issue (a very uneven event distribution, e.g. 1 cert per minute one day and a consistent 100 certs per second the next).

Fitblip commented 5 years ago

Hi there. It turns out that Heroku's daily dyno restart was masking an issue with the service: the supervisor tree was never fully initialized, so it wasn't prepared to restart things properly when errors occurred. The result was a slow, difficult-to-diagnose degradation in service.

I have since fixed this, both in master and at certstream.calidog.io. Please let me know if you experience further issues, and sorry for the breakage.
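The failure mode described above can be illustrated with a small, language-neutral sketch. This is Python, not Elixir, and it is not the certstream-server code or the OTP `Supervisor` API. `supervise` and `make_flaky_worker` are hypothetical names invented here to show the difference between a worker that is restarted after a crash and one that is not.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); on an exception, restart it up to max_restarts times.

    Returns (result, restarts). Re-raises once the restart budget is exhausted.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1


def make_flaky_worker(failures):
    """Hypothetical worker that raises `failures` times, then succeeds."""
    state = {"left": failures}

    def worker():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("log fetch failed")
        return "fetched"

    return worker


# Supervised: two crashes are absorbed and the worker ends up running.
print(supervise(make_flaky_worker(2)))

# Unsupervised: the first crash is final, so workers are lost one by one.
try:
    make_flaky_worker(2)()
except ConnectionError as exc:
    print("worker died:", exc)
```

Without the restart loop, every transient error permanently removes one worker, which is consistent with a stream that degrades gradually over a week. A daily process restart would hide this by bringing all workers back each day.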

aidansteele commented 4 years ago

@Fitblip I'm not sure if it's related to this issue, but I haven't seen any certificates come through the official websocket for about the last 90 minutes. I've confirmed it's not my iffy code by running the official CLI tool and checking the website - both have the same behaviour.

Is it a dyno thing again?

Fitblip commented 4 years ago

Howdy @aidansteele - this is actually an old GitHub issue that I forgot to close, and it is unrelated to the pipeline being down.

We have an issue with our provider currently that we're working through (there's been a good deal of flapping in the past few days, so the pipeline actually going down was ignored due to the false alerts - sorry for the downtime).

aidansteele commented 4 years ago

No problem at all! It’s a really amazing service, thank you for producing it and hosting it 👍