SSLMate / certspotter

Certificate Transparency Log Monitor
https://sslmate.com/certspotter
Mozilla Public License 2.0
962 stars 83 forks source link

Another instance is already running -- Certspotter cannot complete it's task #35

Closed Tronde closed 4 years ago

Tronde commented 5 years ago

Summary

Certspotter is started via cron job but the job never comes to an end. When cron tries to start the next job as scheduled this fails because another instance of certspotter is already running and the lock file prevents the start of another instance.

When the running process is stopped (kill PID) and the lock file has been removed the problem reoccurs after the next time certspotter was started. So a process is started and cannot complete next process is prevented from starting.

Environment

Current status

In this section I try to provide complete information about the current process status.

Crontab entry

42 7 * * * /home/tronde/go/bin/certspotter

Related processes

tronde@host:~$ ps -fp 20273 -p 20274 -p 20275
UID        PID  PPID  C STIME TTY          TIME CMD
root     20273   691  0 Oct06 ?        00:00:00 /usr/sbin/CRON -f
tronde 20274 20273  0 Oct06 ?        00:00:00 /bin/sh -c /home/tronde/go/bin/certspotter
tronde 20275 20274  8 Oct06 ?        02:03:58 /home/tronde/go/bin/certspotter
tronde@host:~$

Open files

tronde@host:~$ sudo lsof -p 20275
COMMAND     PID      USER   FD      TYPE   DEVICE SIZE/OFF      NODE NAME
certspott 20275 tronde  cwd       DIR      9,2     4096  20709378 /home/tronde
certspott 20275 tronde  rtd       DIR      9,2     4096         2 /
certspott 20275 tronde  txt       REG      9,2  7173364  57409601 /home/tronde/go/bin/certspotter
certspott 20275 tronde  mem       REG      9,2  1824496 117712893 /usr/lib/x86_64-linux-gnu/libc-2.28.so
certspott 20275 tronde  mem       REG      9,2   146968 117712993 /usr/lib/x86_64-linux-gnu/libpthread-2.28.so
certspott 20275 tronde  mem       REG      9,2   165632 117712835 /usr/lib/x86_64-linux-gnu/ld-2.28.so
certspott 20275 tronde    0r     FIFO     0,12      0t0  32416847 pipe
certspott 20275 tronde    1u      REG      9,2        0  57409560 /tmp/#57409560 (deleted)
certspott 20275 tronde    2u      REG      9,2        0  57409560 /tmp/#57409560 (deleted)
certspott 20275 tronde    3r      REG      9,2       16  20709927 /home/tronde/.certspotter/watchlist
certspott 20275 tronde    4u  a_inode     0,13        0      9331 [eventpoll]
certspott 20275 tronde    5u     IPv6 33299581      0t0       TCP host:36266->fra16s20-in-x0a.1e100.net:https (ESTABLISHED)
tronde@host:~$

Strace of PID

sudo strace -p 20275
strace: Process 20275 attached
futex(0x919aa0, FUTEX_WAIT_PRIVATE, 0, NULL
^Cstrace: Process 20275 detached
 <detached ...>

Conclusion

Strace was attached to the PID for round about 1 Minute. It seems that the process got stuck in futex for any reason and therefor cannot complete.

My Request

I would like to ask you if you could assist in troubleshooting this issue to find out if this is a bug or just a configuration issue on my machine. I did not have this kind of issue running Ubuntu 18.04 as OS. I see this behavior the first time since I started certspotter on Buster.

If you need any additional information please don't hesitate to ask. I'll do my best helping to solve this issue.

Tronde commented 4 years ago

Good Evening,
I've opened this issue one week ago. I would kindly ask if you already had the chance to review my report.

I would appreciate it, if you could find the time to look into it.

Thanks in advance for your help.

Regards,
Tronde

AGWA commented 4 years ago

certspotter is multi-threaded so to get a meaningful strace you need to include the -f option. Please re-run the strace with -f to see if any of the other threads are making progress.

You can also pass the -verbose option to certspotter to see what it is doing.

Tronde commented 4 years ago

Hi, and thanks for your answer.

So at first I straced the existing process with the -f option included. The related process structure shown by ps is:

tronde@host:~$ ps -fp 20273 -p 20274 -p 20275
UID        PID  PPID  C STIME TTY          TIME CMD
root     20273   691  0 Oct06 ?        00:00:00 /usr/sbin/CRON -f
tronde 20274 20273  0 Oct06 ?        00:00:00 /bin/sh -c /home/trondeng/go/bin/certspotter
tronde 20275 20274  4 Oct06 ?        12:08:48 /home/trondeng/go/bin/certspotter

You could find the strace output at https://gist.github.com/Tronde/8ca1b99535cf4bdb86ee3d905cb0c559

Please see the comment there which shows how the information were gathered.

In the next step I'm going to kill the process and start certspotter with the -verbose option. I'll come back to you when I have any results.

AGWA commented 4 years ago

Thanks for providing the additional information. The strace shows that certspotter is reading from file descriptor 5, which is a socket to a CT log.

CT logs have unfortunately grown extremely large over the last two years which means monitoring can take a very long time. If you let certspotter keep running, it will catch up eventually and terminate.

Alternatively, if you delete ~/.certspotter, then on the next run certspotter will start monitoring from the end of each log instead of from its previous position. If you keep running certspotter regularly, it should keep up. The downside is that you will miss certificates that were logged between the previous position and the new end of each log.

You may want to consider using the commercial offering of Cert Spotter instead. I made certspotter open source because I wanted it to be easy for individuals who aren't CT experts to monitor logs. Unfortunately, given the current scale of the CT ecosystem I'm not sure this is still a viable vision.

Tronde commented 4 years ago

I followed your suggestion and deleted ~/.certspotter. After recreating the watchlist the next run did finish after a short amount of time. I'll look into it, to see if it keeps that way the next few days.

You also suggested to use the commercial offering, but how does that work different from the version on my host? I did not see any bottleneck on my machine. Data from the first CT log came in with 4-5 Mbps and the host was close to idle.

I would like to understand what's the difference to the commercial service that makes it work faster. Could you explain it to me, please?

Anyway thanks for this great project. I use it to monitor the certificate for my private domain and really appreciate what you did here.

frh commented 4 years ago

I run certspotter once every 24 hours, and when I look at the runtime each run takes 7-21 hours, with around 14 hours being the median. There's still some room before the runtime becomes too long to complete. I don't know what the current bottleneck is. I'm guessing it could be sped significantly up by downloading each log in parallel.

Tronde commented 4 years ago

I would agree that it could speed things up when downloading all CT logs in paralell. Or at least some of them up to a certain number of logs.

@AGWA Would that be possible? I that a change you could implement in an upcoming release?

bllfr0g commented 4 years ago

I just submitted a PR that downloads the logs in parallel.

Note that I did not scrutinize the whole code base to ensure it is thread-safe; my change seems to work, but no guarantees.

AGWA commented 4 years ago

I've implemented parallel log processing in 86785d89d7f91c935b426e351f1fe38544476ce3.

I don't think there's anything more to be done with this issue, so I'm closing it.

Tronde commented 4 years ago

Hi @AGWA,
Thanks a lot for implementing the parallel log processing.

Do you already know when you are going to release version 0.10?