cactus / go-camo

A secure image proxy server
MIT License

Eventually stops working #6

Closed: peppy closed this issue 10 years ago

peppy commented 10 years ago

Hi again! Performance is great, but occasionally the server will just stop serving images, giving "Error Fetching Resource" errors. RAM usage also seems to increase gradually in this state. Restarting the server fixes it, and this only happens after 12-24 hours of serving:

ClientsServed, BytesServed
825656, 28116134997

Here's the information I gathered:

Console output before I restarted: https://gist.github.com/586b121c3fba40d8796b
Memory usage vs traffic throughput: http://puu.sh/968Og/f48ce431fb.png
Response headers of a failed request in this state: http://puu.sh/968QM/e0fb98fece.png

dropwhile commented 10 years ago

My offhand guess is that keepalives on the go-camo-to-internet connections are causing some issues. I will try to update the no-keepalive branch later today and see if that helps.

If it happens again in the meantime, can you capture some additional process info?

lsof -nP -p <pid of go-camo> | grep -c TCP    # count of open TCP sockets
pmap -d <pid of go-camo> | tail -n 1          # mapped/resident memory summary

Thanks for the report!

dropwhile commented 10 years ago

@peppy I just made a branch "no-backend-keepalives". You could give that a try and see if it helps both the memory consumption and the "lock up".

That branch disables keepalives on the backend (so from go-camo to imgur, for example), and disables compression as well. Since these are images, compression doesn't help and likely isn't being used anyway; in the few cases where it might apply (SVG, for example), the benefit would be small.
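
For reference, this is roughly the knob involved; a minimal sketch using the standard net/http Transport options, not the branch's actual code (the URL and timeout below are placeholders):

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Sketch: backend fetches with keepalives and compression disabled,
        // as the no-backend-keepalives branch is described above.
        tr := &http.Transport{
            DisableKeepAlives:  true, // close each backend connection after the fetch
            DisableCompression: true, // images are already compressed; gzip buys little
        }
        client := &http.Client{Transport: tr, Timeout: 10 * time.Second} // illustrative timeout

        resp, err := client.Get("http://example.com/image.png") // placeholder URL
        if err != nil {
            fmt.Println("fetch failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }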

Let me know how it goes, and if you do run into issues, please try to collect as much data surrounding the "lock up" as you can.

Thanks!

peppy commented 10 years ago

I managed to hit the same issue in that branch, so we can rule that one out. Unfortunately it happened while I was half asleep and I didn't have the energy to look into things before restarting the server. Next time it happens I'll see what I can find.

dropwhile commented 10 years ago

Something else you can do: if you send the process a SIGUSR1 (kill -USR1 <go-camo-pid>), it will turn on debug logging. An additional SIGUSR1 will turn it off. You could try turning on debug mode for a second or two the next time it locks up, before you restart it, and see if the additional logging in the bad state sheds any light on the issue.
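
The general pattern looks something like this; a minimal sketch of toggling a debug flag on SIGUSR1, not go-camo's actual implementation:

    package main

    import (
        "log"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
        "time"
    )

    func main() {
        var debug int32 // even = off, odd = on

        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGUSR1)
        go func() {
            for range sigs {
                // Each SIGUSR1 flips the debug flag.
                if atomic.AddInt32(&debug, 1)%2 == 1 {
                    log.Println("debug logging enabled")
                } else {
                    log.Println("debug logging disabled")
                }
            }
        }()

        for {
            if atomic.LoadInt32(&debug)%2 == 1 {
                log.Println("debug: still serving") // stand-in for per-request debug logs
            }
            time.Sleep(2 * time.Second)
        }
    }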

peppy commented 10 years ago

The recurring theme in debug output was:

2014/06/01 11:46:46 Could not connect to endpoint Get http://s03.flagcounter.com/count/5gIU/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_20/viewers_0/labels_0/pageviews_0/flags_0/: dial tcp: i/o timeout
2014/06/01 11:46:46 Could not connect to endpoint Get http://puu.sh/5icHV/a7c41e408f.png: dial tcp: i/o timeout
2014/06/01 11:46:47 Could not connect to endpoint Get http://puu.sh/7XrLw.jpg: dial tcp: i/o timeout

I have a feeling there may be some kind of connection leak. I'll provide more details next time; stupidly, I didn't run netstat this time around (although lsof was sitting at around 32k entries).
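
In case it helps capture data next time, here's a minimal sketch of a sidecar that periodically counts a process's open file descriptors via /proc (Linux only; this is just an illustration, not part of go-camo):

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Println("usage: fdwatch <pid>")
            os.Exit(1)
        }
        pid := os.Args[1] // pid of the process to watch, e.g. go-camo

        for {
            // Each entry under /proc/<pid>/fd is one open file descriptor.
            entries, err := os.ReadDir("/proc/" + pid + "/fd")
            if err != nil {
                fmt.Println("read fd dir:", err)
                os.Exit(1)
            }
            fmt.Printf("%s open fds: %d\n", time.Now().Format(time.RFC3339), len(entries))
            time.Sleep(30 * time.Second)
        }
    }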

dropwhile commented 10 years ago

Hmm. I wonder if there is something goofy happening with the DNS resolver (see old issue). If you are interested, I could build a version with -tags netgo (it requires the Go standard library to be recompiled as well). With -tags netgo, Go will use an internal resolver for DNS queries instead of spinning up a thread pool and using libc for resolution.

$ go install -a -tags netgo std # as root (or user that owns go install)
$ make build # go-camo builds with -tags netgo by default, but it only means anything if the above step was done

I can make a build from HEAD on x86_64 linux (centos 6.5) as above, if you want to try it without going through the trouble of mucking about with your go install.

peppy commented 10 years ago

Okay, I've performed the above steps. Is there any way to confirm it's running the modified version?

dropwhile commented 10 years ago

If you are using netgo (go native dns resolution) you should get similar output to the following:

 $ ldd go-camo 
      not a dynamic executable

If you are still using libc resolution, you will instead see something like this:

$ ldd go-camo 
    linux-vdso.so.1 =>  (0x00007fff98fff000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003fb2a00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003fb2600000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003fb2200000)

peppy commented 10 years ago

Looking good :). We shall talk again in the near future.

dropwhile commented 10 years ago

good luck!

peppy commented 10 years ago

Things have been running very smoothly since the above patch. I'll let it go for another 24 hours to confirm the issue is actually fixed, but it's looking hopeful.

dropwhile commented 10 years ago

That's great news! Thanks for the update.

peppy commented 10 years ago

Red areas in the graph are previous failures.

ClientsServed, BytesServed
2057453, 63810286707

60 GB served, still going strong. I think this one is "solved".

You'd know better than me, but it seems like there is some kind of handle leak for DNS lookups? For the time being I'll leave this running with the native dns resolution (so this is a non-issue for now), but if you want to try and look into this further I'm happy to help run any tests should that be necessary.

I'd look into exactly why this is happening myself, but don't have the time :(. Thanks again for following up on this issue!

dropwhile commented 10 years ago

After a bit more research on the Go mailing list, I found this:

.. on most linux implementations, libc's dns resolver library uses select internally, and fails once the first 1024 file descriptors are in use. -- https://groups.google.com/d/msg/golang-nuts/KT2GBiyUzR0/1WXqRhhozYsJ

I am going to add some info to the readme about this issue, noting that the use of netgo should be preferred.
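
As an aside, later Go releases also expose this choice at runtime via the net.Resolver type; a minimal sketch of forcing the pure-Go resolver for a single lookup (example.com is a placeholder host), roughly the runtime counterpart of building with -tags netgo:

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        // PreferGo asks for the pure-Go resolver, avoiding libc's
        // select()-based lookup path and its 1024-fd limitation.
        r := &net.Resolver{PreferGo: true}

        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        addrs, err := r.LookupHost(ctx, "example.com") // placeholder host
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Println(addrs)
    }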

Thanks for all your help in tracking it down @peppy. :)

dropwhile commented 10 years ago

closing as resolved.