Kozea / Radicale

A simple CalDAV (calendar) and CardDAV (contact) server.
https://radicale.org
GNU General Public License v3.0

Radicale repeatedly hanging / locking #266

Closed TheFabe closed 7 years ago

TheFabe commented 9 years ago

I am running radicale 0.10, and the radicale process repeatedly hangs after varying lengths of time. I have clients on iOS, Android (davdroid), and Thunderbird with Lightning accessing it for calendars and address books. After some time radicale stops responding, and clients report problems when trying to sync. When radicale is "hanging", sudo lsof -c radicale -a -i lists not only the listening socket on my radicale port but also a socket in state ESTABLISHED. pkill radicale does not help; only pkill -9 radicale works. Because I have enabled port forwarding from my router to my raspi, in several cases the foreign address on that socket was an IP that probably belonged to one of "my" devices when it was last "in the wild" (my GSM provider's address space). The iOS and Android devices access radicale both from my internal WLAN and from the Internet.

Can there be a problem with radicale not "detecting" a client "just dropping dead"?

Is it correct that the "timeout" is disabled in this simple setup using "daemon = True" because of serve_forever?

Can I set a special timeout for the TCP-level or ssl-negotiations ?

radicale is running as "standalone" with daemon = True in the config file and is launched in rc.local. I use radicale on a raspi under Raspbian, and to save resources I have not set up a "real" HTTP server like nginx or apache. I have found the following message http://librelist.com/browser//radicale/2014/3/2/bug-in-radicale-0-8-locks-up/ and I am wondering whether changing to nginx will really help me.

liZe commented 9 years ago

I have found the following message http://librelist.com/browser//radicale/2014/3/2/bug-in-radicale-0-8-locks-up/ and I am wondering whether changing to nginx will really help me.

Yes. I've never found why the Python server hangs, but I'm sure that the problem doesn't exist with a "real" HTTP server. By the way, I'd be really happy to know what's the cause of the problem, and if there's a way to fix it.

hieronymousch commented 9 years ago

Same here when running the python server. No indications in the logfiles why it hangs. The PID file remains and the process still appears to be running.

TheFabe commented 9 years ago

My current guess is that the problem is linked to radicale being reachable from the internet. I assume that radicale can only handle one request after another, because it starts a single python thread to handle the requests? Can someone confirm this?

If this is the case, then because radicale uses ssl.wrap_socket and has removed the timeout, there could be a hang: someone (a script kiddie?) connects to my radicale, opens a TCP session, and then silently "drops dead" at some stage of the SSL handshake or key exchange, and the "simple" ssl.wrap_socket waits forever for the handshake to complete before passing the "core" request to radicale?

This could explain why I see an open session to some strange IP in lsof....

Can you comment on this idea?

TheFabe commented 9 years ago

I removed the port forwarding from my router two days ago, and my radicale seems more stable. This could be because only "well-behaved" clients now connect to it via the LAN and WLAN.

@hieronymousch: Have you also exposed your radicale to the internet?

hieronymousch commented 9 years ago

Hi,

Mine was also connected to the Internet. I put it now behind Apache, hope this will make it more stable.


halo commented 9 years ago

My current guess it that the problem is linked to the availability of radicale to the internet.

Mine is solely used on the Intranet (without SSL just for testing) and I have the same issue. I'm running it on a Raspberry Pi 2 using Minibian and it happens when I run radicale in the foreground. I will go for nginx now to see if it helps.

untitaker commented 9 years ago

I think this is because the server is singlethreaded and Radicale is hanging itself up on connections which might have been kept open by the client (intentionally for reuse).

deronnax commented 9 years ago

Radicale behind nginx + uwsgi. No hanging problem.

svrnwnsch commented 9 years ago

I am having the same problem. How can I safely stop the radicale daemon so that I can restart it every day?

daks commented 9 years ago

Radicale behind nginx+gunicorn (managed by supervisord). No problem.

tmst commented 9 years ago

I've been seeing this for some time, now. At first I suspected it had to do with the log file growing without bounds, but then I noticed that the server would be unresponsive even with a newly rotated logfile. I'm not sure how the logfile problem might have affected the server, but I do know that it brought the entire system to a halt when the disk space ran out.

Anyway, it would be good to know what's going on, here. I'd rather not configure another server (nginx) if I don't have to. If it's the case that a session left open is stopping this, could it be closed after a certain time? It probably wouldn't work to kill any existing thread when a client tries to connect, as it might kill an active transfer (?) and leave the server or client in an undesirable state.

So the solution would be to close any session left open longer than a certain time as measured from the initial connection time. I can't think of any other way to do that than by forking another process that has a handle to the thread and kills it at a certain time and then exits. Any interesting ideas?

halo commented 9 years ago

I gather from the comments that nobody using a webserver has the problem. But is it consistent behavior that spinning up the python daemon causes the hang every time with two clients even without SSL? Does it only happen on Raspberry PIs?

I really wish that we could achieve what the website says:

Works out-of-the-box, no installation nor configuration required

Because it's not everyone's cup of tea to setup uwsgi etc. Especially if people just want to quickly try it out to see how robust it is compared to similar tools programmed in other languages (which was how I ran into the problem).

EDIT @tmst The disk space issue is known from #106

tmst commented 9 years ago

@halo Yes, the dreaded "IOError: [Errno 28] No space left on device". I think I finally figured out how to configure logrotate so it just truncates the logfile in place and doesn't require restarting the server. Something about the restart was wrong as I wasn't sure how to get the sudo command correct.

chris5560 commented 9 years ago

Here is my logging config, so that Radicale rotates its logs by itself:

[loggers]
keys = root

[handlers]
keys = console,file,syslog

[formatters]
keys = simple,full,syslog

[logger_root]
level = DEBUG
handlers = console,file,syslog

[handler_console]
level = WARNING
class = StreamHandler
args = (sys.stdout,)
formatter = simple

# HERE IS THE MAGIC: rotate at 32 KiB (32768 bytes) and keep 3 backup files
[handler_file]
args = ('/var/log/radicale/radicale','a',32768,3)
level = INFO
class = handlers.RotatingFileHandler
formatter = full

[handler_syslog]
level = WARNING
class = handlers.SysLogHandler
args  = ('/dev/log', handlers.SysLogHandler.LOG_DAEMON)
formatter = syslog

[formatter_simple]
format = %(message)s

[formatter_full]
format = %(asctime)s - %(levelname)s: %(message)s

[formatter_syslog]
format = radicale [%(process)d]: %(message)s

Better than using logrotate.

TheFabe commented 9 years ago

I am quite sure the problem is not really caused by the "not enough space" issue, even if that will also bring processing to a halt. I repeatedly experienced the hanging although I have never had any "out of space" issues. Since I moved my radicale "behind" an nginx server I have never had this problem again. I still restart my radicale every few days, so that I can make sure no changes are lost while I make a backup. This stopping and starting has never caused any issues. The "difficult part" of setting up nginx was creating the "radicale" site in the sites-enabled directory. I created a specific passwd file for this and added several "location" directives so that I can set up a different auth_basic "realm" for authentication.

Snippet from the radicale file in /etc/nginx/sites-enabled:

server {
    listen 12345; ## listen for ipv4; this line is default and implied (port I use for clients with SSL)

    ssl on;
    # I share my certificate with my dovecot running in the same raspi host
    ssl_certificate /etc/dovecot/dovecot.pem;
    ssl_certificate_key /etc/dovecot/private/dovecot.pem;

    ssl_session_timeout 5m;

    ssl_protocols SSLv3 TLSv1;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv3:+EXP;
    ssl_prefer_server_ciphers on;

    root /usr/share/nginx/www;
    index index.html index.htm;

    # Make site accessible from http://localhost/
    server_name localhost;

    location / {
            auth_basic "Radicale - Password Required";
            auth_basic_user_file radicale.passwd;
            proxy_pass http://localhost:5432;
            # This is the port radicale will "listen" on internally
    }
    location /user1 {
            auth_basic "Radicale for user1";
            auth_basic_user_file radicale.passwd;
            proxy_pass http://localhost:5432;
    }
    location /user2 {
            auth_basic "Radicale for user2";
            auth_basic_user_file radicale.passwd;
            proxy_pass http://localhost:5232;
    }

}

I only have a handful of users, and one "shared" user with well-known credentials.

TheFabe commented 9 years ago

Sorry, closing this was not what I wanted...

tmst commented 9 years ago

On Tuesday 14 July 2015 01:50:23 AM Christian Schoenebeck wrote:

Here my logging config so Radicale do logrotate by itself:

Interesting. I would have never guessed: args = ('/var/log/radicale/radicale','a',32768,3)

How did you determine the semantics "()"?

chris5560 commented 9 years ago

Trial and error ;-)
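For anyone wondering about those args: logging.config.fileConfig evaluates the args entry as a Python tuple and passes it positionally to the handler class, so the [handler_file] section above corresponds to a plain RotatingFileHandler(filename, mode, maxBytes, backupCount) call. A minimal sketch of the equivalent in Python follows; build_rotating_handler is only an illustrative name and not something Radicale provides.

```python
import logging
import logging.handlers


def build_rotating_handler(path):
    """Hypothetical helper mirroring the [handler_file] section of the config."""
    handler = logging.handlers.RotatingFileHandler(
        path,            # '/var/log/radicale/radicale' in the config
        mode="a",        # append to the current log file
        maxBytes=32768,  # roll over before the file would exceed 32 KiB
        backupCount=3,   # keep path.1, path.2 and path.3 next to the active file
    )
    handler.setLevel(logging.INFO)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s: %(message)s")
    )
    return handler
```

Note that backupCount counts the rotated copies only, so with these values up to four files exist on disk: the active log plus three backups.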

halo commented 9 years ago

Please keep the logfile discussion in the logfile issue; people will look for answers over there, not here :)

So nobody could confirm my questions about a pattern for when the locking occurs? Does it always lock everywhere as soon as multiple clients are involved and radicale is not behind a proxy?

If yes, then the built-in Python server is useless (but I suspect more people would have this problem if it always occurred). If the answer is no, then I'm wondering whether it could be related to Raspbian/Minibian. Maybe Python is too old there?

(And I don't mean the out-of-disk issue. I would almost expect it to hang in that case, plus it has been confirmed that this is not related to the hanging described in this thread)

tmst commented 9 years ago

@halo Sorry. I haven't discovered a pattern for the hanging. I'm connecting from several devices (Android running CalDAV-sync, KDE, and iCal on Mac), both over the local network and the Internet using a no-ip hostname updated by my DSL modem's software.

When it all works, it's sweet. But there are way too many points of potential failure.

halo commented 9 years ago

Well, I'm currently spawning Radicale with Passenger (nginx) and didn't experience any hanging yet. But what I'm really after (given we cannot figure out any pattern) is whether the out-of-the-box server works totally problem-free at all for someone?

If it does, we're just unlucky (and more unlucky people might follow our lead, hah). Plus I'd like to hear one lucky person speak up and confirm that it does work at all times :)

If it does not, one might even consider ripping it out of Radicale (think how simple the configuration would be without TLS) and everyone should be forced to set up something like uwsgi to increase general reliability. At the least, the documentation should make super clear that that's not a working setup.

liZe commented 9 years ago

I gather from the comments that nobody using a webserver has the problem. But is it consistent behavior that spinning up the python daemon causes the hang every time with two clients even without SSL? Does it only happen on Raspberry PIs?

From my long experience using Radicale:

A lot of people have been complaining about this Python HTTP server hanging, with Radicale but with other projects too. I tried hard to fix this bug, but unfortunately I didn't even find out how to reproduce it.

I'd be really happy to see this bug fixed, but I personally gave up a long time ago.

I really wish that we could achieve what the website says:

Works out-of-the-box, no installation nor configuration required

Because it's not everyone's cup of tea to setup uwsgi etc. Especially if people just want to quickly try it out to see how robust it is compared to similar tools programmed in other languages (which was how I ran into the problem).

If anyone needs help, I'll be happy to give hints.

daks commented 9 years ago

My point of view (but I may be wrong) is that it should be behind a dedicated web server.

I first tested Radicale with the built-in server, with a few clients and configurations. But once I wanted to put it in 'production' (for family use only) I automatically put it behind nginx+gunicorn (as I do for other Python apps like Flask or Django). Admittedly, I did it because I know how to, which is not the case for every potential Radicale user.

Maybe the documentation should indicate that for testing it's ok to run the process alone but for real use the setup is more complex. It could also be possible to include some files (wsgi, gunicorn...) to ease deployment.

I can share my setup if needed.

flusi100 commented 8 years ago

I was playing with radicale and put it in production for my family with the default server. I only used it in my private network, so security was not an issue. After hours of wondering what was going on (the logs don't help here), I found this bug. I regret trusting the statement on the front page mentioned above: easy to install. Now I have to bother with nginx and wsgi, which I wanted to avoid for my setup. After some hours of searching the web for a working configuration (for the widespread Ubuntu server), I still have no working setup. Really frustrating. For me it is now getting very complicated, and I am thinking of using another solution, because bringing up the feature-rich sogo was easier...

deronnax commented 8 years ago

Hi. I'm not related to Radicale. I am very sorry for what you're experiencing. You can replace Nginx and uwsgi with Gunicorn, which is much, much easier to configure (it almost works out of the box).

@liZe: that hanging problem seems to be a really big problem that a lot of people encounter. Maybe the Radicale web page should recommend a default deployment using Gunicorn. What do you think?

deronnax commented 8 years ago

Maybe we (me not included, since I'm not working for the next 6 months) should try to write a test that stresses the server heavily and reproduces the hang. Having a 100% reproducible case would be a huge step forward in fixing this.

untitaker commented 8 years ago

I think this issue would/will be solved when Radicale's backends become free of data races. Then you can just use gunicorn or whatever.

liZe commented 8 years ago

Having a way to reproduce the bug IS the real solution, but I've already spent hours (days?) trying to find why it hangs. Does anybody know a way to find what's going on when it hangs?

untitaker commented 8 years ago

@deronnax You must not just replace the builtin server with Gunicorn, this leads to data races! See #276.

My guess is that some client leaves a connection open and therefore blocks further requests. Not really a bug anywhere IMO.

EDIT: Here's a script that will hang up the server for a few hours: https://gist.github.com/untitaker/748b852906245c40c473
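The gist itself is not reproduced here. As a rough sketch of the idea described above (a client that connects and then stays silent), something like the following should be enough to occupy a single-threaded server for as long as the connection is held. The function name and the example port 5232 are assumptions for illustration.

```python
import socket
import time


def hold_connection_open(host, port, seconds):
    """Open a TCP connection, send nothing, and keep it open for `seconds`."""
    with socket.create_connection((host, port)) as conn:
        # No request is ever written, so a server that reads without a
        # timeout stays blocked on this connection until it is closed.
        time.sleep(seconds)


# Example (assumed host and port): hold_connection_open("localhost", 5232, 3600)
```

While this runs, any other client of a single-threaded server without read timeouts would be expected to wait until the sleep ends and the socket is closed.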

flusi100 commented 8 years ago

I really would like to stay with radicale, but this bug makes it hard for me (and most probably for many others). If it cannot be fixed, maybe a readme.txt for typical setups would help. Other projects could be a role model.

danmcd commented 8 years ago

If you run it on illumos (or even Oracle Solaris) the pstack(1) command is your new best friend ("pstack $(pgrep -f radicale)"). Mine locks, and is NOT open to the public Internet. It hangs in read() on a socket. If a device (typically phone, sometimes laptop) goes off the net before radicale thinks it's done, it just hangs in read() forever.

I currently hack around this with a cron(1) job that checks pstack every five minutes, and if it's in read(), it kills -9 radicale. I do not run it behind a real web server, but I suspect the read() hang has to do with the Python web server, as others have noted. Next time I get a hit from my cron jobs, I'll share the pstack output here.
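For systems without pstack, a cruder probe could stand in for the check: send a trivial request and see whether any byte comes back within a deadline. This is a different technique from inspecting the stack, and only a sketch; is_responsive and its default timeout are made-up names and values, and what to do on failure (for example kill -9 and restart) is left to the cron job.

```python
import socket


def is_responsive(host, port, timeout=5.0):
    """Return True if the server sends back at least one byte within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"OPTIONS / HTTP/1.0\r\n\r\n")
            # Any reply at all (even a 401) shows the server is still
            # reading requests; a hung server never answers.
            return bool(conn.recv(1))
    except OSError:  # refused, reset, or timed out
        return False
```

A hung single-threaded server typically still accepts the TCP connection in the kernel backlog, which is why the probe waits for a reply instead of only checking that the connect succeeds.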

untitaker commented 8 years ago

What would be interesting is: if Radicale hangs, what happens if you shut down (really, shut down) all possible clients? This would help identify whether Radicale hangs on a non-existent connection, or whether clients leave one open for further use (which does hang Radicale, and is already a known issue).

danmcd commented 8 years ago

ALL clients? The locks I'm seeing are on a read to a specific open connection. I can imagine shutting down (as in powering off) the client in question will send FIN or RST if its side of the connection exists and is open, but that seems like an unnecessary burden on the client(s).

untitaker commented 8 years ago

@danmcd If it does close the connection properly, we can conclude that neither side has bugs:

liZe commented 8 years ago

@deronnax You must not just replace the builtin server with Gunicorn, this leads to data races! See #276.

That's true, tried here, no problem during one year, and then suddenly one calendar disappeared from the disk. When you read the code and see how write works, it's really simple to understand how a race condition can happen.

If it does close the connection properly, we can conclude that neither side has bugs.

True. We could set a configurable timeout on the socket as a quick and dirty solution, what do you think?

danmcd commented 8 years ago

A modest timeout on read() would be most useful, I think.
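A rough sketch of what such a timeout could look like with the standard library's wsgiref server (not Radicale's actual code): socketserver.StreamRequestHandler applies a class-level timeout attribute to each accepted socket, so a subclass of the WSGI request handler can set it. demo_app, TimeoutRequestHandler, make_timeout_server and the 10-second value are all assumptions for illustration.

```python
import socket
from wsgiref.simple_server import WSGIRequestHandler, make_server


def demo_app(environ, start_response):
    """Placeholder WSGI app; the real application object would go here."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


class TimeoutRequestHandler(WSGIRequestHandler):
    # StreamRequestHandler.setup() calls settimeout() with this value
    # on every accepted connection.
    timeout = 10  # seconds; an arbitrary example value

    def handle(self):
        try:
            super().handle()
        except socket.timeout:
            # The client went silent; return so the connection is closed
            # and the server can move on to the next one.
            pass


def make_timeout_server(host, port, app):
    """Build a single-threaded WSGI server whose reads cannot block forever."""
    return make_server(host, port, app, handler_class=TimeoutRequestHandler)
```

With this in place a silent client can still delay others by up to the timeout, since the server remains single-threaded, but it can no longer block them indefinitely.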

untitaker commented 8 years ago

@liZe We could also make Radicale free of data races. E.g. we could have a lock for the complete server, and acquire it for each actual request (instead of open connection, which isn't exposed in WSGI anyway). Then you can use gunicorn to safely work around this issue.

For this, we need https://pypi.python.org/pypi/filelock/ (threadlocks will not suffice because gunicorn spawns multiple processes).
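To make the idea concrete, here is a minimal sketch of a per-request lock as WSGI middleware. It uses the standard library's fcntl.flock instead of the filelock package (so it is Unix-only), and LockedApp and the lock path are hypothetical names, not part of Radicale.

```python
import fcntl


class LockedApp:
    """WSGI middleware that serializes requests across threads and processes."""

    def __init__(self, app, lock_path):
        self.app = app
        self.lock_path = lock_path

    def __call__(self, environ, start_response):
        # Each call opens its own file object, so flock() excludes other
        # threads in this process as well as other worker processes.
        with open(self.lock_path, "a") as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
            try:
                result = self.app(environ, start_response)
                try:
                    # Consume the whole response while the lock is still held.
                    return list(result)
                finally:
                    close = getattr(result, "close", None)
                    if close is not None:
                        close()
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Buffering the body with list() keeps the lock held until the application has finished producing the response, at the cost of holding the whole response in memory. That seems acceptable for small calendar payloads but is a design choice made for this sketch.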

liZe commented 8 years ago

@untitaker Yes, I've already seen this project, looks like a very good solution.

By the way, we need a timeout even with a multithread/multiprocess solution.

untitaker commented 8 years ago

@liZe Yes, gunicorn already provides all of that. I know you don't want to introduce any dependencies, but I don't think the alternative of vendoring everything is viable.

You could also bundle both filelock and some pure-Python server like e.g. waitress.

liZe commented 8 years ago

@untitaker I'm now open to good ideas, even with dependencies: http://librelist.com/browser//radicale/2015/8/21/radicale-1-0-is-coming-what-s-next/

I'll open a ticket to talk about this.

liZe commented 8 years ago

See #364.

eatdust commented 8 years ago

Hi there, radicale 1.1.1 on mageia 5. Hangs at least once every 24 hours, I am considering changing to another caldav server as it makes it unusable for me professionally. Hope I can help. cheers.

liZe commented 8 years ago

@eatdust As it is explained in this ticket, using the monothread server built in Python will always lead to hangs AFAIK. The only reliable solution for now is to put Radicale behind a "real" HTTP server (I use nginx, but it's compatible with all the WSGI-compatible servers, including the Python-based ones).

tmst commented 8 years ago

I think you intended to put a link to the thread in here. Yes?

Say, I wonder how many of us are using Radicale behind a KDE client. I've had considerably less (or no) trouble after switching to Thunderbird. I'm still using Radicale 0.7.


untitaker commented 8 years ago

It is the thread you are replying to.


eatdust commented 8 years ago

@liZe Got it. Indeed, I am now running it under nginx with just proxying and it seems to work fine. Sorry about this; maybe this should be explained in the doc. As it was my first attempt to use radicale, I just followed the doc to set up the standalone python server. Maybe the doc should instead have a tutorial to guide new users in setting up radicale behind an HTTP server. Thanks.

liZe commented 7 years ago

Now that Radicale is data-race free, I think that we can safely close this issue. See #197 and #372 for the documentation part.

tmst commented 7 years ago

Have been running the naked server under OpenSUSE and Python2 for 2 years now and it's been solid. I use 0.7.1. As it has no issues, I've never seen a reason to upgrade. It's accessible only on the local network.

a3nm commented 7 years ago

Some more feedback about this issue: I worked around the problem by running Radicale behind Apache2 as a WSGI. There doesn't seem to be any more problems when doing this.

tmst commented 7 years ago

Oops. I forgot that I had set up a cron job that restarts the server every day. So I don't know how often it actually locks up.