digideskio / zotonic

Automatically exported from code.google.com/p/zotonic

z_session_manager goes to timeout every 4-6 hours #129

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
HTTP 500 errors occur randomly on the server (4-core Xeon server with 8 GB RAM).
Zotonic version: yesterday's HG checkout of the default branch (0.7.x).

{error,
    {exit,
        {timeout,
            {gen_server,call,
                ['z_session_manager$hints',
                 {ensure_session,true,
                     {context,
                      <.........>}
        [{gen_server,call,2},
         {z_session_manager,ensure_session,1},
         {z_context,ensure_session,1},
         {z_context,ensure_all,1},
         {resource_hints_srv_1hello,to_html,2},
         {webmachine_resource,resource_call,3},
         {webmachine_resource,do,3},
         {webmachine_decision_core,decision,3}]}}

After several such errors, z_session_manager is killed, and then everything is
restarted automatically by z_site_sup.

=SUPERVISOR REPORT==== 17-Dec-2010::22:44:21 ===
     Supervisor: {<0.140.0>,z_site_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.149.0>},
                  {name,z_session_manager},
                  {mfargs,
                      {z_session_manager,start_link,
                       ...}
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]
....

Full log for 11 hours of work is attached.

All of this causes HTTP 500 errors for all clients for 2-4 minutes.
The current average server load is 1-3 requests/sec.
Average CPU load, both during this crash and during normal work, is 0.5%, max 5%.
More than 60% of clients are unique (maybe the session manager's dict grows too fast?).

For now, my question is: what could cause such timeouts, and where might the
bottlenecks be?

PS: So far I cannot reproduce the problem manually, only after hours of load
on the production server.
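For context, the exit at the top of this report is the standard shape of a
gen_server:call that hits its default 5 second timeout while the called process
(here z_session_manager) is blocked. Below is a minimal self-contained sketch
that produces the same kind of exit -- the module name blocked_srv is made up
for illustration and is not Zotonic code:

-module(blocked_srv).
-behaviour(gen_server).
-export([start_link/0, ask/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% gen_server:call/2 waits 5000 ms by default; if the server does not reply in
%% time, the caller exits with {timeout, {gen_server, call, ...}}.
ask() ->
    gen_server:call(?MODULE, ask).

init([]) ->
    {ok, undefined}.

%% Simulate a manager that is stuck waiting on something else (for example a
%% database connection) and so never replies within the caller's timeout.
handle_call(ask, _From, State) ->
    timer:sleep(60000),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

Calling blocked_srv:ask() after blocked_srv:start_link() exits after 5 seconds
with exactly this kind of {timeout, {gen_server, call, ...}} reason, which
webmachine then turns into an HTTP 500.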

Original issue reported on code.google.com by hellla...@gmail.com on 18 Dec 2010 at 8:56

Attachments:

GoogleCodeExporter commented 8 years ago
I will study this. We have never seen this problem before, and we do have some
quite popular sites (with the same or higher load than yours).

We do keep sessions for 20+ minutes, so a lot of sessions might be problematic
for a garbage-collected dict. But then I don't see why a garbage collection
would take 2-4 minutes.

You are using the default branch? (0.7-dev)

Original comment by ma...@pobox.com on 19 Dec 2010 at 12:16

GoogleCodeExporter commented 8 years ago
Yes, I'm using 0.7-dev.
I have been thinking about this for the last two days.
I want to ask: does z_session depend on the site referrer?
Primarily, our server generates CSS and JS and provides background AJAX-like
services for third-party sites, so most requests to the server are really not
from my site but from other sites. 90% of the traffic and load is a bunch of
small JSON answers.

Original comment by hellla...@gmail.com on 19 Dec 2010 at 12:37

GoogleCodeExporter commented 8 years ago
I wonder, besides the fact that we have to fix this, does your server need any
sessions at all?

Original comment by ma...@pobox.com on 19 Dec 2010 at 12:42

GoogleCodeExporter commented 8 years ago
For now, I'm trying to remove sessions for third-party requests (they were
used for authenticating clients of third-party sites).

Original comment by hellla...@gmail.com on 19 Dec 2010 at 12:49

GoogleCodeExporter commented 8 years ago
I found that this same error occurs with very high probability under the
following conditions in a real running project:

1) Normal requests come to the zotonic server from clients, and the response
is built using the server's database (e.g. serving pages with data from the db).

2) Simultaneously, some percentage of z_db:q1() requests are refused by
postgresql with an error (e.g. bad SQL, a bad column name, or similar). Of
course, bad SQL requests should not occur, but sometimes they do in my case :)

After a certain number of such errors within some period of time, all database
requests become stuck for a very long time. This also locks up
z_session_manager, because z_session_manager asks for persistent IDs from the
`persistent` table.

I will now try to create and provide a test case.
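For illustration, such a test case could be as simple as issuing a handful of
deliberately broken queries from the Zotonic shell. This is only a hedged
sketch based on the description above -- mysite is a placeholder site name,
the wrong column name is intentional, and z_context:new/1, z_db:q1/2 and the
rsc table are assumed from a standard Zotonic install:

%% Each broken query hits the error path in z_db:q1 and (before the fix
%% discussed below) leaves one more connection checked out of the pool.
Context = z_context:new(mysite),
[catch z_db:q1("SELECT no_such_column FROM rsc LIMIT 1", Context)
    || _ <- lists:seq(1, 10)].

After a few of these, pg_stat_activity (checked later in this thread) should
show the extra connections that are never returned.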

Original comment by hellla...@gmail.com on 13 Jan 2011 at 10:56

GoogleCodeExporter commented 8 years ago
Thanks for investigating further.

This looks like an issue we are seeing with some webhosts, where timeouts in
postgresql cause problems with the postgresql driver we are using.
Original comment by ma...@pobox.com on 13 Jan 2011 at 11:16

GoogleCodeExporter commented 8 years ago
Hi, Marc. I found that z_db:q1() does not release the connection on errors:
it crashes with a case_clause before it can call return_connection().

Then I try a bad z_db:q1() request with a wrong column name:
{case_clause,
  {error,
    {error,error,<<"42703">>,
        <<"column \"added\" does not exist">>,
    []}}}

You can try to patch the z_db.erl file to add some debugging:
q1(Sql, Parameters, Context) ->
    case get_connection(Context) of
        none ->
            undefined;
        C ->
            ?DEBUG(q1_got_connection),
            %% Any {error, _} other than {error, noresult} has no matching
            %% clause here, so the process crashes with case_clause before
            %% return_connection/2 is ever reached.
            V = case pgsql:equery1(C, Sql, Parameters) of
                    {ok, Value} -> Value;
                    {error, noresult} -> undefined
                end,
            ?DEBUG(q1_goes_to_return_connection),
            return_connection(C, Context),
            V
    end.

So the output should look like this (I think):
q1_got_connection
--- error here ---
q1_goes_to_return_connection

But really it is:
q1_got_connection
--- error here ---
And nothing more.

Original comment by hellla...@gmail.com on 13 Jan 2011 at 11:34
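One way to guarantee the connection is always returned, regardless of what
pgsql:equery1/3 returns, is to wrap the query in try ... after. This is only a
sketch of the idea, not the patch that was actually committed later in this
thread; it reuses get_connection/1, return_connection/2 and pgsql:equery1/3
from the snippet above:

q1(Sql, Parameters, Context) ->
    case get_connection(Context) of
        none ->
            undefined;
        C ->
            %% The `after` part runs whether the inner case returns a value or
            %% crashes with case_clause, so the connection is always handed
            %% back to the pool.
            try
                case pgsql:equery1(C, Sql, Parameters) of
                    {ok, Value} -> Value;
                    {error, noresult} -> undefined
                end
            after
                return_connection(C, Context)
            end
    end.

Whether it is even safe to reuse a connection after a failed query is a
separate question -- Marc addresses exactly that in the next comment.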

GoogleCodeExporter commented 8 years ago
Hi,

The basic idea of error handling in the pgsql library is that the connection 
should be closed when an error occurs, as this (sometimes) leaves the 
connection in an indeterminate state.  

So the connection should crash and the pgsql pool should pick up the crash and 
close/remove the connection.

After your erroneous query, is the db connection still alive (from the postgres 
point of view)?

- Marc

Original comment by ma...@pobox.com on 13 Jan 2011 at 11:44
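A hedged sketch of that behaviour -- not the actual pgsql pool code, just the
monitoring pattern Marc describes, in which a crashed connection is dropped
from the pool instead of being handed out again (open_connection/0 is a
placeholder for the real connect call):

-module(pool_sketch).
-behaviour(gen_server).
-export([start_link/0, get_connection/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get_connection() ->
    gen_server:call(?MODULE, get_connection).

%% State: [{ConnPid, MonitorRef}] for every connection the pool knows about.
init([]) ->
    {ok, []}.

handle_call(get_connection, _From, Conns) ->
    {ok, Pid} = open_connection(),
    Ref = erlang:monitor(process, Pid),
    {reply, {ok, Pid}, [{Pid, Ref} | Conns]}.

handle_cast(_Msg, Conns) ->
    {noreply, Conns}.

%% A crashing connection delivers a 'DOWN' message; the pool simply forgets it.
handle_info({'DOWN', Ref, process, Pid, _Reason}, Conns) ->
    {noreply, lists:delete({Pid, Ref}, Conns)};
handle_info(_Other, Conns) ->
    {noreply, Conns}.

open_connection() ->
    %% Placeholder so the sketch is self-contained; a real pool would call the
    %% pgsql driver's connect function here.
    {ok, spawn(fun() -> receive stop -> ok end end)}.

With that pattern the failure mode is a closed connection plus a fresh one on
the next checkout, rather than a connection that stays checked out forever.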

GoogleCodeExporter commented 8 years ago
Yes, Marc. The connection is left unreleased. In psql I'm running this:
zotonic=# SELECT * FROM pg_catalog.pg_stat_activity;

Before the error I see 7 rows (including several connections from pgAdmin);
after some z_db:q1() errors I see 10 rows, and a minute later I see 12, 13,
14, ... rows.

Original comment by hellla...@gmail.com on 13 Jan 2011 at 11:57

GoogleCodeExporter commented 8 years ago
OK, at ~8 hung connections from zotonic the server goes completely frozen :D

Original comment by hellla...@gmail.com on 13 Jan 2011 at 12:08

GoogleCodeExporter commented 8 years ago
Thanks, we can use this for debugging the connections.

Original comment by ma...@pobox.com on 13 Jan 2011 at 12:11

GoogleCodeExporter commented 8 years ago
I just committed a patch to release-0.6.x and default that might fix this issue.

Can you check with your install?

- Marc

Original comment by ma...@pobox.com on 13 Jan 2011 at 3:50

GoogleCodeExporter commented 8 years ago
Thank you for your quick response, Marc! I hope this crazy bug is now 100% gone.

According to pg_catalog.pg_stat_activity, the connection count no longer leaks
after your bugfix. I can merge these changes to the production servers only
during the coming Sunday-to-Monday night. I will let you know about the
progress next week (via email or via these comments).

Original comment by hellla...@gmail.com on 13 Jan 2011 at 4:27

GoogleCodeExporter commented 8 years ago
Marc, I'm not sure, but now I see a similar bug on a low-load production node
after several days of running. Again, it is a timeout on the db connection.
But it is even stranger:

1. Postgres 9.0.1 is a hot standby (read only), so there are some periodic
non-critical errors in the zotonic log due to the READ ONLY db.
2. Restarting zotonic several times DOES NOT fix the hang; the restarted
zotonic server still cannot work with data in the DB.
3. Restarting postgres fixes the problem, and during the restart I see a big
fat bunch of dropped SELECT transactions (see the attached log).
4. During the postgres restart (3) I had zotonic completely stopped, so all
these transactions are "zombies".

I attached two logs -- one from postgres and one from zotonic, with some
comments of mine. What do you think about this?

Original comment by hellla...@gmail.com on 22 Jan 2011 at 7:28

Attachments:

GoogleCodeExporter commented 8 years ago
It looks like the previous deadlock was related only to the too-young postgres
9.0.1 standby server. I've updated to 9.0.2 and the situation looks much better.

Anyway, according to the server logs, this zotonic bug is now fully fixed.

Original comment by hellla...@gmail.com on 27 Jan 2011 at 2:02

GoogleCodeExporter commented 8 years ago
Thanks for coming back, and great that the fixes work!

I will tell our people to upgrade to 9.0.2 as well; that will come in useful.

Original comment by ma...@pobox.com on 3 Feb 2011 at 9:07

GoogleCodeExporter commented 8 years ago
Not 9.0.2, but 9.0.3 :) According to the pg-9.0.3 changelog, some more
standby-replica bugs are fixed there too.

Original comment by hellla...@gmail.com on 3 Feb 2011 at 9:19