Issue closed by GoogleCodeExporter 8 years ago.
I will study this. We have never seen this problem before, and we do have some
quite popular sites (with the same or higher load than yours).
We do keep sessions for 20+ minutes, so a lot of sessions might be problematic
for a garbage-collected dict. But then I don't see why a garbage collect would
take 2-4 minutes.
Are you using the default branch (0.7-dev)?
Original comment by ma...@pobox.com
on 19 Dec 2010 at 12:16
Yes, I'm using 0.7-dev.
I have been thinking about it for the last two days.
I want to ask: does z_session depend on the site referrer?
Primarily, our server generates CSS and JS and provides background AJAX-like
services for third-party sites, so most requests to the server are really not
for my site but for other sites. 90% of the traffic and load is a bunch of
small JSON answers.
Original comment by hellla...@gmail.com
on 19 Dec 2010 at 12:37
I wonder, besides the fact that we have to fix this, does your server need any
sessions at all?
Original comment by ma...@pobox.com
on 19 Dec 2010 at 12:42
For now, I'm trying to remove sessions for third-party requests (they were used
for authenticating third-party-site clients).
Original comment by hellla...@gmail.com
on 19 Dec 2010 at 12:49
I found that this same error occurs with very high probability under the
following conditions in a real running project:
1) Normal requests come to the Zotonic server from clients, and the response is
built using the server's database (e.g. serving pages with data from the DB).
2) Simultaneously, some percentage of z_db:q1() requests are refused by
PostgreSQL with an error (e.g. bad SQL, a bad column name, or similar). Yes, of
course, bad SQL requests should not occur, but sometimes they do in my case :)
After some number of such errors within some amount of time, all database
requests become deadlocked for a very long time. It also locks up
z_session_manager, because z_session_manager likes to ask for persistent IDs
from the `persistent` table.
I will now try to create and provide a test case.
Original comment by hellla...@gmail.com
on 13 Jan 2011 at 10:56
Thanks for investigating further.
This looks like an issue we are seeing with some web hosts, where timeouts in
PostgreSQL cause problems with the PostgreSQL driver we are using.
Original comment by ma...@pobox.com
on 13 Jan 2011 at 11:16
Hi. I found that z_db:q1() does not release the connection on errors, because a
case_clause crash means return_connection() is never called.
Then I tried a bad z_db:q1() request with a wrong column name:
{case_clause,
    {error,
        {error,error,<<"42703">>,
            <<"column \"added\" does not exist">>,
            []}}}
You can patch the z_db.erl file to add some debug output:

q1(Sql, Parameters, Context) ->
    case get_connection(Context) of
        none -> undefined;
        C ->
            ?DEBUG(q1_got_connection),
            V = case pgsql:equery1(C, Sql, Parameters) of
                    {ok, Value} -> Value;
                    {error, noresult} -> undefined
                end,
            ?DEBUG(q1_goes_to_return_connection),
            return_connection(C, Context),
            V
    end.
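One way to guarantee the release is to wrap the query in try ... after, so the connection always goes back to the pool. This is only a sketch under the assumption that z_db exposes get_connection/1 and return_connection/2 as shown above; the actual committed fix may look different:

```erlang
q1(Sql, Parameters, Context) ->
    case get_connection(Context) of
        none -> undefined;
        C ->
            %% 'after' runs whether equery1 returns a matching tuple,
            %% throws, or returns a shape no clause matches (try_clause),
            %% so the connection is always handed back to the pool
            %% before any exception propagates further.
            try pgsql:equery1(C, Sql, Parameters) of
                {ok, Value} -> Value;
                {error, noresult} -> undefined
            after
                return_connection(C, Context)
            end
    end.
```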
So the output should look like this (I think):
q1_got_connection
--- error here ---
q1_goes_to_return_connection
But in reality it is:
q1_got_connection
--- error here ---
And nothing more.
Original comment by hellla...@gmail.com
on 13 Jan 2011 at 11:34
Hi,
The basic idea of error handling in the pgsql library is that the connection
should be closed when an error occurs, as an error (sometimes) leaves the
connection in an indeterminate state.
So the connection process should crash, and the pgsql pool should pick up the
crash and close/remove the connection.
After your erroneous query, is the DB connection still alive (from the Postgres
point of view)?
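The crash-and-cleanup behaviour described above could be sketched roughly as a pool that monitors each connection process. These are hypothetical gen_server fragments (the record fields and message shapes are assumptions, not the actual zotonic/pgsql pool code):

```erlang
%% Hypothetical pool fragment: the pool holds a monitor on every
%% connection process, so a connection that crashes on a protocol
%% error is dropped from the free list instead of being handed out
%% again in an indeterminate state.
handle_info({'DOWN', _MRef, process, Conn, _Reason},
            State = #state{free = Free}) ->
    {noreply, State#state{free = lists:delete(Conn, Free)}};
handle_info(_Other, State) ->
    {noreply, State}.
```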
- Marc
Original comment by ma...@pobox.com
on 13 Jan 2011 at 11:44
Yes, Marc, the connection is left unreleased. In psql I run this:
zotonic=# SELECT * FROM pg_catalog.pg_stat_activity;
Before the error I see 7 rows (including several connections from pgAdmin);
after some z_db:q1() errors I see 10 rows, and a minute later 12, 13, 14, ...
rows.
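To watch the leak over time, the backend count can be polled instead of eyeballing the full view. A minimal query against the same catalog view used above (the zotonic database name is taken from the session shown in this thread):

```sql
-- Print one number: the current count of PostgreSQL backends.
-- A steadily growing count after failed queries means leaked connections.
SELECT count(*) FROM pg_catalog.pg_stat_activity;
```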
Original comment by hellla...@gmail.com
on 13 Jan 2011 at 11:57
OK, after ~8 hung connections from Zotonic the server freezes completely :D
Original comment by hellla...@gmail.com
on 13 Jan 2011 at 12:08
Thanks, we can use this for debugging the connections.
Original comment by ma...@pobox.com
on 13 Jan 2011 at 12:11
I just committed a patch to release-0.6.x and default that might fix this issue.
Can you check with your install?
- Marc
Original comment by ma...@pobox.com
on 13 Jan 2011 at 3:50
Thank you for your quickness, Marc! I hope this crazy bug is now 100% gone.
According to pg_catalog.pg_stat_activity, the connection count no longer leaks
after your bugfix. I can merge these changes to the production servers only in
the coming Sunday-to-Monday night, and I will report on the progress next week
(via email or via these comments).
Original comment by hellla...@gmail.com
on 13 Jan 2011 at 4:27
Marc, I'm not sure, but now I see a similar bug on a low-load production node
after several days of uptime. Again, it is a timeout on the DB connection. But
it is stranger:
1. Postgres 9.0.1 runs as a hot standby (read only), so there are some periodic
non-critical errors in the Zotonic log due to the READ ONLY database.
2. Restarting Zotonic several times DOES NOT fix the hang, so the restarted
Zotonic server still cannot work with the data in the DB.
3. Restarting Postgres fixes the problem, and during the restart I see a big
fat bunch of dropped SELECT transactions (see the attached log).
4. I had stopped Zotonic completely during the Postgres restart (3), so all
these transactions are "zombies".
I attached the two logs, one from Postgres and one from Zotonic, with some
comments of mine. What do you think about this?
Original comment by hellla...@gmail.com
on 22 Jan 2011 at 7:28
It looks like the previous deadlock was only related to the too-young Postgres
9.0.1 standby server. I've upgraded to 9.0.2 and the situation looks much
better.
Anyway, according to the server logs, this Zotonic bug is now fully fixed.
Original comment by hellla...@gmail.com
on 27 Jan 2011 at 2:02
Thanks for coming back, and great that the fixes work!
I will tell our people to upgrade to 9.0.2 as well; that will come in useful.
Original comment by ma...@pobox.com
on 3 Feb 2011 at 9:07
Not to 9.0.2, but to 9.0.3 :) According to the pg-9.0.3 changelog, some more
standby-replica bugs are fixed in it as well.
Original comment by hellla...@gmail.com
on 3 Feb 2011 at 9:19
Original issue reported on code.google.com by
hellla...@gmail.com
on 18 Dec 2010 at 8:56