Closed florianm closed 7 years ago
Hi @florianm, thanks for the thorough report.
I can't replicate this with 1500 datasets so just the number of them is not the issue. Some things to try / report:
package_show
API call also take that long? This would discard template rendering issuesWhat seems off to me is that you see degraded performance on the dataset page and the homepage with the org/group widgets, which would suggest a problem with the search. But then you mention that the dataset list works fine. Does the orgnanization/group page (the ones that lists their datasets) work fine?
This is very unusual, would you enable profiling and attach the report?
@wardi @amercader will toggle extensions until I find minimum problem set and enable profiling.
I most likely have introduced this by pulling latest CKAN master into the datacats CKAN up until a few months ago.
Apologies for the delay, 'tis the flu season. I haven't cracked the problem yet, but here are some random observations:
package_show
.I'm now running CKAN 2.5.3 using datacats and getting
Error while starting web container:
/usr/lib/ckan/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:33: SAWarning: The IN-predicate on "group.name" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)
Debug at: http://0.0.0.0:5000/_debug/view/1478506946
With CKAN 2.5.3, the dataset pages still load slowly.
@amercader you're right - api/3/action/package_show?id=...
loads instantly, /api/3/action/group_show?id=mpa-reporting
loads slowly. Even after removing all groups from a dataset, the dataset page (GUI) still loads slowly.
It seems we've got a fair amount of revisions and activities which simply bog our instance down. We're updating ~600 datasets annually multiple times (using R scripts, see video).
In addition, the SQL behind the activity stream fetches all records, then limits the output to the limit as per CKAN config.
Maybe related, CKAN seems to rollback the select queries run when displaying dataset / group etc. pages.
The performance has improved from ~ 20 sec TTFB to ~ 3 sec TTFB (while vacuum analyze still runs) on dataset pages by deleting old revisions and activities. Also some btree indices seem to help (acvtivity shown, revision not shown):
delete from revision where timestamp < '2016-11-01';
DELETE 2693688
select count(*) from activity;
count
--------
743876
(1 row)
ckan=# delete from activity where timestamp < '2016-08-01';
DELETE 572303
vacuum analyze;
create index idx_activity_id on activity using btree(id);
create index idx_activity_timestamp on activity using btree(timestamp);
create index idx_activity_userid on activity using btree(user_id);
create index idx_activity_objectid on activity using btree(object_id);
create index idx_activity_revisionid on activity using btree(revision_id);
So it looks like there's some potential to streamline queries and indices, or cleaning out activities and revisions occasionally.
The slow performance I'm reporting was caused by having 2.6 million revisions and 740k activities, and improved through truncating these tables as shown above.
@florianm does the revision DELETE casade to all the *_revision tables? If so that's much simpler than the cleaning-up-revisions script from data.gov.uk.
@TkTech have you pushed your recent work on the activity table and query up somewhere? @florianm's 3s is still way too slow, I suspect your change might fix this one.
I think #2959 was the last activity-performance-fixing PR
@wardi I've applied some tuning and got rid of some indices and FK constraints (many involving _revision tables). My delete was non-cascading, and I'm not sure whether without my deleted FK constraints CASCADE would have dropped related revisions as well. Maybe cleaning out those other _revision tables might speed up my 3 sec dataset page load a bit more.
As my performance problem is fixed for now, and we're working on a new (sadly non-datacats) CKAN deployment, I'm happy to test @TkTech's patches and looking forward to seeing them merged to master.
@florianm is this still an issue for you? I'm currently testing with 15m datasets and a few hundred million revisions and not running into this particular issue. It's possible that one of your plugins is responsible for the performance drop.
@TkTech yes, but I'm not using the latest CKAN (see above), as I'm running a datacats setup. My dataset list loads reasonably fast, but individual dataset pages can take a while. I'm closing this issue as I haven't reproduced it with latest master. @amercader @wardi @TkTech thanks for your input!
TLDR: this issue documents how to monitor postgres queries and identify performance bottlenecks in sql / model / content.
Master recently merged a few of my PRs to improve performance, in particular with activities. Glad it worked! We have an ongoing epic to fully support 15m datasets at a minimum.
@TkTech looking forward to battletest when I'm migrating from datacats to CKAN's native docker :-) Epic will be needed when people are programmatically updating their resources, e.g. from https://github.com/datawagovau/ckan-o-sweave
@wardi Sorry for necroing an old issue, but you mention a script for data.gov.uk to clean up old revisions - Would you happen to have a link (I've been unable to locate it)
@MadsRCEnerginet see my comment above - the delete from revision where timestamp < 'YYYY-MM-DD;
bit did the trick.
@florianm - I tried the same approach a few weeks back, however I ran into an issue where a lot of the entries had references that prevented a delete from happening. Thus I'd have to manually delete references, which with the current revision system is... problematic and slow. So I was searching for a premade script to handle this before requesting time to create one myself from my team.
@MadsRCEnerginet aw shoot, I suppose a cascade
won't help either or destroy the wrong data, right?
Sorry, that's all I can remember about this, good luck!
I think it would help, but, as you mentioned, could destroy the wrong data... Never really been a fan of running cascade
on a schema I'm not 110% comfortable with :dancing_men:
Thanks for reaching out though :+1:
No worries, and cascades are only half as scary with good backups :-D
@davidread do you still have a copy of the revision-trimming sql data.gov.uk has used?
@MadsRCEnerginet The code I can track down was when I had an issue was with resource_revisions. I had to drop constraints. Something like this:
BEGIN;
ALTER TABLE package_tag DROP CONSTRAINT package_tag_revision_id_fkey;
ALTER TABLE package_extra DROP CONSTRAINT package_extra_revision_id_fkey;
ALTER TABLE resource DROP CONSTRAINT resource_revision_id_fkey;
# now drop resource_revisions
# e.g. DELETE from resource_revision where id='x' and revision_id='y';
ALTER TABLE package_tag ADD CONSTRAINT package_tag_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
ALTER TABLE package_extra ADD CONSTRAINT package_extra_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
ALTER TABLE resource ADD CONSTRAINT resource_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
COMMIT;
See: https://github.com/datagovuk/ckanext-dgu/blob/master/ckanext/dgu/bin/wms_revisions.py
@davidread Thanks! That should point me in the correct direction!
CKAN Version if known (or site URL)
(first mentioned in #3260) My setup is datacats & CKAN 2.6.0a (docs).
/api/util/status says
We've got ~1200 datasets in that instance.
Please describe the expected behaviour
Please describe the actual behaviour
We've had a period of lower usage, where no one reported any bottlenecks (or bottlenecks went unnoticed). In that time I upgraded the CKAN instance from 2.5 to 2.6.0a but I haven't got enough logs to correlate the drop in performance to a specific upgrade.
I've got a slightly non-standard setup:
Datacats generates its own passwords, to export the postgres password, run the output of
Then activate
datacats shell
. Non-datacats setups will require their own host, database name, and username in the next line (modified from Aleksey Tsalolikhin):which monitors current db queries. Here I found the following query running a few times at every dataset page load, exactly during the 20 sec TTFB wait (update: formatting):
What steps can be taken to reproduce the issue?
Next I'll create an index on revision(author) and see whether that helps. But I cannot imagine that I'm the only one with this problem if that string lookup on revision.author were the true blocker.
If that won't help I'll create a completely new CKAN instance and migrate the data. In the mean time I'm open to suggestions where to look/debug/test next. Anyone seen similar degraded performance lately?