Slow dataset page load waiting on revision query

florianm commented 8 years ago

CKAN Version if known (or site URL)

(first mentioned in #3260) My setup is datacats & CKAN 2.6.0a (docs).

/api/util/status says

{
"ckan_version": "2.6.0a",
"site_url": "xxx",
"site_description": "",
"site_title": "xxx",
"error_emails_to": "xxx",
"locale_default": "en_AU",
"extensions": [
"cesium_viewer",
"resource_proxy",
"datastore",
"datapusher",
"datawagovau_theme",
"stats",
"archiver",
"qa",
"featuredviews",
"showcase",
"disqus",
"scheming_datasets",
"recline_grid_view",
"recline_graph_view",
"recline_map_view",
"text_view",
"image_view",
"recline_view",
"pdf_view",
"webpage_view",
"dcat",
"dcat_rdf_harvester",
"dcat_json_harvester",
"dcat_json_interface",
"geo_view",
"geojson_view",
"spatial_metadata",
"spatial_query",
"geopusher",
"harvest",
"ckan_harvester",
"csw_harvester",
"hierarchy_display",
"hierarchy_form"
]
}

We've got ~1200 datasets in that instance.

Please describe the expected behaviour

Dataset pages should load at normal speeds
Please describe the actual behaviour
Dataset pages load in no less than 20-25 seconds
Home page layout including featured orgs and groups loads in no less than 20-25 seconds
Home page layout excluding features orgs and groups loads at normal speeds
Dataset list, org list loads at normal speeds
Resources via API load fine (e.g. loading CSV) - haven't speed-tested all API endpoints yet

We've had a period of lower usage, where no one reported any bottlenecks (or bottlenecks went unnoticed). In that time I upgraded the CKAN instance from 2.5 to 2.6.0a but I haven't got enough logs to correlate the drop in performance to a specific upgrade.

I've got a slightly non-standard setup:

I'm using a few of my own forks of extensions, still need to test with bare bones CKAN
I'm using a datacats setup, running CKAN, datapusher, postgres, SOLR and Redis in separate docker containers
I haven't pulled the latest ckan master (2.7.0a at the time of writing), as this causes a Solr schema conflict - my Solr instance comes packaged with no longer supported datacats (latest docker images have solr schema version 2.3, lateset CKAN requires solr schema version 2.7)
I've applied the database optimisations of the Performance tips, no measurable improvement in page load speed, although total index size went from ~2GB to ~1GB - diagnostic on index size following this article
The test instance with a handful of datasets runs quicker, so the slowness seems to correlate with number of datasets
One dataset page loads in about 20 sec, most time spent on TTFB while CKAN is thinking
Running the datacats/CKAN instance on an AWS t2.large did max out the CPUs with postgres AUTOVACCUM and IDLE, changing to a more CPU-optimised c4.large brought the CPU load down, AUTOVACUUM is not noticeable any more, but there's still a postgres IDLE process when someone's loading a dataset page
According to logs (loglevel DEBUG), this query may be the culprit, or the postgres IDLE running immediately before this query:

2016-10-04 04:33:28,689 DEBUG [ckan.lib.search.query] 
Package query: {'fq': [' +site_id:"default" +state:active'], 'facet.mincount': 1, 'rows': 2, 'facet.field': ['groups', 'owner_org'], 'facet': 'true', 'q': '+capacity:public', 'facet.limit': -1, 'wt': 'json', 'fl': 'groups'}

Datacats generates its own passwords, to export the postgres password, run the output of

cat /path/to/DATACATS_DATA_DIR/ENVIRONMENT_NAME/sites/primary/passwords.ini | grep postgres | awk -F' = ' '{print "export PGPASSWORD=" $2}'

Then activate datacats shell. Non-datacats setups will require their own host, database name, and username in the next line (modified from Aleksey Tsalolikhin):

watch -n 0.25 "psql -h db -d ckan -U postgres -c ' SELECT datname,pid,query FROM pg_stat_activity;'\
 | grep -v IDLE | grep -v 'SELECT datname,pid,query FROM pg_stat_activity;' \
| grep -v ^\(  | grep -v 'datname  | pid ' | grep -v -- '-----------+---------+-'"

which monitors current db queries. Here I found the following query running a few times at every dataset page load, exactly during the 20 sec TTFB wait (update: formatting):

SELECT count(*) AS count_1
FROM (
    SELECT 
        revision.id AS revision_id, 
        revision.timestamp AS revision_timestamp, 
        revision.author AS revision_author, 
        revision.message AS revision_message, 
        revision.state AS revision_state, 
        revision.approved_timestamp AS revision_approved_timestamp
    FROM revision
    WHERE revision.author = 'michael_rule' 
    ORDER BY revision.timestamp DESC
) AS anon_1

What steps can be taken to reproduce the issue?

Setup CKAN 2.6.0a following these docs
Create ~1200 datasets
Load a dataset page
update create 2.6 million revisions and 740k activities by vigorously updating datasets ( @amercader - how many have you got?)

Next I'll create an index on revision(author) and see whether that helps. But I cannot imagine that I'm the only one with this problem if that string lookup on revision.author were the true blocker.

If that won't help I'll create a completely new CKAN instance and migrate the data. In the mean time I'm open to suggestions where to look/debug/test next. Anyone seen similar degraded performance lately?

amercader commented 8 years ago

Hi @florianm, thanks for the thorough report.

I can't replicate this with 1500 datasets so just the number of them is not the issue. Some things to try / report:

Does it happen when disabling all extensions? If it doesn't happen gradually enable them until you can find the culprit.
Does the package_show API call also take that long? This would discard template rendering issues

What seems off to me is that you see degraded performance on the dataset page and the homepage with the org/group widgets, which would suggest a problem with the search. But then you mention that the dataset list works fine. Does the orgnanization/group page (the ones that lists their datasets) work fine?

wardi commented 7 years ago

This is very unusual, would you enable profiling and attach the report?

florianm commented 7 years ago

@wardi @amercader will toggle extensions until I find minimum problem set and enable profiling.

I most likely have introduced this by pulling latest CKAN master into the datacats CKAN up until a few months ago.

florianm commented 7 years ago

Apologies for the delay, 'tis the flu season. I haven't cracked the problem yet, but here are some random observations:

My slow query relates to a db column introduced in migration 84 - after 2.5.3 and in 2.6.0. This correlates with CKAN being fast roughly when I ran 2.5.3 and being slow since I ran 2.6.0.
Disabling all extensions did not impact the slow query and did not fix the performance issue.
Migrating the database backwards (2.7.0a to 2.5.3) is not possible out of the box. Once I've fixed the db, I can try the API package_show.

I'm now running CKAN 2.5.3 using datacats and getting

Error while starting web container:
/usr/lib/ckan/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:33: SAWarning: The IN-predicate on "group.name" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
  return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)
Debug at: http://0.0.0.0:5000/_debug/view/1478506946

With CKAN 2.5.3, the dataset pages still load slowly.

@amercader you're right - api/3/action/package_show?id=... loads instantly, /api/3/action/group_show?id=mpa-reporting loads slowly. Even after removing all groups from a dataset, the dataset page (GUI) still loads slowly.

florianm commented 7 years ago

It seems we've got a fair amount of revisions and activities which simply bog our instance down. We're updating ~600 datasets annually multiple times (using R scripts, see video).

In addition, the SQL behind the activity stream fetches all records, then limits the output to the limit as per CKAN config.

Maybe related, CKAN seems to rollback the select queries run when displaying dataset / group etc. pages.

The performance has improved from ~ 20 sec TTFB to ~ 3 sec TTFB (while vacuum analyze still runs) on dataset pages by deleting old revisions and activities. Also some btree indices seem to help (acvtivity shown, revision not shown):

delete from revision where timestamp < '2016-11-01';
DELETE 2693688

select count(*) from activity;
 count
--------
 743876
(1 row)

ckan=# delete from activity where timestamp < '2016-08-01';
DELETE 572303

vacuum analyze;

create index idx_activity_id on activity using btree(id);
create index idx_activity_timestamp on activity using btree(timestamp);
create index idx_activity_userid on activity using btree(user_id);
create index idx_activity_objectid on activity using btree(object_id);
create index idx_activity_revisionid on activity using btree(revision_id);

So it looks like there's some potential to streamline queries and indices, or cleaning out activities and revisions occasionally.

The slow performance I'm reporting was caused by having 2.6 million revisions and 740k activities, and improved through truncating these tables as shown above.

wardi commented 7 years ago

@florianm does the revision DELETE casade to all the *_revision tables? If so that's much simpler than the cleaning-up-revisions script from data.gov.uk.

@TkTech have you pushed your recent work on the activity table and query up somewhere? @florianm's 3s is still way too slow, I suspect your change might fix this one.

wardi commented 7 years ago

I think #2959 was the last activity-performance-fixing PR

florianm commented 7 years ago

@wardi I've applied some tuning and got rid of some indices and FK constraints (many involving _revision tables). My delete was non-cascading, and I'm not sure whether without my deleted FK constraints CASCADE would have dropped related revisions as well. Maybe cleaning out those other _revision tables might speed up my 3 sec dataset page load a bit more.

As my performance problem is fixed for now, and we're working on a new (sadly non-datacats) CKAN deployment, I'm happy to test @TkTech's patches and looking forward to seeing them merged to master.

TkTech commented 7 years ago

@florianm is this still an issue for you? I'm currently testing with 15m datasets and a few hundred million revisions and not running into this particular issue. It's possible that one of your plugins is responsible for the performance drop.

florianm commented 7 years ago

@TkTech yes, but I'm not using the latest CKAN (see above), as I'm running a datacats setup. My dataset list loads reasonably fast, but individual dataset pages can take a while. I'm closing this issue as I haven't reproduced it with latest master. @amercader @wardi @TkTech thanks for your input!

TLDR: this issue documents how to monitor postgres queries and identify performance bottlenecks in sql / model / content.

TkTech commented 7 years ago

Master recently merged a few of my PRs to improve performance, in particular with activities. Glad it worked! We have an ongoing epic to fully support 15m datasets at a minimum.

florianm commented 7 years ago

@TkTech looking forward to battletest when I'm migrating from datacats to CKAN's native docker :-) Epic will be needed when people are programmatically updating their resources, e.g. from https://github.com/datawagovau/ckan-o-sweave

MadsRCEnerginet commented 6 years ago

@wardi Sorry for necroing an old issue, but you mention a script for data.gov.uk to clean up old revisions - Would you happen to have a link (I've been unable to locate it)

florianm commented 6 years ago

@MadsRCEnerginet see my comment above - the delete from revision where timestamp < 'YYYY-MM-DD; bit did the trick.

MadsRCEnerginet commented 6 years ago

@florianm - I tried the same approach a few weeks back, however I ran into an issue where a lot of the entries had references that prevented a delete from happening. Thus I'd have to manually delete references, which with the current revision system is... problematic and slow. So I was searching for a premade script to handle this before requesting time to create one myself from my team.

florianm commented 6 years ago

@MadsRCEnerginet aw shoot, I suppose a cascade won't help either or destroy the wrong data, right? Sorry, that's all I can remember about this, good luck!

MadsRCEnerginet commented 6 years ago

I think it would help, but, as you mentioned, could destroy the wrong data... Never really been a fan of running cascade on a schema I'm not 110% comfortable with :dancing_men:

Thanks for reaching out though :+1:

florianm commented 6 years ago

No worries, and cascades are only half as scary with good backups :-D

wardi commented 6 years ago

@davidread do you still have a copy of the revision-trimming sql data.gov.uk has used?

davidread commented 6 years ago

@MadsRCEnerginet The code I can track down was when I had an issue was with resource_revisions. I had to drop constraints. Something like this:

BEGIN;
            ALTER TABLE package_tag DROP CONSTRAINT package_tag_revision_id_fkey;
            ALTER TABLE package_extra DROP CONSTRAINT package_extra_revision_id_fkey;
            ALTER TABLE resource DROP CONSTRAINT resource_revision_id_fkey;

            # now drop resource_revisions
            # e.g. DELETE from resource_revision where id='x' and revision_id='y';

            ALTER TABLE package_tag ADD CONSTRAINT package_tag_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
            ALTER TABLE package_extra ADD CONSTRAINT package_extra_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
            ALTER TABLE resource ADD CONSTRAINT resource_revision_id_fkey FOREIGN KEY (revision_id) REFERENCES revision(id);
COMMIT;

See: https://github.com/datagovuk/ckanext-dgu/blob/master/ckanext/dgu/bin/wms_revisions.py

MadsRCEnerginet commented 6 years ago

@davidread Thanks! That should point me in the correct direction!

ckan / ckan