asulibraries / islandora-repo

ASU Digital Repository on Islandora
GNU General Public License v2.0

Slow loading for large collections that have many items #270

Closed: wgilling closed this issue 3 years ago

wgilling commented 3 years ago

This appears to be caused by one of the statistics boxes in the AboutThisCollectionBlock.php block.

https://prism.lib.asu.edu/collections/177

wgilling commented 3 years ago

The nested loop that calculates the views and items really needs to be optimized so that it does not run so many queries against the database. It currently loops through $nids and, inside that, loops through each node's $res_type, i.e. (# of nodes) x (# of res_types) iterations: https://github.com/asulibraries/islandora-repo/blob/develop/web/modules/custom/asu_collection_extras/src/Plugin/Block/AboutThisCollectionBlock.php#L156-L180

Inside this loop, it is also calling the Matomo service to get each node's view count.

This loop also calls getOriginalFileCount, a function that itself loops twice and potentially calls itself recursively (for children of children of the collection). Unless we can say that objects will never be more than two or three layers deep, we need the recursion.
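Roughly the shape of the loop being described (a simplified sketch only -- the field name used for $res_type and the tallies are placeholders; see the permalink above for the actual code):

```php
// Simplified sketch of the pattern described above, not the actual block code.
foreach ($nids as $child_nid) {
  // One entity load per child node.
  $node = $this->entityTypeManager->getStorage('node')->load($child_nid);
  // Inner loop over the node's resource types:
  // (# of nodes) x (# of res_types) iterations in total.
  foreach ($node->get('field_resource_type')->referencedEntities() as $res_type) {
    $items_per_type[$res_type->id()] = ($items_per_type[$res_type->id()] ?? 0) + 1;
  }
  // One Matomo HTTP request per child node.
  $views += $this->islandoraMatomo->getViewsForNode($child_nid);
  // Two more loops, plus possible recursion, per child node.
  $files += $this->getOriginalFileCount($child_nid, $original_file_tid);
}
```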

elizoller commented 3 years ago

The three issues in the AboutThisCollectionBlock are:

  1. $files += $this->getOriginalFileCount($child_nid, $original_file_tid);
  2. $this->entityTypeManager->getStorage('node')->load($child_nid);
  3. $node_views = $this->islandoraMatomo->getViewsForNode($child_nid);

wgilling commented 3 years ago

The recursive part in getOriginalFileCount could be done with the Solr query that gets all children using the itm_field_ancestors field (like getCollectionNids does?), so that it does not need to do recursion at least -- the MySQL could then use the returned $nid values in a WHERE clause like node_field_data.nid IN ( {{ the returned $nids imploded with a comma }} ). The other option is to index a flag in Solr to indicate there is an original file, so this wouldn't have to be two steps here. Realizing that the idea of keeping a flag for whether or not there is an Original File on any node would have to be an enhancement that ties into the media-related hook or something similar.

Eli Zoller 3:21 PM We could certainly try a Solr query and see if it's more performant. As far as the original file bit - yeah, we'd have to add an indexed value that aggregates up from the media. I don't think that's too much trouble.

Willow Gillingham 3:23 PM It will be easy to get the recursion out for now - and very easy to use the Solr value for it if/when the media flag can be read in that same query at a later time... until then, I think it will be pretty easy to have one SQL statement to get the entire set of nodes' file counts at one time.
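A rough sketch of that non-recursive approach (the search_api index id 'default_solr_index' and field id 'field_ancestors' are assumptions -- the repo may use different names -- and the "Original File" condition is left as a placeholder):

```php
use Drupal\search_api\Entity\Index;

// 1. One Solr query to get every descendant nid of the collection, instead
//    of recursing through children of children.
$index = Index::load('default_solr_index');
$query = $index->query();
// Assumed search_api field id; indexed in Solr as itm_field_ancestors.
$query->addCondition('field_ancestors', $collection_nid);
$query->range(0, 10000);
$results = $query->execute();
$nids = [];
foreach ($results->getResultItems() as $item) {
  // Item ids look like 'entity:node/123:en'; pull the nid out of the string.
  [, $raw] = explode('/', $item->getId());
  $nids[] = (int) $raw;
}

// 2. One database query for the whole set instead of one per node --
//    whatever getOriginalFileCount currently queries per nid can take an
//    IN condition on the collected nids, e.g.:
$query = \Drupal::database()->select('node_field_data', 'n');
$query->condition('n.nid', $nids, 'IN');
// ... joins/conditions for the "Original File" media use would go here ...
$files = $query->countQuery()->execute()->fetchField();
```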

elizoller commented 3 years ago

The design team supported @wgilling's idea of using a nightly cache table for the "views" count. I found two things that could either make that easier OR possibly make it so that we don't need it at all:

  1. you can include multiple items in a call to the Matomo API

Some parameters can optionally accept arrays. For example, the urls parameter of SitesManager.addSite, SitesManager.addSiteAliasUrls, and SitesManager.updateSite allows for an array of urls to be passed. To pass an array add the bracket operators and an index to the parameter name in the get request. So, to call SitesManager.addSite with two urls you would use the following array:

https://demo.matomo.org/?module=API&method=SitesManager.addSite&siteName=new%20example%20website&urls[0]=https://example.org&urls[1]=https://example-alias.org

  2. you can use the API in bulk

Sometimes it is necessary to call the Matomo API a few times to get the data needed for a report or custom application. When you need to call many API functions simultaneously or if you just don't want to issue a lot of HTTP requests, you may want to consider using a Bulk API Request. This feature allows you to call several API methods with one HTTP request (either a GET or POST).

To issue a bulk request, call the API.getBulkRequest method and pass the API methods & parameters (each request must be URL Encoded) you wish to call in the 'urls' query parameter. For example, to call VisitsSummary.get & VisitorInterest.getNumberOfVisitsPerVisitDuration at the same time, you can use:

https://demo.matomo.org/?module=API&method=API.getBulkRequest&format=json&urls[0]=method%3dVisitsSummary.get%26idSite%3d3%26date%3d2012-03-06%26period%3dday&urls[1]=method%3dVisitorInterest.getNumberOfVisitsPerVisitDuration%26idSite%3d3%26date%3d2012-03-06%26period%3dday

Notice that urls[0] is the url-encoded call to VisitsSummary.get by itself and that urls[1] is what you would use to call VisitorInterest.getNumberOfVisitsPerVisitDuration by itself. The &format is specified only once (format=xml and format=json are supported for bulk requests).

I got this from https://developer.matomo.org/api-reference/reporting-api. It would require additional methods in the MatomoService: https://github.com/asulibraries/islandora_matomo/blob/master/src/IslandoraMatomoService.php
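As a rough sketch of what one of those additional methods could look like (the method name and the matomoUrl/matomoToken/httpClient properties are made up here, not the real service's API; only the API.getBulkRequest query format comes from the docs quoted above):

```php
/**
 * Sketch: fetch results for several Matomo API calls in one HTTP request.
 *
 * @param string[] $requests
 *   Un-encoded sub-request query strings, e.g.
 *   "method=VisitsSummary.get&idSite=3&period=day&date=2012-03-06".
 *
 * @return array
 *   Decoded JSON responses, one per sub-request, in the same order.
 */
public function getBulkRequest(array $requests) {
  $params = [
    'module' => 'API',
    'method' => 'API.getBulkRequest',
    'format' => 'json',
    // Assumed property holding the Matomo auth token.
    'token_auth' => $this->matomoToken,
  ];
  // Each sub-request goes into urls[0], urls[1], ...; Guzzle's form_params
  // handling takes care of the URL-encoding the docs require.
  foreach (array_values($requests) as $i => $request) {
    $params["urls[$i]"] = $request;
  }
  $response = $this->httpClient->request('POST', $this->matomoUrl, [
    'form_params' => $params,
  ]);
  return json_decode((string) $response->getBody(), TRUE);
}
```

The block could then build one query string per node and get all of the view counts back with a single POST instead of one GET per node.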

wgilling commented 3 years ago

I looked at the bulk Matomo method and the ability to send an array parameter, but I am not sure that these methods would do much to address our largest collections.

I would like to proceed with a summary table that could be populated with the collection stats at night and store all of the values that we need for this page: views, downloads, items, files, # of resource types, and even the last updated date. These records could also be stored with a date so that we could show "views/downloads/items over time" if we wanted to. However, if we want to add a graph like this to a metrics area (with the altmetrics stuff), we would need to store each item, and its collection, in an indexed table.
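One possible shape for that table, as a hook_schema() sketch (the column names here are just illustrative, not final):

```php
/**
 * Implements hook_schema().
 */
function asu_collection_extras_schema() {
  $schema['asu_collection_extras_collection_usage'] = [
    'description' => 'Nightly summary of usage statistics per collection.',
    'fields' => [
      'collection_nid' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
      'views' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0],
      'downloads' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0],
      'items' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0],
      'files' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0],
      'resource_type_count' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0],
      'last_item_modified' => ['type' => 'int', 'not null' => TRUE, 'default' => 0, 'description' => 'Timestamp of the most recently updated item.'],
      'created' => ['type' => 'int', 'not null' => TRUE, 'default' => 0, 'description' => 'When this summary row was generated.'],
    ],
    'primary key' => ['collection_nid', 'created'],
  ];
  return $schema;
}
```

Keeping created in the primary key is what would let us store one row per collection per night for the "over time" graphs.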

wgilling commented 3 years ago

The latest commit contains configuration and code for a new Solr processor, contained in asu_search/src/Plugin/search_api/processor/OriginalFileCount.php. Because of this, all of the content will need to be reindexed in Solr before it will work.
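(Assuming the standard search_api drush integration is available, reindexing can be done with drush search-api:clear followed by drush search-api:index, or by queueing the index for reindexing from the Search API admin UI.)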

The remaining optimization step would be to load the usage values from a summary table that is populated during an off-hours CRON process per collection. The controller code for this block would then only need to load one record from that table for the usage.
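In other words, the block would end up doing something like this on each page view (table and column names follow the schema sketch above, so they are still assumptions):

```php
// One read from the indexed summary table instead of per-node entity loads
// and per-node Matomo requests.
$usage = \Drupal::database()
  ->select('asu_collection_extras_collection_usage', 'u')
  ->fields('u')
  ->condition('u.collection_nid', $collection_nid)
  ->orderBy('u.created', 'DESC')
  ->range(0, 1)
  ->execute()
  ->fetchAssoc();
```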

wgilling commented 3 years ago

Rather than making a pull request yet, I would like the code to be reviewed first. The non-dependency-injection code is currently just sitting in the .module file for now.

Installing the module will create the asu_collection_extras_collection_usage table. On an existing site, this will require uninstalling the module first and then installing it again:

drush pm-uninstall asu_collection_extras
drush pm-enable asu_collection_extras

To populate this table via drush, run: drush asu_collection_extras:collection_summary. This command will ultimately be set up as a crontab line such as: 1 0 * * * /usr/local/bin/drush asu_collection_extras:collection_summary >/dev/null 2>&1
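For anyone reviewing, the drush command itself would be a standard Drush 9+ command class along these lines (a sketch only; the class name and the helper function called from the .module file are placeholders, not the actual code):

```php
<?php

namespace Drupal\asu_collection_extras\Commands;

use Drush\Commands\DrushCommands;

/**
 * Drush commands for the asu_collection_extras module.
 */
class AsuCollectionExtrasCommands extends DrushCommands {

  /**
   * Populate the asu_collection_extras_collection_usage summary table.
   *
   * @command asu_collection_extras:collection_summary
   * @usage drush asu_collection_extras:collection_summary
   *   Rebuild the per-collection usage summary (intended to run nightly).
   */
  public function collectionSummary() {
    // Hypothetical helper that currently lives in the .module file
    // (per the note above about non-dependency-injection code).
    asu_collection_extras_build_collection_summary();
    $this->logger()->success('Collection usage summary rebuilt.');
  }

}
```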