Islandora-Labs / islandora_usage_stats_charts

Module that provides charts showing usage statistics for Islandora objects.
GNU General Public License v3.0
1 stars 0 forks source link

"Download" (actually View) numbers WAAAY off #14

Closed bondjimbond closed 6 years ago

bondjimbond commented 6 years ago

See http://kora.kpu.ca/islandora/object/kora:43

Usage reports block shows 152 views ever, 14 downloads.

islandora_usage_stats_csv agrees.

This module reports a CRAZY number of views. Where did the number come from?

Note that both modules' exclusion lists are using the same default settings.

bondjimbond commented 6 years ago

screen shot 2018-03-27 at 4 14 25 pm

mjordan commented 6 years ago

This module is getting its data from the islandora_usage_stats_object_access_log and islandora_usage_stats_object_ds_access_log tables. Using raw SQL queries against those tables, I can compare those results with the data displayed in the charts. Run the queries yourself and figure out if the number of entries in those tables for you is not reflected in the charts. If the numbers are off, maybe there's a problem with the SQL query.

For a given object and its datastreams, you first need to get the pid_id:

SELECT islandora_usage_stats_objects.id AS id FROM islandora_usage_stats_objects islandora_usage_stats_objects WHERE pid = 'hiv:55';

The result is 277. Then, using that pid_id, issue the following SQL query to get the timestamps of the access log entries. You will need the current UNIX timestamp and one from 6 months ago (assuming that the number of months that you have configured in the chart module's admin settings. You can get the timestamps using https://www.unixtimestamp.com/index.php or by using this PHP code: $ago = strtotime('-' . $num_months . ' months', time());

SELECT islandora_usage_stats_object_access_log.time AS time FROM islandora_usage_stats_object_access_log islandora_usage_stats_object_access_log WHERE (pid_id IN ('277')) AND (time BETWEEN '1506729600' AND '1522417840');

I get this:

mysql> SELECT islandora_usage_stats_object_access_log.time AS time FROM islandora_usage_stats_object_access_log islandora_usage_stats_object_access_log WHERE (pid_id IN ('277')) AND (time BETWEEN '1506729600' AND '1522417840') ;
+------------+
| time       |
+------------+
| 1520097279 |
| 1520098123 |
| 1520098707 |
| 1520099292 |
| 1520099615 |
| 1520100364 |
| 1522417338 |
+------------+
7 rows in set (0.00 sec)

My chart for 'hiv:55' shows 7 hits. OK so far.

The SQL query to get downloads:

SELECT dsid,time FROM islandora_usage_stats_datastreams, islandora_usage_stats_object_ds_access_log WHERE islandora_usage_stats_object_ds_access_log.time BETWEEN '1506729600' AND '1522427840' AND islandora_usage_stats_datastreams.pid_id = '277' AND dsid NOT IN ('RELS-EXT','RELS-INT','TECHMD','MODS','DC');

I get the expected number (46), which is the same as the downloads in the chart.

It's tricky to issue these commands since you have to make sure the timestamps are not excluding hits/downloads you perform while testing. The "between" clause in the SQL is relative to when the chart is rendered, e.g. between now and six months ago. "Now" changes every second. You can view the object, or download a datastream, to see if the numbers increase, but if you do, make sure you adjust your timestamps. Also, make sure that the Omit PIs and Cooldown time settings in the main Usage Stats admin settings don't exclude your testing attempts.

mjordan commented 6 years ago

@bondjimbond WRT downloads, perhaps a reason that the downloads numbers are higher than you expect is that this module (currently) counts downloads of all datastreams except ones that are in the exception list. So if someone downloaded the TN and OBJ for the same object, the downloads count would increment by 2. So in effect, the downloads value is an aggregate of all the downloads of all datastreams other than any that excluded, not a count of only OBJ. My assumption is that this is not a big deal, since users don't normally download the TN, etc. This way of defining "downloads" allows us to not explicitly define which datastream we count for which content model.

If you wanted to define your own mapping of content model to datastream, we could add the ability to the drupal_alter hook to include the DSID, thereby allowing you to filter the data that gets pushed out to the chart.

mjordan commented 6 years ago

@bondjimbond have you had a chance to follow up on this? I know verifying the data looks complex, but I can't think of any other way to test the accuracy of the charts.

bondjimbond commented 6 years ago

Haven't had a chance to look into it yet, but I am planning to.

bondjimbond commented 6 years ago

@mjordan Any advice on how I could perform such SQL queries on my Vagrant machine?

mjordan commented 6 years ago
  1. mysql -uroot -pislandora drupal7
  2. Get the pid_id for the object you are testing: SELECT islandora_usage_stats_objects.id AS id FROM islandora_usage_stats_objects islandora_usage_stats_objects WHERE pid = 'hiv:55'; (use your own PID here instead of hiv:55)
  3. Get the unix timestamp of six months ago (this is to the second) at https://www.unixtimestamp.com/index.php
  4. The query that gets the usage entries is (plug in your pid_id and timestamp) is: SELECT islandora_usage_stats_object_access_log.time AS time FROM islandora_usage_stats_object_access_log islandora_usage_stats_object_access_log WHERE (pid_id IN ('277')) AND (time BETWEEN '1506729600' AND '1522417840');

This should get you going.

bondjimbond commented 6 years ago

Numbers seem to check out OK so far on my test objects..

But can you explain what's happening here? https://arcabc.ca/islandora/object/kora:43

bondjimbond commented 6 years ago

On the above example, the stats are incredibly crazy. 22,817 downloads in January? Whereas the CSV shows no downloads or views in January.

1,237 in March according to the chart, 2 in March according to the CSV.

mjordan commented 6 years ago

I suspect the difference between the highest month and the lowest month is causing the chart to distort. If you have the "Limit chart height to its width" option checked in your config, can you uncheck it and see if that chart is still distorted?

mjordan commented 6 years ago

Can you send me the CSV?

bondjimbond commented 6 years ago

There should be very little difference -- I have no idea where these inflated numbers are coming from. CSV converted to XLSX attached. Charts and CSV have the same filters on them.

I've turned off the "limit chart height" option, and the result is the same.. take a look.

usage_stats_kora_43.xlsx

mjordan commented 6 years ago

I wonder if the SQL query it timing out on 2018-01 and not getting into the CSV?

bondjimbond commented 6 years ago

I think the CSV numbers are more realistic than the chart numbers. I just have no idea what the real numbers are - I can't access the MySQL database (no user ID or password for it).

But perhaps there's another way to get at this information. Using islandora_usage_stats_callbacks I get an all-time history for this object of 2609 views, 1164 downloads (which also seems unrealistic, but not as insane as the chart). Can you advise on what filters I should put in the Charts config in order to mimic what Callbacks fetches? Perhaps by comparing the two we can figure out if there's something strange happening with this module, or the database, or something else entirely.

mjordan commented 6 years ago

Good idea. Let me look into this. I am thinking that at this early stage in the life of the charts and CSV module, we need some diagnostic/testing tools. I've got a utility module (https://git.lib.sfu.ca/mjordan/islandora_usage_stats_generate_test_access_data) that will load test data, but we should be thinking about more.

mjordan commented 6 years ago

Lost track of this... but I'll point out that you can access the username and password of your database in your site's setting.php file, if you wanted to take a look directly in the db.

ajbosman commented 6 years ago

Thank you for this module, which roughly meets my needs 👍

I was testing the module in a test/dev Islandora, an old 7.x-1.5 version, and found out about the waaaaay to high download counts. I looked at the download queries in includes/block.inc (lines 92 and 95). The queries are missing an important limitation, which can be fixed by adding this to the WHERE clause:

AND {islandora_usage_stats_datastreams}.id = {islandora_usage_stats_object_ds_access_log}.ds_id

This limits on the datastream id's of the selected pid. Otherwise you count ALL the datastreams with the selected dsid (or dsid's) between :ago and :now instead of only the ones of the selected pid_id.

The complete query on line 92 becomes: $downloads_result = db_query("SELECT dsid,time FROM {islandora_usage_stats_datastreams}, {islandora_usage_stats_object_ds_access_log} WHERE {islandora_usage_stats_object_ds_access_log}.time BETWEEN :ago AND :now AND {islandora_usage_stats_datastreams}.pid_id = :pid_id AND {islandora_usage_stats_datastreams}.id = {islandora_usage_stats_object_ds_access_log}.ds_id AND dsid NOT IN (:excluded_dsids)", $placeholders);

And the complete query on line 95 : $downloads_result = db_query("SELECT time FROM {islandora_usage_stats_datastreams}, {islandora_usage_stats_object_ds_access_log} WHERE {islandora_usage_stats_object_ds_access_log}.time BETWEEN :ago AND :now AND {islandora_usage_stats_datastreams}.pid_id = :pid_id AND {islandora_usage_stats_datastreams}.id = {islandora_usage_stats_object_ds_access_log}.ds_id", $placeholders);

mjordan commented 6 years ago

@ajbosman thanks for spotting that! I've pushed up the issue-14 branch incorporating your fix.

@bondjimbond can you test that branch to see if the results are more consistent with what you are expecting?

bondjimbond commented 6 years ago

Awesome! kora:43 went from 12,000 downloads in April to 12. I think this works.

mjordan commented 6 years ago

Thanks for confirming - I'll merge the issue-14 branch into 7.x later today and close this. Thanks to both of you.

mjordan commented 6 years ago

Closed with cf18fa097e937a96f8bde3a18738e3f07bd8cffe.

Thanks again @bondjimbond for raising the issue and @ajbosman for resolving it.

mjordan commented 6 years ago

FYI, those WHERE constraints are already in the Usage Stats CSV module.