Seeking views of administrators on proposal to expose audiobook keywords and promote exploration of our catalog via keyword hyperlinks

peterjdann commented 3 months ago

redrun has suggested to me that before creating any signifcant pull request, it would be a good idea if I were to put what I have in mind before administrators to get their views on whether the proposed change makes sense at all, or perhaps makes sense in principle, but requires modification. I think this is a most reasonable suggestion, but am not sure by what mechanism I should make any such approach.

I have attached a PDF file which I hope shows pretty clearly, at a GUI level at least, what I have in mind, with screenshots taken on my local development environment.

I'd be interested in the thoughts of all stakeholders. If nothing else, I've learned a hell of a lot by exploring this question. If this doesn't go anywhere, I've at least got that to hang onto!

Proposal to expose keyterms to catalog users 2024 03 19.pdf

redrun45 commented 3 months ago

Personally, I like this a lot. As for consensus: discussion has just begun. Public discussion here, and admins will probably have an internal discussion as well.

On a purely technical note (else I'd post it in the linked forum thread), I think we could benefit from counting up the projects that use each keyword with a cron job, just like we do with authors and genres. It would grow our database, but I think it would still be much faster than cross-referencing and counting projects on page load. Worth testing!

peterjdann commented 3 months ago

Thank you! This is a great example of why it’s good to get other people’s perspectives. I had not given any consideration at all to performance issues, and certainly not to this elegant potential solution.

I did a bit of hunting around today for some kind of off-the-shelf load testing application and came across locust, which seems to fit the bill nicely for this use case. Sadly, however, locust is unable to issue a successful GET to any page on my local Apache server because of the dreaded SSL certificate issue, which none of the workarounds I have read about online this afternoon seem to address successfully anymore. I can see locust’s requests in the Apache access log, where they all show up with a 301 return code, while locust itself reports the result of every GET as “SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate (_ssl.c:1007)'))”. That’s a pity, because it looks like locust is a really brilliant tool.

Best wishes,

Peter Dann

On 20 Mar 2024, at 3:27 am, redrun45 @.***> wrote:

Personally, I like this a lot. As for consensus: discussion has just begun. Public discussion here https://forum.librivox.org/viewtopic.php?p=2310123#p2310123, and admins will probably have an internal discussion as well.

On a purely technical note (else I'd post it in the linked forum thread), I think we could benefit from counting up the projects that use each keyword with a cron job, just like we do with authors and genres https://github.com/LibriVox/librivox-catalog/blob/master/application/controllers/cron/Project_status_stats.php. It would grow our database, but I think it would still be much faster than cross-referencing and counting projects on page load. Worth testing!

— Reply to this email directly, view it on GitHub https://github.com/LibriVox/librivox-catalog/issues/210#issuecomment-2007627314, or unsubscribe https://github.com/notifications/unsubscribe-auth/APUJ53H43FADGUF6KB74QIDYZBRONAVCNFSM6AAAAABE42RKFWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBXGYZDOMZRGQ. You are receiving this because you authored the thread.

redrun45 commented 3 months ago

If you're writing your own 'locustfile' test specs, this might be of use. Apparently, you can use the same options as this module. Without trying it myself: .get('https://librivox.org/<your query>', verify=False) ... ought to help with that SSL verification.

peterjdann commented 3 months ago

Suggestion from redrun45 proved to be a lifesaver. Tbank you! I have now run load tests with locust against three scenarios:

Baseline (main branch, untouched)
Version of code that calculated keyword usage dynamically at runtime
Version of code that reads usage statistics from keywords table that are presumed to be updated by a cron job, but which I have, for now, updated manually using the same SQL that would be used in any putative cron job

I have attached in a zip file reports from all three runs, together with the locust file I used for these tests. In brief, the results show the following average response times on my local machine under the test conditions I set up:

Baseline (no code changes): 491 ms
Precalculated statistics: 484 ms
Dynamically calculated statistics: 816 ms

At face value, anyway, this suggests to me that we could display keywords to catalog users with no discernible degradation of server performance provided we kept the keyword usage statistics updated with a cron job, rather than calculating them each time on the fly. I'm no expert in load testing. If the test conditions I've used here do not appear to be sufficiently rigorous, please let me know. locust file and load test reports.zip

notartom commented 3 months ago

That's a pretty massive hit for dynamic statistics. We could always go with the Cron job, but I'm curious what the SQL to calculate those was, and whether it can be made faster by adding foreign keys and/or indexes to our schema.

peterjdann commented 3 months ago

Good point.

I have used the following SQL for a dynamic count of the number of instances where a keyword is in use. As you can see, it performs a join involving project_keywords.keyword.id. This column, however, is not currently indexed in the database.

    SELECT keywords.id, keywords.value, sub.keyword_count
    FROM keywords
    JOIN project_keywords 
    ON keywords.id = project_keywords.keyword_id
    JOIN (SELECT keyword_id,
           COUNT(project_id) AS keyword_count
           FROM project_keywords
           GROUP BY 1
           ) sub
    ON keywords.id = sub.keyword_id
    WHERE project_keywords.project_id = ?
    ORDER BY sub.keyword_count DESC'

It would certainly be a good idea to establish a foreign key relationship between project_keywords.keyword_id and keywords.id.

My first attempt to set one up was frustrated by the fact that there are 1910 instances where project_keywords.keyword_id = 0. I needed to delete these before I could set up a foreign key on this column. I then set up an index on. project_keywords.keyword_id.

Used the following SQL through these adventures:

SELECT keyword_id from project_keywords where keyword_id NOT IN (select id from keywords);

===

DELETE from project_keywords where keyword_id < 1;

===

ALTER TABLE project_keywords ADD CONSTRAINT FOREIGN KEY (keyword_id) REFERENCES keywords(id) ON DELETE CASCADE;

===

CREATE INDEX keyword_id ON project_keywords(keyword_id);

After making these DB changes, I load tested my version of the code that counts keyword instances at runtime (ie, no cron job updating required) using the same load test I used previously, and this time got an average response time of 546 ms.

I have attached this load test report. 05 load test using dynamically calculated counts of keyword usage after indexing previously unindexed column in project_keywords table.html.zip

peterjdann commented 3 months ago

After sleeping on it, realised that logically there should also be a foreign key relationship between project_keywords and projects table. Set this up: ALTER TABLE project_keywords ADD CONSTRAINT FOREIGN KEY (projects_id) REFERENCES projects(id) ON DELETE CASCADE;

After making this further DB change, ran the same load test again, this time getting an average response time of 571 ms. I have attached this load test report. 06 load test using dynamicallly calculated counts of keyworkd usage after additionaly settin up foreign key relationship between project_keywords and projects tables.html.zip

redrun45 commented 3 months ago

We've kicked this back and forth and not had any objections, other than to changing the 'keyword' term to 'keyterm'. When the PR is ready, I'll be glad to test it, and we might also want this on the staging server for other admins. 😄

LibriVox / librivox-catalog

Seeking views of administrators on proposal to expose audiobook keywords and promote exploration of our catalog via keyword hyperlinks #210