LibriVox / librivox-catalog

LibriVox catalog and reader workflow application
https://librivox.org
MIT License
37 stars 17 forks source link

Keywords search dropdown for Advanced Search #231

Open peterjdann opened 3 months ago

peterjdann commented 3 months ago

This pull request proposes: (a) Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution" (b) Adding an autocomplete style dropdown list to the keywords field of the catalog Advanced Search

I believe implementing this dropdown will make it much easier than it is presently for users prepared to try our Advanced Search function to find an audiobook that may match a specific subject or genre interest.

Anyone reviewing this code proposal will notice that the SQL used to populate the keywords dropdown list does a look up of the project_keywords table. The purpose of this lookup is to avoid listing in the keywords dropdown keywords which are, in effect, orphans — that is to say, terms that appear in the keywords table, but are never referenced in the project_keywords table. In the scrubbed database, there are currently 2928 such terms. They can be viewed by running the following SQL against the scrubbed database:

SELECT DISTINCT k.value, k.id FROM keywords k where k.id NOT IN (SELECT DISTINCT pk.keyword_id from project_keywords pk) ORDER BY k.id ASC;

peterjdann commented 3 months ago

Have made a number of changes now in the light of garethsime's very helpful suggestions. Thank you very much for your care and attention to this, garethsime!

peterjdann commented 3 months ago

Have just added keywords escaping, as suggested by Gareth. Thank you again, Gareth. Bit of a dimwit you're dealing with here!

garethsime commented 3 months ago

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time :slightly_smiling_face:

notartom commented 3 months ago

Hi! I'm not a Librivox dev, so feel free to ignore my comments, but I do contribute occasionally, so I thought I'd throw in my suggestions 🙂

You're as much a Librivox dev as any of us - as I've mentioned before in various places, the original author has long since moved on, so we're all trying our best to figure out the code base and make incremental improvements, hopefully without breaking anything. Your suggestions and comments are definitely valid, and I welcome your reviews of any PR.

peterjdann commented 3 months ago

Does strike me as possibly overkill, but I've now parameterised the query that looks up keywords in use, following a comment from notartom (which comment has now been deleted, I think?)

peterjdann commented 3 months ago

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time 🙂

Have looked at my old notes on how I did this the first time on a Mac running Ubuntu under Parallels Desktop. Was able to pare down the necessary instructions to a fairly simple procedure, which I was able to verify worked correctly on a second Mac of mine running a similar setup this afternoon. I have posted those instructions in the place where the original discussion took place.

peterjdann commented 3 months ago

Thé IS a thing in French (tea). to a quick check before you delete that.

On Wed, 19 Jun 2024 at 8:48 AM, redrun45 @.***> wrote:

@.**** commented on this pull request.

In application/libraries/Librivox_search.php https://github.com/LibriVox/librivox-catalog/pull/231#discussion_r1645189996 :

@@ -122,7 +122,7 @@ function advanced_title_search($params) $keyword_clause = ''; if (!empty($params['keywords'])) {

  • $keywords = explode(' ', $params['keywords']); //maybe preg_match if extra spaces cause trouble - thinnk we're ok
  • $keywords = array($params['keywords']);

I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment.

Noted! If we ever get around to an auto-completing, multi-keyword-entering box, we can use that in the Project Template Generator and here in Advanced Search. Meanwhile, the visibility of what keywords we might search, straight from within the search form, is a big step forward.

Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.)

This part also seems like at least a slight improvement overall, but comes with a trade-off. If there's a way to alleviate that down-side, great! Otherwise, we may take two steps forward,one step back, and then be happy about it.

The problem that this particular change would solve: Searching by keyword for "American Revolution" currently shows keywords like "American Humor" and "French Revolution". Yeah, that's bad. 😅

But, on the other hand: Searching by keyword for "Winnie the Pooh" currently shows results tagged with "winnie", "winnie-the-pooh", "winnie the pooh", and simply "pooh". This is good. Dear Pooh may be a silly old bear, but his nose will always lead him somewhere. (Oh, and this same search will also show the one project with keyword "Thé". Excuse me while I make that not a thing, on live...) A few projects will certainly have their keywords updated, as those become both more visible and more useful, but overhauling the keywords on our thousands of projects is actually even more work than most power-users would imagine, and is not in the scope of this PR.

If quoting a multi-word "keyword" to keep it together is easier for either of you "hotshots" than it is for me (I've not written so much new code as either of you!), or if you can think of another way to keep the best of both worlds, then I'm all ears. Otherwise, I'll certainly be here for the "running by" when we're sure we know our limitations. 😉

— Reply to this email directly, view it on GitHub https://github.com/LibriVox/librivox-catalog/pull/231#discussion_r1645189996, or unsubscribe https://github.com/notifications/unsubscribe-auth/APUJ53HOQIHSBGGCGUXSUW3ZIC2NHAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCMRWGY4TCNJVG4 . You are receiving this because you authored the thread.Message ID: @.***>

redrun45 commented 3 months ago

Thé IS a thing in French (tea). to a quick check before you delete that.

Thanks, that's good to know! But here, it was short for 'Thénardier', having been truncated by accident.

peterjdann commented 3 months ago

Very good! Yes, sadly, exposing keywords is also going to expose keyblunders — there’s no denying that. I personally believe, though, the upside outweighs the downside.

On 19 Jun 2024, at 9:44 AM, redrun45 @.***> wrote:

Thé IS a thing in French (tea). to a quick check before you delete that.

Thanks, that's good to know! But here, it was short for 'Thénardier', having been truncated by accident.

— Reply to this email directly, view it on GitHub https://github.com/LibriVox/librivox-catalog/pull/231#issuecomment-2177260583, or unsubscribe https://github.com/notifications/unsubscribe-auth/APUJ53B4INR2TI2CAWUUCYTZIDA6VAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZXGI3DANJYGM. You are receiving this because you authored the thread.

redrun45 commented 3 months ago

Very good! Yes, sadly, exposing keywords is also going to expose keyblunders — there’s no denying that. I personally believe, though, the upside outweighs the downside.

Agreed there! When keywords are more useful (as they will be, thanks to your changes here and elsewhere!), the obviously incorrect ones will be corrected over time. But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

peterjdann commented 3 months ago

But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

I think an "Exact match" checkbox would be a backward step. At the moment, my code is matching all terms that begin with what the user has entered. In practice, this means that if the user enters, say, "french revolution", that term is going to show up as the first hit, but the user will see other hits as well. In effect, my current code is delivering an exact match (if there is one) but also partially fuzzy matches.

Let me run past you an approach that you might find an improvement on how I've done this at the moment.

Imagine a scenario where the user has entered the search term "roman" and this causes the following SQL to be executed:

select value as text, 'A' as priority from keywords where value like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',') UNION select concat ('=== ', 'other terms containing ', '"roman"', ' ===') as text, 'B' as priority UNION select value as text, 'C' as priority from keywords where value like '%roman%' and value not like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',') order by priority ASC, text ASC limit 200

I know the "limit" value here is not what we actually use, but just for the purposes of illustration right now we need a high value for "limit" ito see the effect I'm after. (Of course we would also need to strip out the "priority" values from what the SQL is returning — I'm not suggesting we're going to display those to our users).

The point is, we can separately display what are, in effect, two lists in one return set, one of which lists is "fuzzier" than the other, while in effect documenting what we're showing the user within the result set itself.

Why do this, and not just present the user with a "fully fuzzy" result set? Well, try it using the same example:

select value as text from keywords where value like '%roman%' and not INSTR(value, ';') and not INSTR(value, ',') order by text ASC limit 200

It seems to me the result set is much less helpful.

As for the

not INSTR(value, ';') and not INSTR(value, ',') why am I now proposing that?

Well, try running the same SQL but with that conditions stripped out. Turns out there really are some pretty messy entries in our keywords table that need to be rationalised. Down the track, it might be possible to write a utility page to help with that task — but for now, it might be best to hide some easily excludable bad examples from our users.

redrun45 commented 3 months ago

Just to be sure we're on the same page: I'd suggest a check-box for the actual search, not necessarily for making the keyword autocomplete suggestions more fuzzy. You make good points on autocomplete. The "prioritized" version looks excellent from where I sit, too. 😉

The part I suggest making optional is this change in Librivox_search.php. The opening comment for this PR describes that change as:

Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"

When the user intends to search for projects tagged with a particular multi-word keyword, which they've selected from the autocomplete suggestion list... then doing a fuzzy search for projects with other keywords would indeed be a bug!

But if a user intends to do a broader search across multiple relevant keywords, that fuzzy matching is the best we have at the moment. Let's say I want to study the history of France in a particular period. I select the Genre as "*Non-Fiction -> History -> Early Modern". Then instead of selecting just one keyword from a list of suggestions, I check or un-check a box, and then enter a collection of words that might appear in any number of relevant keywords. Say, "french", "france", "paris". Would it be nice if we could pick several multi-word terms like "french revolution" and "white hoods of paris"? Absolutely! But if we don't (yet) have that, we might want to keep this (otherwise buggy) code as an alternate mode.


Ok! Suggestion made. It is definitely more work, even if a check-box is the simplest thing I know to ask for. Let me know what you think.

peterjdann commented 3 months ago

Thanks for this clarification.

Can I ask if you’ve actually had a chance to play around in a system that has implemented this PR to see how quickly and easily it helps you find stuff you might be interested in? While obviously I don’t know this for sure, I can’t help wondering if your request here is based on an old way of doing things that there may not be a lot of point in clinging to if this PR and my other PR about exposing keywords on project catalog pages do make it into production. Would it REALLY be that hard to find what you’re looking for using this new approach? (My bet is that you would actually find this new way ever so much faster and easier — but then, perhaps you have played around with this for some time, as I’ve implemented it, and you really don’t find this new approach all that great?)

To be clear, you appear to be suggesting an approach that would see the autocomplete behaviour I’ve enabled “turned off” at the selection of a checkbox. You’ve not suggested what succinct form of words you’d propose to use as a label for this checkbox that is going to make clear to users what to enter in order to get what results.

Personally, I see very little value in going to all the trouble to implement the following, given how easy keywords searching is going to be without it anyway, but I can picture a “solution” that might satisfying what I personally think is an over-the-top requirement where instead of a checkbox we had a dropdown that included the following two options for keywords searching:

“Click a term matching your entry” “Type comma-separated list”

The first of these would be the default, and would result in the autocomplete dropdown behaviour I have implemented. If someone chose from the dropdown “political fiction”, they would see in their search results only projects that had “political fiction” as a keywords entry. They would not see all projects which have “fiction” as a keywords entry.

The second would allow a user to type, say,

‘french, france, paris, french revolution, white hoods of Paris, winnie-the-pooh, pooh bear’ in the same (now non-autocompleting) field

and see a search result set that included all projects that included any one of those literal comma-separated terms. It would NOT, however, include a project with the keywords ‘revolution’ or ‘bear’.

I can see that this could be technically possible — but really, is this necessary now, with all the required JavaScript and so on? Have you really found the approach I’ve implemented so frustrating and difficult to use in practice?

Out of interest, I did a check and found that of the 29,162 entries in the scrubbed database keywords table, nearly half are terms which contain a space not in the first position. It is currently impossible to conduct a straightforward keywords search using any of these terms without getting a host of what I regard as spurious results and unexpected results. It does seem to me that if the person who designed this system in the first place had intended the keyword search to operate as it does, I would have expected them to have prevented the use of space characters in keywords at the time they are first entered in the system. There’s certainly nothing in the current user interface that gives a user any clue that searching for “political fiction” is going to return every project that has the keywords “fiction”, and treating a space character as a list item separator is pretty unusual, in my experience. Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

On 23 Jun 2024, at 1:46 AM, redrun45 @.***> wrote:

Just to be sure we're on the same page: I'd suggest a check-box for the actual search, not necessarily for making the keyword autocomplete suggestions more fuzzy. You make good points on autocomplete. The "prioritized" version looks excellent from where I sit, too. 😉

The part I suggest making optional is this change in Librivox_search.php. The opening comment for this PR describes that change as:

Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"

When the user intends to search for projects tagged with a particular multi-word keyword, which they've selected from the autocomplete suggestion list... then doing a fuzzy search for projects with other keywords would indeed be a bug!

But if a user intends to do a broader search across multiple relevant keywords, that fuzzy matching is the best we have at the moment. Let's say I want to study the history of France in a particular period. I select the Genre as "*Non-Fiction -> History -> Early Modern". Then instead of selecting just one keyword from a list of suggestions, I check or un-check a box, and then enter a collection of words that might appear in any number of relevant keywords. Say, "french", "france", "paris". Would it be nice if we could pick several multi-word terms like "french revolution" and "white hoods of paris"? Absolutely! But if we don't (yet) have that, we might want to keep this (otherwise buggy) code as an alternate mode.

Ok! Suggestion made. It is definitely more work, even if a check-box is the simplest thing I know to ask for. Let me know what you think.

— Reply to this email directly, view it on GitHub https://github.com/LibriVox/librivox-catalog/pull/231#issuecomment-2184078253, or unsubscribe https://github.com/notifications/unsubscribe-auth/APUJ53AJDA7VP3UYCRWJ5MTZIWL4DAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBUGA3TQMRVGM. You are receiving this because you authored the thread.

redrun45 commented 3 months ago

Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

Fair enough! I'm not going to argue. I only intended to let you know of an objection I thought would likely be raised. I'll make sure folks know this PR is ready for review as-is.

peterjdann commented 3 months ago

Thank you! Can I suggest you pause any further review until I have implemented the kind of prioritised list I foreshadowed with that revised SQL? So far, that’s not yet part of this PR. I’d like to add it. Should be done within 48 hours, I expect.

On 23 Jun 2024, at 10:40 AM, redrun45 @.***> wrote:

Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

Fair enough! I'm not going to argue. I only intended to let you know of an objection I thought would likely be raised. I'll make sure folks know this PR is ready for review as-is.

— Reply to this email directly, view it on GitHub https://github.com/LibriVox/librivox-catalog/pull/231#issuecomment-2184270715, or unsubscribe https://github.com/notifications/unsubscribe-auth/APUJ53EORSDABAMF6RRRDLTZIYKOFAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBUGI3TANZRGU. You are receiving this because you authored the thread.

peterjdann commented 3 months ago

Have now changed SQL so the list of keywords presented in the dropdown shows, as first priority, all keywords that START WITH the letters the user has typed, then a text separator indicating that the entries below the separator are some terms that INCLUDE the letters the user has typed, then as last priority terms not already presented that INCLUDE the letter sequence the user has typed. The maximum number of entries shown in the dropdown is 200.

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters.

The query term is now both "like-suitable" escaped and parameterised.

To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

redrun45 commented 3 months ago

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters. ... To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

I didn't look to close until now, in case you intended to make that foreshadowed change. :wink: Actually, it's surprising how correlated these two things are. It would seem that any time either a ";" or "," shows up inside a keyword, it's a typo. In fact, most of those have already been fixed, and the remaining few can be done pretty easily! :partying_face:

I was going to suggest adding AND instances > 1 to the SELECT statements for autocomplete, now that I know the old, typo'd keywords are still in there. But of course, learning this, I realize that we'll need an extra line in update_keyword_stats, so that unused keywords get set to instances = 0!

I had assumed the unused keywords were deleted, so didn't even think to check that, for all my "hammering". I'll add another suggestion to #225. :sweat_smile: