IUBLibTech / newton_chymistry

New version of 'The Chymistry of Isaac Newton', using XProc pipelines to generate a website based on TEI XML encodings of Newton's alchemical manuscripts, and Apache Solr as a search engine.

Highlighting of phrase search results seems to be applied for individual words of the phrase #101

Open tubesoft opened 3 years ago

tubesoft commented 3 years ago

As we discussed in the previous meeting (Feb 16), I first thought that an OR search would be executed when we enter a phrase like Aqua fortis in the search form. However, I found that the webapp actually executes a phrasal search! As evidence, when I search for "Aqua fortis" (without double quotes in the form), I get 17 results.

image

I also executed the query, http://localhost:8983/solr/chymistry/select?q=text%3A%22Aqua%20fortis%22 (or text:"Aqua fortis") directly on Solr. The phrase was actually double-quoted, which triggers phrasal search in Solr. I also got 17 results!

image

On the other hand, when I executed text:Aqua fortis (without double-quotes), I got 83 results.

Judging from these facts, when we enter a phrase in the search form, a phrasal search seems to be executed even without double-quotation.
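The two direct Solr queries compared above differ only in their quoting. Here is a minimal Python sketch of how the two URLs could be built, assuming the local core URL quoted above. The helper name `select_url` is hypothetical and not part of the project.

```python
from urllib.parse import quote

# Base URL of the local core, taken from the query quoted above.
SOLR_SELECT = "http://localhost:8983/solr/chymistry/select"

def select_url(q: str) -> str:
    """Return a select URL with the query string percent-encoded."""
    return f"{SOLR_SELECT}?q={quote(q, safe='')}"

# Double quotes make the standard query parser treat the words as one phrase.
phrase_url = select_url('text:"Aqua fortis"')

# Without quotes only "Aqua" is tied to the text field; "fortis" is parsed
# as a separate clause against whatever default field is configured.
words_url = select_url("text:Aqua fortis")

print(phrase_url)
print(words_url)
```

The unquoted form is therefore not simply a looser version of the phrase query, which may partly explain the jump from 17 to 83 results.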

So I assume the issue to think about is how to highlight the results. The search result highlights the matching phrases as well as the individual words of the phrase. I might have to raise at the meeting whether only the exact phrase should be highlighted, but I personally think highlighting should be limited to the exact phrase, since it is the result of a phrasal search.

If we decide to modify the highlighting, we might have to ask for @Conal-Tuohy 's help!

tubesoft commented 3 years ago

Hello, @Conal-Tuohy. In the previous meeting (2/23), we discussed the text search issues in detail, and I have listed the search patterns and their issues below. I also wonder whether I can or should work on these issues myself, with some advice from you, or whether I should just ask you to fix the code.

The list consists of search words on the web page, the current result on the web page, and how we want to fix them.

Search field: "aqua fortis" Result: The double quotes are treated as literal characters, so we get no results. Fix: q=text:"aqua fortis" should be executed. See also the next Fix.

Search field: aqua fortis Result: A phrasal search is executed, equivalent to the Solr query q=text:"aqua fortis". Fix: Result highlighting is applied not only to the phrase itself but also to each word of the phrase individually. The individually highlighted words seem to come from passages of a certain length that contain all the words of the phrase. We would prefer the search experience to be close to Google's: double-quoting the phrase as above should execute a strict phrasal search, and unquoted input should execute an OR search that prioritizes phrasal matches. For the strict phrasal search, highlighting should apply only to the phrase itself.

Search field: aqu* Result: Wildcard search is enabled. Fix: If possible, a Latin dictionary search would be preferable, to make it easier to search across the variety of Latin declensions and conjugations, but as far as I could find, Solr has no Latin search mode.

Search field: *qua Result: Wildcard search is enabled. Fix: No need to fix.

Search field: aqu* fortis Result: Wildcard search is NOT enabled, and we get no results. Fix: As far as I could find, Solr does not generally allow wildcards inside a phrasal search. We might need a workaround.

Search field: aqu* fort* Result: Wildcard search is NOT enabled, and we get no results. Fix: Same as above.

Search field: (a phrase including symbols) Result: It seems to behave the same as a search with ordinary words only. Fix: We should check that it still works once the issues above are resolved.
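The desired behaviour in the list above (quoted input gives a strict phrase search, unquoted input gives an OR search that favours phrase matches) could be sketched roughly as follows. This is only an illustration in Python, not the app's XSLT: the function name `build_text_query` and the boost factor of 2 are assumptions, and special characters in the input are not escaped.

```python
def build_text_query(user_input: str, field: str = "text") -> str:
    """Map search-box input to a Solr standard-query-parser string (sketch)."""
    raw = user_input.strip()
    # Quoted input: strict phrase search, quotes passed through to Solr.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return f"{field}:{raw}"
    terms = raw.split()
    if not terms:
        return "*:*"  # empty box: match everything
    any_word = field + ":(" + " OR ".join(terms) + ")"
    # Wildcards are not expanded inside a quoted phrase, so no phrase
    # clause is added when any term contains one.
    if len(terms) == 1 or any("*" in t or "?" in t for t in terms):
        return any_word
    # Boosted phrase clause so exact-phrase matches score higher (boost is arbitrary).
    phrase = field + ':"' + " ".join(terms) + '"^2'
    return any_word + " OR " + phrase

print(build_text_query('"aqua fortis"'))
print(build_text_query("aqua fortis"))
print(build_text_query("aqu* fortis"))
```

Note that a boost only influences ranking when the clause is part of the main query; filter queries restrict the result set without affecting scores.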

Conal-Tuohy commented 3 years ago

As I see it, there are a few different issues here:

  1. Latin stemming
  2. Wild-card and phrasal search
  3. Highlighting

Latin stemming

It would be possible to patch Solr to be Latin-aware and automatically extract the stems of Latin words, which would allow for fuzzier searches. This would require writing a Latin "Stemmer" component in Java, and configuring Solr to use it. I know Java and I could write such a thing, but I would definitely need assistance with Latin grammar because I don't know Latin. It would be many hours' work though, for sure.

Wild-card and phrasal search

The app uses Solr's facet search API. The app accepts the field values posted from the website's HTML form, and it uses an XSLT stylesheet to convert those values into a JSON object which it sends to Solr as an HTTP POST, and then uses another stylesheet to convert Solr's response back to HTML. The stylesheet which formats the request as JSON is https://github.com/IUBLibTech/newton_chymistry/blob/master/xslt/search-parameters-to-solr-request.xsl and the search parameters are passed to it in XML like this:

<c:param-set>
   <c:param name="text" value="aqua fortis"/>
   <c:param name="symbol" value="♂ IRON (MARS)"/>
</c:param-set>

The stylesheet produces a JSON document which specifies a search query for anything at all (i.e. *:*) and then refines it with additional filter queries for each of the fields which the user specified. Those filter queries are all phrasal searches because the stylesheet wraps single quotes around each parameter value, but it should be enough to just remove those quotes, I think, and that would allow users to either enter quotes around their search terms, or not, as they prefer. If they don't use quotes, then Solr's fuzzy search operators will be available to them as well, and it'd probably be worth adding some "Search tips" to the search form to explain that extended search syntax. See https://lucene.apache.org/solr/guide/7_7/the-standard-query-parser.html
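To make the shape of that request concrete, here is a rough Python approximation of the JSON body, using the `query` and `filter` keys of Solr's JSON Request API. It is a sketch only: the real stylesheet output is not reproduced here, the function name `solr_request` is hypothetical, and the phrase delimiter is assumed to be the double quote that the standard query parser uses for phrases. Wrapping unquoted values in parentheses is also an assumption, made so that every word stays scoped to its field.

```python
import json

def solr_request(params: dict, phrase: bool) -> str:
    """Build a request body: match everything, then filter per field."""
    filters = []
    for field, value in params.items():
        if phrase:
            # Mimics the current behaviour: the whole value becomes one phrase.
            clause = '"' + value + '"'
        else:
            # Passes the user's text through, so quotes, wildcards and
            # fuzzy operators typed by the user reach Solr unchanged.
            clause = "(" + value + ")"
        filters.append(field + ":" + clause)
    return json.dumps({"query": "*:*", "filter": filters}, ensure_ascii=False)

print(solr_request({"text": "aqua fortis"}, phrase=True))
print(solr_request({"text": "aqua fortis"}, phrase=False))
```

With `phrase=False`, a user who types `"aqua fortis"` with quotes still gets a phrase filter, while `aqu*` or `aqua fortis` are parsed as ordinary terms.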

Highlighting

What's happening with the hit highlighting is that Solr's response includes a set of what it calls "snippets"; these are phrases or sentences in which the searched words appear. Each snippet represents a "hit", and within each snippet, the words that the user searched for are also marked up. The snippets are displayed in full on the search results page, with the keywords in each snippet highlighted. If you click through to the HTML page, the same snippets are retrieved from Solr and used to highlight the HTML. Each snippet is highlighted, and the keywords are also highlighted distinctly. The stylesheet which does the hit-highlighting also inserts next and previous links to the next and previous snippet. The actual formatting of the two types of highlight is determined by CSS. Note that the keywords which the user sought may not be contiguous if they're not doing a phrasal search, so it would help, I think, to retain some kind of highlighting for the entire matching snippet.
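For reference, the snippet behaviour described here is controlled by Solr's highlighting parameters. The Python sketch below only collects a few of the documented ones into a dictionary; whether the app sets any of them, and with what values, is an assumption, and the helper name `highlight_params` is hypothetical.

```python
def highlight_params(field: str = "text", strict_phrase: bool = True) -> dict:
    """Collect Solr highlighting parameters for one field (sketch)."""
    return {
        "hl": "true",        # turn highlighting on
        "hl.fl": field,      # field(s) to build snippets from
        "hl.snippets": "3",  # max snippets per field (value chosen arbitrarily)
        # When true, phrase queries are highlighted as whole phrases rather
        # than word by word. Solr documents true as the default, so this
        # sketch only makes the setting explicit.
        "hl.usePhraseHighlighter": "true" if strict_phrase else "false",
    }

print(highlight_params())
```

Because the main query here is `*:*`, the terms to highlight have to reach the highlighter by some other route (Solr offers `hl.q` for that), so the query actually used for highlighting is worth checking before changing any of these settings.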

I think it'd be a good idea to defer any work on the hit-highlighting until the phrase searching is replaced with word searching and fuzzy searching enabled, because that will change the kind of results you can get. It should be a separate issue.

tubesoft commented 3 years ago

@Conal-Tuohy Thank you so much for the detailed explanation!

As for Latin stemming, since I used to be a Java web developer and I also know basic Latin, I might be able to work on it too. If I do, I would ask you to point me to some resources about this technology.

As for wildcard and phrasal search, I am going to try what you suggest and see the difference.

As for highlighting, okay, we can work on it after the two issues above are done.

If I have further questions, I will ask again.

Conal-Tuohy commented 3 years ago

@tubesoft since I am looking into the phrase search issue myself, for @jawalsh, maybe you ought to wait a bit?

Regarding lemmatizing Latin in Solr, I don't actually know much about it myself, to be honest. The best I can do is refer you to https://lucene.apache.org/solr/guide/8_1/language-analysis.html

tubesoft commented 3 years ago

@Conal-Tuohy Yes, I can wait! Please let me know once you are done with it.

Okay, as for the Latin issues, I will try to do some research on how to implement it.

Conal-Tuohy commented 3 years ago

progress report: I've fixed the "phrasal search" part of this issue in the Swinburne website, but I haven't merged the fix with chymistry, yet. I've been almost entirely off work for a couple of weeks because of a virus, and I'm just getting back to it.

tubesoft commented 3 years ago

Thank you, Con! I am currently struggling with a basic Solr issue, namely adding a JAR plugin to my local Solr server. Although I think I managed to solve the issue by myself, I also asked about it on Stack Overflow. The next step will be figuring out how to implement it in the newton_chymistry project.

tubesoft commented 3 years ago

Here is an update! I added the Latin stemmer plugins found in this GitHub repository to my local Solr, and I modified the filters property in update-schema-from-field-definitions.xsl to call those Java classes.

Then, after re-indexing, the search seems to recognize Latin words. For example, when I search for "omnis", the results show various declensions.

image image

tubesoft commented 3 years ago

However, the stemmer sometimes also matches English words, as in the following image. I wonder if we also need to add a language recognition feature.

image
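Matching on English words is what one would expect from any stemmer that works purely on word endings. The toy Python sketch below illustrates the idea; the suffix list and the function `toy_stem` are made up for this illustration and are NOT the rule set of the plugin mentioned above.

```python
# Toy suffix list, longest first. An illustrative assumption only,
# not the rules used by the Latin stemmer plugin.
SUFFIXES = ("ibus", "ium", "ia", "is", "es", "em", "e", "i")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

# Declined forms of "omnis" collapse to one stem, so they match each other.
print({toy_stem(w) for w in ["omnis", "omnes", "omnia", "omnibus", "omnium"]})

# The same rules fire on English words, conflating unrelated terms.
print(toy_stem("fires"), toy_stem("fir"))
```

Since the rules look only at endings, they cannot tell Latin from English, which is why some form of language identification (or separate fields per language) might be needed.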