LibraryOfCongress / chronam

This software project is no longer being actively developed at the Library of Congress. Consider using the Open-ONI (https://github.com/open-oni) fork of the chronam software. Project mailing list: http://listserv.loc.gov/archives/chronam-users.html.

Examples of hit highlighting not present on thumbs or page view #53

Closed nwy closed 6 years ago

nwy commented 11 years ago

This bug does not seem related to punctuation or capitalization as ticketed in https://github.com/LibraryOfCongress/chronam/issues/35

Sample Search results page: Note 4th thumb with no highlights

http://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1836&date2=1922&proxtext=stocktonian&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic

And page view also with no highlights:

http://chroniclingamerica.loc.gov/lccn/sn85066387/1912-04-18/ed-1/seq-1/#date1=1836&index=3&rows=20&words=Stocktonian&searchType=basic&sequence=0&state=&date2=1922&proxtext=Stocktonian&y=19&x=14&dateFilterType=yearRange&page=1

Another sample search:

http://chroniclingamerica.loc.gov/search/pages/results/?date1=1836&rows=20&searchType=basic&state=&date2=1922&proxtext=the+call+leads+in+political&y=0&x=0&dateFilterType=yearRange&page=1&sort=relevance

keshavmagge commented 10 years ago

@dbrunton @nwy I think this is an edge case where the search results returned by SOLR contain matches produced by fuzzy searching (http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Fuzzy%20Searches). In the example here, the search is for the term "stocktonian". One of the pages returned in response contains the term "Stockton" (California). The code that highlights the search terms on the returned pages, however, is not smart enough to catch that and highlight "Stockton". Does that make sense?

We have three options: (1) stop doing fuzzy searches, so the response no longer includes hits that merely look like the search term and no hit goes without highlighting (this greatly reduces search flexibility, so I don't recommend it); (2) live with the odd missing highlight; or (3) look deeper and try to find a better solution (I recommend this :) ). Thoughts?

keshavmagge commented 10 years ago

No, sorry, let me back up one step. Here's another search scenario. I searched for the word "Institutional" and one of the hits did not have any highlighting. Here's the url to that search hit - http://chroniclingamerica.loc.gov/lccn/sn83045487/1916-03-04/ed-1/seq-32/#date1=1836&sort=relevance&rows=20&words=Institutional&searchType=basic&sequence=0&index=7&state=&date2=1922&proxtext=Institutional&y=-218&x=-1065&dateFilterType=yearRange&page=2

However, if you change the 'words' and 'proxtext' request parameters in the URL to something that we know is on that page ('lecture', for example), the URL looks like this:
http://chroniclingamerica.loc.gov/lccn/sn83045487/1916-03-04/ed-1/seq-32/#date1=1836&sort=relevance&rows=20&words=lecture&searchType=basic&sequence=0&index=7&state=&date2=1922&proxtext=lecture&y=-218&x=-1065&dateFilterType=yearRange&page=2

This works OK; highlighting is all good. This makes me wonder whether we have corrupted OCR XML files for some pages. I'm just thinking out loud. Still investigating.

nwy commented 10 years ago

@keshavmagge It seems to me that since the SOLR upgrade and release a few weeks ago, the bug I identified in this ticket has been fixed. I now see highlights on thumbs and page images for the examples I posted.

The punctuation bug still exists, however, and is documented in ticket #35.

I wonder if your search example of "institutional" falls under the category of ticket #35. When you view the OCR text of that page, there is an open quote character attached to the word institutional?

keshavmagge commented 10 years ago

All the missing highlighting scenarios I have encountered thus far are a result of ill-formed OCR, if you will. Here's an example (Notice the word 'stocktonian') - http://dpaste.com/1351938/

You would imagine that punctuation marks are stripped from lexemes before they make it to the OCR. However, there are many such punctuated words littered across many OCRs. We could get an ugly fix in to handle leading/trailing quotes gracefully. However, that is not a foolproof fix for other punctuation marks that may be sitting in these OCRs.

dbrunton commented 10 years ago

In a few cases, yes. In a few cases, no. If you look at a search for "avant garde" you will notice the second result has a hit highlight, but when you click through, it's missing: http://chroniclingamerica.loc.gov/lccn/sn86079080/1912-02-10/ed-1/seq-4/#date1=1836&index=1&rows=20&words=avant+garde+l%27avant-garde&searchType=basic&sequence=0&state=&date2=1922&proxtext=avant+garde&y=0&x=0&dateFilterType=yearRange&page=1

What's the difference in codepath for the thumbnails and the pages?

keshavmagge commented 10 years ago

@dbrunton It's the hyphen this time (http://dpaste.com/1354428/ line 4). It's interesting, though, that highlighting works on the search results page. This highlighting bug is getting more and more intricate.

keshavmagge commented 10 years ago

Here's my compiled list of observations so far.

When we search for matches on SOLR, we employ fancy proximity searches, fuzzy searches and term boosting to pull up results.

For example, borrowing the search words from @dbrunton's last comment, when we search for "avant garde", the search query for SOLR that the code generates is u'+type:page +date:[18360101 TO 19221231] +((ocr:("avant garde"~5)^10000 ) OR ocr_eng:"avant garde"~5 OR ocr_fre:"avant garde"~5 OR ocr_spa:"avant garde"~5 OR ocr_ger:"avant garde"~5 )'

Notice the proximity (~) and term boosting (^) at work. If we rewrite this query without these tricks, like so: u'+type:page +date:[18360101 TO 19221231] +(ocr:("avant garde") OR ocr_eng:"avant garde" OR ocr_fre:"avant garde" OR ocr_spa:"avant garde" OR ocr_ger:"avant garde" )', it doesn't even return the hit that was missing highlighting in the first case.
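To make the structure of that query easier to see, here is a minimal sketch of how such a string could be assembled. The function name `build_proximity_query` and its parameters are hypothetical; only the field names, the `~5` proximity and the `^10000` boost are taken from the query quoted above. This is not chronam's actual query-building code.

```python
def build_proximity_query(phrase, distance=5, boost=10000,
                          date_from="18360101", date_to="19221231"):
    """Assemble a Solr query string shaped like the one quoted above.

    Hypothetical helper for illustration only.
    """
    # Quote the phrase and escape any embedded double quotes.
    quoted = '"%s"' % phrase.replace('"', '\\"')

    # The generic ocr field gets both a proximity operator (~) and a boost (^).
    clauses = ['ocr:(%s~%d)^%d' % (quoted, distance, boost)]

    # The language-specific fields get the proximity operator only.
    for field in ("ocr_eng", "ocr_fre", "ocr_spa", "ocr_ger"):
        clauses.append('%s:%s~%d' % (field, quoted, distance))

    return '+type:page +date:[%s TO %s] +(%s)' % (
        date_from, date_to, ' OR '.join(clauses))


print(build_proximity_query("avant garde"))
```

Setting `distance=0` in a helper like this would approximate the stricter query, which is one way to compare the two result sets side by side.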

What I'm getting at is that we pull up matching pages containing text that is close to the search term, and then ask the JavaScript to highlight the original search term, which doesn't even exist in the OCR to begin with. The JavaScript pulls up word coordinates for a page by making a request to the .../coordinates/ endpoint, which looks like this: http://chroniclingamerica.loc.gov/lccn/sn86079068/1893-10-14/ed-1/seq-1/coordinates/

Does all this make sense? I think this is causing the JavaScript highlighting to give up on some pages.
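The mismatch described above can be sketched in a few lines. This assumes, purely for illustration, that the coordinates data boils down to a mapping from each OCR token to a list of bounding boxes; the real endpoint's JSON layout and the real highlighter (which is JavaScript) may differ. `find_boxes` is a hypothetical name.

```python
def find_boxes(search_words, coords):
    """Look up bounding boxes for each search word by exact token match.

    coords is an assumed shape: {ocr_token: [[x, y, w, h], ...]}.
    Matching is case-insensitive but otherwise exact.
    """
    lowered = {}
    for token, boxes in coords.items():
        lowered.setdefault(token.lower(), []).extend(boxes)
    return {word: lowered.get(word.lower(), []) for word in search_words}


# The OCR token carries an elision and a hyphen, so neither search word
# matches it exactly, even though the search engine returned this page.
page_coords = {
    "l'avant-garde": [[10, 20, 80, 12]],
    "lecture": [[5, 5, 40, 12]],
}

print(find_boxes(["avant", "garde", "lecture"], page_coords))
```

In this sketch "lecture" finds its box while "avant" and "garde" come back empty, which mirrors a page that is a valid search hit but shows no highlight.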

keshavmagge commented 10 years ago

@dbrunton We could perhaps use a regex here to handle cases where highlighting is missing because of leading/trailing quotes in OCR words: https://github.com/LibraryOfCongress/chronam/blob/master/core/static/js/highlight.js#L23

However, that is not foolproof either. There will always be search strings that SOLR matches but for which highlighting fails because of ill-constructed OCR. Thoughts?

keshavmagge commented 10 years ago

@dbrunton @eikeon Apparently there was already a regex to filter trash out of lexemes before compiling the word coordinates, but it only filtered out trailing punctuation marks. I corrected it to toss leading punctuation marks and trailing apostrophes: https://github.com/LibraryOfCongress/chronam/commit/283d9cea82915cd000da213f85fb72e94079435d

This should fix the majority of the highlighting anomalies. However, some search scenarios still slip through the cracks, such as terms that contain punctuation (hyphens) in the middle of the term or accent marks.
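As a rough illustration of this kind of cleanup (not the regex from the linked commit), the sketch below strips punctuation from both ends of an OCR token and leaves the middle untouched. The names `normalize_token` and `_EDGE_PUNCT` are hypothetical, and the set of characters treated as punctuation (ASCII punctuation plus curly quotes) is an assumption.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation plus curly single/double quotes.
_PUNCT = re.escape(string.punctuation + "\u201c\u201d\u2018\u2019")

# Match a run of punctuation at the start OR at the end of the token.
_EDGE_PUNCT = re.compile(r"^[%s]+|[%s]+$" % (_PUNCT, _PUNCT))


def normalize_token(token):
    """Strip leading and trailing punctuation; leave internal characters alone."""
    return _EDGE_PUNCT.sub("", token)


for raw in ['"stocktonian', "institutional,'", "l'avant-garde"]:
    print(repr(raw), "->", repr(normalize_token(raw)))
```

The last token comes through unchanged, which is the limitation noted above: punctuation in the middle of a term is not handled by edge stripping.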

keshavmagge commented 10 years ago

This was the regex we had earlier - https://github.com/LibraryOfCongress/chronam/commit/8431f85c646c254692d073dced160b3ea0bf5193#diff-3d73654bdbd8b7bca52b2e13383a2f94R5

nwy commented 7 years ago

hcar and I did some more testing on the hit highlighting after 3.10 was deployed to ndnpqrvlp01.

If a special character is located in the middle of the word, the search engine will find it and chronam will highlight it. If the character is at the end of the word, the search engine will find it but chronam will not highlight it.

Changing the character to the Roman counterpart does not change that. This is true in multiple languages.

  1. In Polish the term którzy is found and highlighted https://chroniclingamerica.loc.gov/search/pages/results/?lccn=sn90060821&lccn=sn90060823&lccn=sn90060824&dateFilterType=yearRange&date1=1886&date2=1922&language=&ortext=kt%C3%B3rzy&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced

  2. With the Polish term zawrzeć the term is found but not highlighted. https://chroniclingamerica.loc.gov/search/pages/results/?state=&lccn=sn90060821&lccn=sn90060823&lccn=sn90060824&dateFilterType=yearRange&date1=1886&date2=1922&language=&ortext=zawrze%C4%87&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced

  3. In Icelandic the term verða is found and highlighted https://chroniclingamerica.loc.gov/search/pages/results/?date1=1886&date2=1922&searchType=advanced&language=&lccn=sn90060662&proxdistance=5&state=&rows=20&ortext=ver%C3%B0a&proxtext=&phrasetext=&andtext=&dateFilterType=yearRange&page=1&sort=relevance

  4. The term mikið is found but not highlighted https://chroniclingamerica.loc.gov/search/pages/results/?state=&lccn=sn90060662&dateFilterType=yearRange&date1=1886&date2=1922&language=&ortext=miki%C3%B0+&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced

  5. With Finnish the search is a little more mixed: The term sekä does produce some highlights but not every time. http://chroniclingamerica.loc.gov/search/pages/results/?state=&lccn=2011260133&dateFilterType=yearRange&date1=1789&date2=1924&language=&ortext=sek%C3%A4&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced

  6. The term selittävät produces highlights for each page http://chroniclingamerica.loc.gov/search/pages/results/?state=&lccn=2011260133&dateFilterType=yearRange&date1=1789&date2=1924&language=&ortext=selitt%C3%A4v%C3%A4t++&andtext=&phrasetext=&proxtext=&proxdistance=5&rows=20&searchType=advanced
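The thread does not identify the cause of this pattern. One possibility that may be worth ruling out, offered here only as an unconfirmed assumption, is a trailing-punctuation regex that is not Unicode-aware: with ASCII-only matching, `\W` treats letters such as ć and ð as non-word characters, so a pattern anchored at the end of the token would remove them only when they are the final character. The sketch below shows that behavior in Python; the pattern is hypothetical and is not taken from chronam.

```python
import re

# Hypothetical cleanup pattern: strip trailing non-word characters.
# With re.ASCII, only [a-zA-Z0-9_] count as word characters.
TRAILING_ASCII = re.compile(r"\W+$", re.ASCII)

# Python 3 str patterns are Unicode-aware by default.
TRAILING_UNICODE = re.compile(r"\W+$")


def clean(token, pattern):
    """Remove whatever the given pattern matches at the end of the token."""
    return pattern.sub("", token)


for word in ["którzy", "zawrzeć", "verða", "mikið"]:
    print(word, "|", clean(word, TRAILING_ASCII), "|", clean(word, TRAILING_UNICODE))
```

If something like this were happening, tokens ending in a non-ASCII letter would be stored in truncated form and would no longer match the search word, while tokens with such a letter in the middle would be unaffected.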