Open joewiz opened 8 years ago
See https://github.com/HistoryAtState/frus/commit/fc3f8b36d9cda22e7771fe0786b909548927a285 for an experiment with stopwords.
For more on stopwords, see:
Regarding the "banquo's" example above, I did some experiments on this some months ago:
Related issues:
Search scope, index configuration, and plumbing issues
Search omissions
From Michael McCoyer:
Sending you another search “quirk” I just encountered (I hope when you encouraged me to send these along that you really meant it!) –
So, our common terminology in FRUS headers for the National Security Advisor is “President’s Assistant for National Security Affairs.” When I searched that syntax as a phrase in Kristin’s Human Rights volume (https://history.state.gov/historicaldocuments/frus1977-80v02), however, I received no hits. The phrase does, however, appear in a number of headers in the volume – e.g., Docs 4 and 16.
Is it possible that the search queries don’t search on headers, or that they search them differently? I will leave this puzzle in your capable hands…
The exact URL for his search that returned zero hits is:
However, I found that if I changed the apostrophe in "president's" from straight (') to curly (’), the expected hits suddenly appeared:
Thus, it appears that our search engine treats the curly quote as a literal character - like a letter in a word - rather than as punctuation that should be dropped. We need to get the search engine to treat curly quotes as straight quotes.
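One way to picture the fix described above is a normalization step applied to both the indexed text and the incoming query, so that the two forms of apostrophe compare equal. The following is a minimal sketch only: `normalize_quotes` and `QUOTE_MAP` are hypothetical names, and this is not how the site's search is currently implemented.

```python
# Hypothetical helper: map typographic quotes to their ASCII equivalents.
# For the two forms to match, this would have to run on both the text
# being indexed and the query string.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)


header = "President\u2019s Assistant for National Security Affairs"
query = "President's Assistant for National Security Affairs"
print(normalize_quotes(header) == normalize_quotes(query))
```

In a Lucene-based index the equivalent step would more likely live in the analyzer configuration than in application code; that choice is left open here.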
From @joshbotts via the mailbox:
Wanted to flag this mailbox inquiry as an instance where a case-sensitive search capability would come in handy. Searching for "goa" returns several hundred hits, but most of them are for abbreviations for "Government of ..." rather than the geographical entity in South Asia.
--> User story: I want to search for "Goa" and exclude hits that are upper case ("GOA")
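To make the user story concrete, here is a rough sketch of a case-sensitive post-filter applied to hits that a case-insensitive search has already returned. The function name and the two sample strings are invented for illustration; a real solution would probably need a case-preserving index rather than a filter like this.

```python
import re


def filter_case_sensitive(hits, term):
    """Keep only hits that contain term as a whole word with exact casing."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [hit for hit in hits if pattern.search(hit)]


# Invented sample strings, not taken from any FRUS document.
hits = [
    "The GOA has indicated its position.",
    "Portuguese forces in Goa were withdrawn.",
]
print(filter_case_sensitive(hits, "Goa"))
```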
The comment at https://github.com/HistoryAtState/hsg-shell/issues/255#issuecomment-476926820 has already been filed as a separate issue: https://github.com/HistoryAtState/hsg-shell/issues/289
This issue has been split into several existing and new issues (each with a backlink to this one):
Therefore closing this parent issue.
@joewiz To search for the term s/s, we can escape the forward slash with a backslash, s\/s, and the query will pass without the nasty Lucene error.
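The escaping could also be applied automatically before the query reaches the parser. The sketch below assumes the special-character set of Lucene's classic query parser syntax; `escape_lucene` is a hypothetical helper, not an existing function in the application.

```python
# Characters with special meaning in Lucene's classic query parser syntax
# (assumed set; && and || are covered by escaping & and | individually).
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(term: str) -> str:
    """Prefix every special character in term with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


print(escape_lucene("s/s"))  # s\/s
```

Escaping everything this way also disables operators a user typed on purpose, such as wildcards or phrase quotes, so it would only suit a "search this literally" mode.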
A search for Banquo will not yield documents containing Banquo's or Banquo’s. (Lucene's Standard Analyzer doesn't treat an apostrophe, straight or curly, as a word boundary, following the Unicode word boundary specification rules.)
A search for f-16 (the airplane) is effectively identical to a search for f 16, and will yield documents containing f or 16. Making this a phrase search, "f-16", helps, but it also returns results without the hyphen, like f 16. The same is true of the proximity search "f 16"~0.
A search for an acronym with a slash returns a Lucene parsing error, for example s/s (Office of the Secretariat Staff, Department of State). A workaround is to use phrase or proximity search, but again we would get undesired results, e.g., U.S.S.R. and U.S.S. are matches for "s/s".
@HistoryAtState/editors: Please post any more examples you know of, and we'll work with @HistoryAtState/existsolutions to try out different Lucene analyzers and/or add advanced search form controls to see what combination can produce our expected results.
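As one candidate to try for the Banquo case, a possessive-stripping step would let Banquo, Banquo's, and Banquo’s reduce to the same token. Lucene's analysis module includes an EnglishPossessiveFilter that works along these lines; whether it suits this index is untested here. The Python below only illustrates the idea, and `strip_possessive` is a hypothetical name.

```python
import re

# Matches a trailing possessive: a straight or curly apostrophe plus "s".
POSSESSIVE = re.compile(r"['\u2019]s$", re.IGNORECASE)


def strip_possessive(token: str) -> str:
    """Remove a trailing 's or ’s from a single token."""
    return POSSESSIVE.sub("", token)


for token in ["Banquo", "Banquo's", "Banquo\u2019s"]:
    print(strip_possessive(token))
```

This does nothing for the hyphen and slash cases, which depend on how the tokenizer splits words rather than on token endings.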