HistoryAtState / hsg-shell

Source code for the history.state.gov website
https://history.state.gov

Assemble list of quirks with our full text search engine (esp. punctuation) #255

Open joewiz opened 8 years ago

joewiz commented 8 years ago

@HistoryAtState/editors: Please post any more examples you know of, and we'll work with @HistoryAtState/existsolutions to try out different Lucene analyzers and/or add advanced search form controls to see what combination can produce our expected results.

joewiz commented 7 years ago

See https://github.com/HistoryAtState/frus/commit/fc3f8b36d9cda22e7771fe0786b909548927a285 for an experiment with stopwords.

joewiz commented 7 years ago

For more on stopwords, see:

joewiz commented 7 years ago

Regarding the banquo's example above, I did some experiments on this some months ago:

Related issues:

joewiz commented 6 years ago

Search scope, index configuration, and plumbing issues

Search omissions

joewiz commented 6 years ago

From Michael McCoyer:

Sending you another search “quirk” I just encountered (I hope when you encouraged me to send these along that you really meant it!) –

So, our common terminology in FRUS headers for the National Security Advisor is “President’s Assistant for National Security Affairs.” When I searched that syntax as a phrase in Kristin’s Human Rights volume (https://history.state.gov/historicaldocuments/frus1977-80v02), however, I received no hits. The phrase does, however, appear in a number of headers in the volume – e.g., Docs 4 and 16.

Is it possible that the search queries don’t search on headers, or that they search them differently? I will leave this puzzle in your capable hands…

The exact URL for his search that returned zero hits is:

https://history.state.gov/search?q=%22president%27s+assistant+for+national+security+affairs%22&volume-id=frus1977-80v02

However, I found that if I changed the apostrophe in "president's" from straight (') to curly (’), the expected hits suddenly appeared:

https://history.state.gov/search?q=%22president%E2%80%99s+assistant+for+national+security+affairs%22&volume-id=frus1977-80v02&within=documents&sort-by=relevance

Thus, it appears that our search engine treats the curly quote as a literal character - like a letter in a word - rather than as punctuation that should be dropped. We need to get the search engine to treat curly quotes as straight quotes.
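To make the idea concrete, here is a minimal sketch of the normalization in Python. It is not the site's actual code; `normalize_quotes` is a hypothetical helper, and the assumption is that the same mapping would be applied both to the text at index time and to the user's query, so that either form of the apostrophe matches.

```python
# Hypothetical helper (not part of hsg-shell): map typographic quotes
# to their ASCII equivalents before analysis.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(_QUOTE_MAP)


# The curly-apostrophe form of the phrase from the search above.
query = "\u201cpresident\u2019s assistant for national security affairs\u201d"
print(normalize_quotes(query))
```

In the real index this would more likely be configured in the Lucene analyzer than done in application code; the sketch only shows the mapping itself.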

joewiz commented 5 years ago

From @joshbotts via the mailbox:

Wanted to flag this mailbox inquiry as an instance where a case-sensitive search capability would come in handy. Searching for "goa" returns several hundred hits, but most of them are for abbreviations for "Government of ..." rather than the geographical entity in South Asia.

--> User story: I want to search for "Goa" and exclude hits that are upper case ("GOA")
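One possible way to serve this user story without changing the index is to post-filter the case-insensitive hits. The sketch below assumes hits are available as plain strings; `filter_exact_case` and the two sample sentences are invented for illustration and do not come from the site.

```python
import re


def filter_exact_case(hits, term):
    """Keep only hits that contain `term` as a whole word with exactly this casing."""
    # re matches case-sensitively by default, so "Goa" will not match "GOA".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [hit for hit in hits if pattern.search(hit)]


# Made-up sample hits, standing in for a case-insensitive result list.
hits = [
    "The situation in Goa deteriorated.",
    "The GOA informed the Embassy.",
]
print(filter_exact_case(hits, "Goa"))
```

Post-filtering is simple but only works on hits already retrieved; a case-sensitive analyzer or a second, case-preserving index field would be the alternative.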

plutonik-a commented 5 years ago

This comment https://github.com/HistoryAtState/hsg-shell/issues/255#issuecomment-476926820 has already been filed as a separate issue -> https://github.com/HistoryAtState/hsg-shell/issues/289

plutonik-a commented 5 years ago

This issue has been split into several existing and new issues (each with a backlink to this one):

Therefore closing this parent issue.

marmoure commented 2 years ago

@joewiz When searching for the term s/s, we can escape the forward slash with a backslash (s\/s); the query then runs without triggering the Lucene error.
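The escaping described above could be generalized along these lines. This is a sketch in Python, not the site's code: `escape_lucene` is a hypothetical helper, and the character list is the set of special characters documented for Lucene's classic query parser, where a forward slash delimits a regular-expression query.

```python
import re

# Characters that the Lucene classic query parser treats as syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_lucene(term: str) -> str:
    """Prefix every Lucene special character in `term` with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


print(escape_lucene("s/s"))
```

Note that this escapes double quotes and wildcards too, so it only suits terms meant to be matched literally, not queries where the user intends phrase or wildcard syntax.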