USEPA / EPA_Environmental_Dataset_Gateway

U.S. EPA’s Metadata Catalog
https://edg.epa.gov

Enable fuzzy search as default search mode #4

Open torrin47 opened 7 years ago

torrin47 commented 7 years ago

From @torrin47 on February 27, 2017 19:18

Below is the email chain for context. Asking Esri for their thoughts before we get started on this. Will work to formulate more specific requirements.

From: Greene, Ana
Sent: Wednesday, February 22, 2017 8:59 AM
To: Hultgren, Torrin (Hultgren.Torrin@epa.gov)
Cc: Pierson, Suzanne (Pierson.Suzanne@epa.gov); Harness, Catherine (Harness.Catherine@epa.gov); Suma Malothu (smalothu@innovateteam.com)
Subject: RE: Full text search thoughts

Hi guys,

Did I ever respond to this? Just catching up… only 2 weeks behind on email…

I totally agree that the wildcard and fuzzy searches should be the default, and I like the advanced search dialog. I’d like to go ahead and put all of this on our list of near-term development projects.

Thanks,

Ana Greene, M.S., PMP
Environmental Dataset Gateway (EDG) Program Manager
Office of Environmental Information (OEI)
Office of Information Management (OIM)
U.S. Environmental Protection Agency
(o): 202-566-2132
(c): 571-232-7860
Greene.Ana@epa.gov
https://edg.epa.gov/

From: Hultgren, Torrin
Sent: Tuesday, February 07, 2017 7:26 PM
To: Greene, Ana (Greene.ana@epa.gov)
Cc: Pierson, Suzanne (Pierson.Suzanne@epa.gov); Harness, Catherine (Harness.Catherine@epa.gov); Suma Malothu (smalothu@innovateteam.com)
Subject: Full text search thoughts

Hi Ana,

I believe I’ve figured out the source of our continuing confusion about full text search. It was legitimately disabled years ago, but has been working for some time, yet perhaps not in the way we might expect, so I think there’s still some room for improvement, or at least adjustment. I think a lot of our confusion revolves around partial search terms and whether or not they’re considered a match. I think we can all remember a time when we used to have to be very careful about our search terms, and we couldn’t assume that search engines would appropriately match partial words or misspellings, yet these days we take it for granted. Lucene is quite capable of handling any match type we want it to, but the default is the old strict way. If we do a search for the first part of your email address, by default it will come up blank, even though there are records containing your email address:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=greene.ana

EDG supports “advanced Lucene syntax” (documented in the help, if anyone chose to read it), so a user could apply a wildcard to their search, which means that indexed terms that contain the string are returned even when they aren’t exact matches:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=*greene.ana*

Which gives us all 6 records that contain your email address. In theory this slows performance, but we’d need orders of magnitude more records in our index before we’d notice any difference. There’s a last option that’s kind of fun, though it doesn’t seem to work with the direct link, so you’ll have to try it manually. If you do a search for greene.ana~ it will conduct a “fuzzy search”, where it will include “misspellings” or words that are very similar. It should return a bunch of records with “Greenspace” in the title.
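To illustrate why a fuzzy search for greene.ana can pull in “Greenspace”, here is a minimal sketch of edit-distance scoring. It assumes the index scores fuzzy matches roughly the way older Lucene releases did (1 minus the edit distance divided by the length of the shorter term, accepted at a default threshold of 0.5) and that greene.ana is treated as a single term. The thread does not say which Lucene version or analyzer EDG uses, so the formula, the threshold, and the function names below are illustrative assumptions rather than EDG's actual behavior.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def fuzzy_similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 - distance / length of the shorter term."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return 1.0 - levenshtein(a, b) / shorter


# Hypothetical threshold, modeled on the older Lucene default of 0.5.
MIN_SIMILARITY = 0.5

distance = levenshtein("greene.ana", "greenspace")
similarity = fuzzy_similarity("greene.ana", "greenspace")
print(distance, round(similarity, 2), similarity >= MIN_SIMILARITY)
```

Under these assumptions the two terms differ by four substitutions over ten characters, which is close enough to count as a match.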

I’m not sure about you, but I think my own expectation these days is that wildcards and fuzzy searches would be the default: I’d prefer a search to return too many results that I could filter through or refine than too few. But that may also be because of an assumption that the search engine would do a good job of ranking/sorting those results so the most relevant ones would appear first, and I don’t know how valid an assumption that is with the EDG. I think we could figure out how to adjust the scoring/ranking algorithm under the hood of the EDG, but I’m not at all sure how we’d measure whether our tweaks were making search results more or less relevant. And if we were to make fuzzy searches the default, I wonder how we’d allow someone to opt out if they wanted a more strict match? Perhaps we could show an “advanced search” dialog if they wished:

http://www.lucenetutorial.com/lucene-query-builder.html https://www.google.com/advanced_search

Anyway, curious to know your thoughts. Definitely been on the brain today.

Torrin Hultgren
EPA National Geospatial Support Team
Innovate!, Inc. | hultgren.torrin@epa.gov | 703-922-9090 x737

Copied from original issue: Innovate-Inc/EDGmetadata#69

torrin47 commented 5 years ago

Created ticket to solicit input from Esri: https://github.com/Esri/geoportal-server/issues/308

aergul commented 5 years ago

@torrin47 we could include a dropdown, for example, that provides different search mechanisms (default/strict?, fuzzy, wildcard) and morph the search term(s) without the user having to insert special characters.

Having said that, I experimented against EDG with those options and, frankly, I wasn't impressed by fuzzy search at all. It didn't appear to work as advertised. Have you tried fuzzy search lately?

Wildcard searches seemed to catch additional records but it wasn't always possible for me to rationalize why certain wildcard searches returned the records they did. All in all, it seems to help but sometimes in strange ways.
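The dropdown idea above could be sketched roughly as follows: the selected mode rewrites each search term, and the result is percent-encoded into the REST search URL quoted earlier in this issue. The function names and mode labels are hypothetical, and this sketch only strips `*` and `~` that a user may already have typed. It does not escape other Lucene operators, and it assumes leading wildcards are accepted (the `*greene.ana*` example earlier in the issue suggests they are).

```python
from urllib.parse import urlencode

# Endpoint copied from the search links quoted earlier in this issue.
EDG_FIND_URL = "https://edg.epa.gov/metadata/rest/find/document"

SEARCH_MODES = ("strict", "fuzzy", "wildcard")  # hypothetical dropdown values


def morph_query(text: str, mode: str = "strict") -> str:
    """Rewrite each whitespace-separated term for the selected search mode."""
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown search mode: {mode}")
    morphed = []
    for term in text.split():
        bare = term.strip("*~")  # drop operators the user already typed
        if not bare:
            continue
        if mode == "fuzzy":
            morphed.append(bare + "~")
        elif mode == "wildcard":
            morphed.append("*" + bare + "*")
        else:
            morphed.append(bare)
    return " ".join(morphed)


def build_search_url(text: str, mode: str = "strict") -> str:
    """Return the search-page URL with the morphed query percent-encoded."""
    params = {"f": "searchpage", "searchText": morph_query(text, mode)}
    return EDG_FIND_URL + "?" + urlencode(params)


for mode in SEARCH_MODES:
    print(mode, build_search_url("greene.ana", mode))
```

Building the URL through `urlencode` keeps the special characters from being mangled in transit, which may be worth ruling out when comparing how the different pages handle the tilde.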

torrin47 commented 5 years ago

I'm not sure what's up, but the only way I was able to get fuzzy search to work was from the main page (https://edg.epa.gov/metadata/catalog/main/home.page). If I try the same search on the advanced search page (https://edg.epa.gov/metadata/catalog/search/search.page), the tilde seems to be ignored. So I guess step 1 on this ticket would be debugging the current search page.

I like the suggestion of a dropdown.

I agree it's really tough to understand search results without some sort of highlighting indicating the matched search term (something we discussed in the context of using expanded search with related terms). Highlighting would be awesome, but probably too heavy a lift to justify at this point in the GeoPortal Server lifecycle.

My experience has been that strict match searches tend to generate more discouragement among users (when no results are shown) than expanded searches. Users would prefer to see many results and opt to filter from there than to see too few results and wonder why.

aergul commented 5 years ago

Very odd, but that is indeed correct: only the main page executes fuzzy search correctly. I had only used the search page when investigating this. I have seen some signs of post-submission modification of the search; maybe it gets in the way of fuzzy search. Will look closer...