Innovate-Inc / EDG_metadata

EDG metadata on staging created for Innovate-Inc

Enable fuzzy search as default search mode #69

Closed: torrin47 closed this issue 7 years ago

torrin47 commented 7 years ago

Below is the email chain for context. Asking Esri for their thoughts before we get started on this. Will work to formulate more specific requirements.

From: Greene, Ana
Sent: Wednesday, February 22, 2017 8:59 AM
To: Hultgren, Torrin <Hultgren.Torrin@epa.gov>
Cc: Pierson, Suzanne <Pierson.Suzanne@epa.gov>; Harness, Catherine <Harness.Catherine@epa.gov>; Suma Malothu <smalothu@innovateteam.com>
Subject: RE: Full text search thoughts

Hi guys, Did I ever respond to this? Just catching up…only 2 weeks behind on email…

I totally agree that the wildcard and fuzzy searches should be the default. And I like the idea of an advanced search dialog. I’d like to go ahead and put all of this on our list of near-term development projects.

Thanks,

Ana Greene, M.S., PMP
Environmental Dataset Gateway (EDG) Program Manager
Office of Environmental Information (OEI)
Office of Information Management (OIM)
U.S. Environmental Protection Agency
(o): 202-566-2132
(c): 571-232-7860
Greene.Ana@epa.gov
https://edg.epa.gov/

From: Hultgren, Torrin
Sent: Tuesday, February 07, 2017 7:26 PM
To: Greene, Ana <Greene.ana@epa.gov>
Cc: Pierson, Suzanne <Pierson.Suzanne@epa.gov>; Harness, Catherine <Harness.Catherine@epa.gov>; Suma Malothu <smalothu@innovateteam.com>
Subject: Full text search thoughts

Hi Ana,

I believe I’ve figured out the source of our continuing confusion about full text search. It was legitimately disabled years ago, but has been working for some time, yet perhaps not in the way we might expect, so I think there’s still some room for improvement, or at least adjustment. I think a lot of our confusion revolves around partial search terms and whether or not they’re considered a match. I think we can all remember a time when we used to have to be very careful about our search terms, and we couldn’t assume that search engines would appropriately match partial words or misspellings, yet these days we take it for granted. Lucene is quite capable of handling any match type we want it to, but the default is the old strict way. If we do a search for the first part of your email address, by default it will come up blank, even though there are records containing your email address:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=greene.ana

EDG supports “advanced Lucene syntax”, which is documented in the help for anyone who chooses to read it. With it, you can apply a wildcard to your search, which just means that indexed terms that aren’t exact matches but contain the string are returned:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=*greene.ana*

That gives us all 6 records that contain your email address. In theory this slows performance, but we’d need orders of magnitude more records in our index before we’d notice any difference. There’s a last option that’s kind of fun, though it doesn’t seem to work with the direct link, so you’ll have to try it manually. If you do a search for greene.ana~ it will conduct a “fuzzy search”, which includes “misspellings” or words that are very similar; it should return a bunch of records with “Greenspace” in the title.
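To make the three match types described above concrete, here is a minimal Python sketch. It is not Lucene or EDG code: the term list is hypothetical, the wildcard match uses Python's `fnmatch`, and the fuzzy match uses a simple edit-distance similarity with an assumed 0.5 cutoff, which only approximates how a real fuzzy query scores terms.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms; a real index would hold whatever the analyzer emits.
INDEXED_TERMS = ["greene.ana@epa.gov", "greenspace", "gateway"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def exact_match(query, terms):
    """Strict matching: the query must equal an indexed term."""
    return [t for t in terms if t == query]


def wildcard_match(pattern, terms):
    """Wildcard matching: '*' stands for any run of characters."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def fuzzy_match(query, terms, min_similarity=0.5):
    """Fuzzy matching: keep terms whose edit-distance similarity is high enough.

    The similarity formula and the 0.5 cutoff are assumptions for illustration.
    """
    hits = []
    for t in terms:
        shorter = min(len(query), len(t))
        if shorter == 0:
            continue
        similarity = 1.0 - edit_distance(query, t) / shorter
        if similarity >= min_similarity:
            hits.append(t)
    return hits


print(exact_match("greene.ana", INDEXED_TERMS))
print(wildcard_match("*greene.ana*", INDEXED_TERMS))
print(fuzzy_match("greene.ana", INDEXED_TERMS))
```

In this toy setup the strict search finds nothing, the wildcard search finds the email address, and the fuzzy search finds "greenspace" but not the email address, which mirrors the behavior described in the email.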

I’m not sure about you, but I think my own expectation these days is that wildcards and fuzzy searches would be the default: I’d prefer a search to return too many results that I could filter through or refine than too few. But that may also be because of an assumption that the search engine would do a good job of ranking/sorting those results so the most relevant ones would appear first, and I don’t know how valid an assumption that is with the EDG. I think we could figure out how to adjust the scoring/ranking algorithm under the hood of the EDG, but I’m not at all sure how we’d measure whether our tweaks were making search results more or less relevant. And if we were to make fuzzy searches the default, I wonder how we’d allow someone to opt out if they wanted a more strict match? Perhaps we could show an “advanced search” dialog if they wished:

http://www.lucenetutorial.com/lucene-query-builder.html
https://www.google.com/advanced_search
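One possible way to make wildcard and fuzzy matching the default while still allowing an opt-out is to rewrite the search text before it reaches the query parser. The sketch below is only an illustration of that idea: `expand_query` and its `strict` flag are hypothetical names, not part of EDG or Lucene, and it assumes the query parser accepts leading wildcards, as the `*greene.ana*` example suggests. Terms that already use Lucene operators, quoted phrases, and the boolean keywords are passed through untouched, so a user who wants a strict match can quote the term or set the flag.

```python
import re

# A token is either a double-quoted phrase or a run of non-whitespace characters.
_TOKEN = re.compile(r'"[^"]*"|\S+')

# Characters that signal the user is already writing Lucene query syntax.
_SYNTAX_CHARS = set('*?~":()[]{}^\\')
_BOOLEAN_WORDS = {"AND", "OR", "NOT"}


def expand_query(text: str, strict: bool = False) -> str:
    """Rewrite plain terms as (wildcard OR fuzzy); leave explicit syntax alone.

    Hypothetical helper for illustration; strict=True is the opt-out.
    """
    if strict:
        return text.strip()
    rewritten = []
    for token in _TOKEN.findall(text):
        uses_syntax = (
            token in _BOOLEAN_WORDS
            or token[0] in "+-!"
            or any(ch in _SYNTAX_CHARS for ch in token)
        )
        if uses_syntax:
            rewritten.append(token)
        else:
            rewritten.append(f"(*{token}* OR {token}~)")
    return " ".join(rewritten)


print(expand_query("greene.ana"))
print(expand_query('"water quality" AND rivers'))
print(expand_query("greene.ana", strict=True))
```

An "advanced search" dialog could then simply toggle the `strict` flag, or build the quoted and operator-based forms on the user's behalf.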

Anyway, curious to know your thoughts. Definitely been on the brain today.

Torrin Hultgren
EPA National Geospatial Support Team
Innovate!, Inc. | hultgren.torrin@epa.gov | 703-922-9090 x737

torrin47 commented 7 years ago

This issue was moved to USEPA/EPA_Environmental_Dataset_Gateway#4