Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License

Pluralization of words in multi-word searches will have competing results #350

Open EvilDrPurple opened 3 months ago

EvilDrPurple commented 3 months ago

Context

Multi-word searches that pluralize words other than the last word in the search will sometimes produce competing results between the unaltered search and the de-pluralized search, meaning some results will not be displayed. The de-pluralized search runs the search a second time with the words made singular, to try to find more results. For example, searching "uses of force in madison" returns one result:

[!NOTE] This search is currently only possible in v2 while the fixes to quicksearch casing are pending for v1, but it will be possible in v1 once those are merged. At present, searching these terms in v1 will return no results.

[image]

This looks good at first glance; however, the backend has actually found two results:

[('reckSg7rw3raeGvP2', 'Archived 21st Century Policing Quarterly Data', 'Summarized data about incident-based reporting, arrests, personnel demographics, traffic stops, and uses of force.\n', 'Annual & Monthly Reports', 'https://www.cityofmadison.com/police/data/archived-quarterly-data.cfm', '["PDF: Machine Created", "XLS"]', datetime.date(2016, 1, 1), None, True, 'Madison Police Department - WI', 'Madison', 'WI')]

[('recL8nSiM0HsIOaGN', 'Use of Force Policy', None, 'Policies & Contracts', 'https://public.powerdms.com/HSVPS/tree/documents/40', None, None, None, True, 'Huntsville Police Department - AL', 'Huntsville', 'AL')]

The first one is found by the unaltered search, while the second is found by the de-pluralized search. Since only one (the longest one) is kept, this means the second one is discarded in favor of the first. (The reason the second one comes up is that Huntsville is located in Madison County.)

This is a small-scale example of what may be happening in some other, larger searches. We can probably easily combine the two lists coming from the backend and remove duplicates instead of discarding one.
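A rough sketch of what combining the two lists could look like, not the project's actual code. The function name `merge_results` is hypothetical, the rows are shortened to two fields for readability, and it assumes the first field of each row is a unique record ID. The duplicate row in `depluralized` is added here only to show the de-duplication step.

```python
def merge_results(unaltered, depluralized):
    """Combine two result lists, keeping the first occurrence of each record ID."""
    seen = set()
    merged = []
    for row in list(unaltered) + list(depluralized):
        record_id = row[0]  # assumption: the first field is a unique record ID
        if record_id not in seen:
            seen.add(record_id)
            merged.append(row)
    return merged


# Shortened, illustrative rows (ID, name) based on the two results above.
unaltered = [("reckSg7rw3raeGvP2", "Archived 21st Century Policing Quarterly Data")]
depluralized = [
    ("recL8nSiM0HsIOaGN", "Use of Force Policy"),
    # Illustrative duplicate, to show that repeated records are dropped.
    ("reckSg7rw3raeGvP2", "Archived 21st Century Policing Quarterly Data"),
]

merged = merge_results(unaltered, depluralized)
print([row[0] for row in merged])
```

Results from the unaltered search come first, so exact matches would stay at the top of the combined list.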

Requirements

Open questions

josh-chamberlain commented 3 months ago

This is good to know about but tricky to fix. In v2, quick search (and searching on strings in general) will have greatly reduced impact on the results people are able to get. So that's good, and minimizes the impact of details like this.

> Since only one (the longest one) is kept, this means the second one is discarded in favor of the first.

Can you say what you mean by this?

EvilDrPurple commented 3 months ago

> Can you say what you mean by this?

@josh-chamberlain In the code currently, we perform a search of the dataset twice: the first time using the search terms exactly as written, and the second after "depluralizing" the search terms. This means two different sets of results are returned, but only the largest one is selected for display, while the smaller one is discarded. Hope that clears it up a bit.
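To illustrate, here is a simplified stand-in for that selection step, not the actual backend code. The name `select_results` is made up, the result sets are reduced to record IDs, and it assumes that a tie goes to the unaltered search (which matches the Madison example, where both sets have one result and the first is kept).

```python
def select_results(unaltered, depluralized):
    """Keep only the larger result list (hypothetical stand-in for the current behavior)."""
    # Assumption: on a tie, the unaltered results win.
    if len(depluralized) > len(unaltered):
        return depluralized
    return unaltered


# Record IDs from the "uses of force in madison" example.
unaltered = ["reckSg7rw3raeGvP2"]
depluralized = ["recL8nSiM0HsIOaGN"]

shown = select_results(unaltered, depluralized)
dropped = [rid for rid in unaltered + depluralized if rid not in shown]
print(shown)    # the set that gets displayed
print(dropped)  # results the user never sees
```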

josh-chamberlain commented 2 months ago

@EvilDrPurple oh, I see—I was overthinking it. Yeah, we should probably combine the results and show them all.