v2: main page/search - Githubissues

josh-chamberlain commented 4 months ago

part of #248

see wireframe

Front end

[x] Search/main page replaces data-sources.pdap.io and the gold box on pdap.io
- search options are now a bit refined; coarse selection of record types instead of an entry field
[x] use typeahead to help disambiguate place names. Treating places as objects people can predictably search, not strings we have to try to match.
- we can source the typeahead list from our states and counties table, with locality from the agencies table.
- ideally this is one magical search box which could help people select a state, county, or locality
- trigger after 2 characters (https://en.wikipedia.org/wiki/Ai,_Ohio)
- debouncing as discussed below
- order:
  - matches that start with the string, followed by contains (if the endpoint supports that)
  - e.g. search for Al yields Alabama before Topal County
  - alphabetically
  - state, county, locality (if any share the same name)
- if there are no typeahead results: the option should be No results found. Search for "string"? and load quick search results

API changes

[x] #345
- rate limiting as discussed below
[x] #352
[x] #212

mbodeantor commented 4 months ago

This will be a simplified version of the current quick_search_query. All the potential fields for trying to match a search term in the where clause will be replaced by record_type = '{coarse_record_type}'

maxachis commented 3 months ago

This will be a simplified version of the current quick_search_query. All the potential fields for trying to match a search term in the where clause will be replaced by record_type = '{coarse_record_type}'

@josh-chamberlain @mbodeantor So this will be replacing the quick search and associated API endpoint?

Additionally, will users only be able to search for one coarse record type at a time, or multiple?

mbodeantor commented 3 months ago

@maxachis In order to maintain backwards compatibility, I would suggest a new endpoint with similar functionality

josh-chamberlain commented 3 months ago

@maxachis let's let them multi-select coarse record types, why not? Even if the UI doesn't support it. Agreed about keeping the quick search endpoint—we are likely still going to have that option

maxachis commented 3 months ago

@josh-chamberlain @mbodeantor PR has been created, incorporating both the API changes described here and #258 (which as I indicated there, is a very trivial add and is quick to reverse if nonfunctional). Next questions:

Should I also develop the frontend component as well, or have someone else do that part (it will take me time to figure out how the frontend works)
There is a quick_search_query_logs table. Should there be a search_query_logs table as well, or a table that consolidates both Quick Search and Search?
How to test the api call is working as expected? I know it was discussed that there were ways to point the api towards a dev environment, but I don't know how to do that.

josh-chamberlain commented 3 months ago

@maxachis

Joshua G is the front end lead, so I would like to let him do that part.
I don't have a strong opinion. Adding to the same table is simple in a way, but we should keep track of which endpoint they're hitting so we know the difference between quick and other searches.
I know about this in general terms, but can't give you super specific instructions about routing

mbodeantor commented 3 months ago

@maxachis Any code you merge into the dev branch will get deployed in the dev instance of the API on DigitalOcean. It uses the same database, but is otherwise separate from the live prod app

maxachis commented 3 months ago

@mbodeantor Once I get further along in setting up the dev database, it may be prudent to switch the dev instance to point to the dev database. That probably also encourages us to copy the data as well as the schema to the dev database, to better emulate it.

joshuagraber commented 2 months ago

Hey @josh-chamberlain I'm pretty close to starting on the updated UI here. A couple questions:

Are the backend changes being handled by someone else, or should I plan to work on those here, too?

we should structure the URL like /search/{record_type}/{state}/{county}/{muni}

Would it make more sense to pass record_type, state, county, and municipality, as well as searchId, as data in the request body, rather than trying to add all of them to the URL string? Particularly since all of those fields could be null when we pass a saved search searchId, it seems less complex to use the request body here.

For instance, if we hit /search/ohio or /search/9b7ddde7-5871-4f45-a8a1-3dce52847736, the API would need to check the value against some static list of states to know whether state or searchId were being passed here, no?

josh-chamberlain commented 2 months ago

@joshuagraber nice!

yes, @maxachis is working on the back end, potentially with help from others. If you would like to be involved, that can work, as the front end is the primary consumer of the API. It's mostly happening on this fork: https://github.com/Police-Data-Accessibility-Project/data-sources-app-v2
Good point about the URL structure.
- For the record, the "saved searches" concept is not part of v2, so we don't need to account for complex searches with an ID attached. If we did, maybe we could use search/id/123?
- Ideally, we'd still surface the location in the URL, especially when someone's just viewing the page of a given geography. The URLs should ideally be simple and shareable, like search/oh/hamilton/cincinnati.
- Passing it via the request body makes sense to me conceptually, and it's OK if we end up going that way, or even mixing strategies: search/oh/hamilton/cincinnati?=arrests
- If you have a muni, the state and county are just inferred—so it's really just record type and location. It won't be possible to create other kinds of searches yet.

joshuagraber commented 2 months ago

Ah, gotcha @josh-chamberlain, that all makes sense to me.

If you have a muni, the state and county are just inferred—so it's really just record type and location. It won't be possible to create other kinds of searches yet.

I think we will have edge cases here. I just did a basic filtering of all data source records and it shows that we have duplicate municipality names in different states. But I suppose as long as we're okay with that, it's nbd.

As far as flexibility of API and sharability of links, the easiest path forward here, to me, is to use query params.

So that way/search?state=oh&county=franklin&muni=cincinnati&term=arrests, /search?muni=cincinnati&term=arrests, /search?term=arrests could all be valid requests, and we don't have to worry about ugliness like /search/undefined/undefined/cincinatti/arrests or w/e

josh-chamberlain commented 2 months ago

@joshuagraber yes, the duplicate muni names are why you would need the /state/county as a disambiguation.

To be clear: The goal of the typeahead is to treat places as objects and stop trying to match on strings. I would like to not allow people to create /undefined/undefined/cincinnati—we should only let people find a municipality object, which would require having the state and county. One problem we found with quick search was that people would search for a term like "Dallas" and get results from every "Dallas" in the country, rather than the typeahead helping them choose between Dallases.

That said, you quickly get into trouble if you let someone search for multiple record types (or, more accurate to the wireframe, record type categories). So far, I think the cleanest way is still /state/county/muni?category={record_type_category}.

maxachis commented 2 months ago

@joshuagraber @josh-chamberlain From the backend perspective, creating a typeahead endpoint will be fun! But it will be fun in the same way running the Pittsburgh Half Marathon was fun.

I've never developed a typeahead endpoint, which automatically adds to the fun.

Additional points of fun:

How do we handle possible misspellings? Cincinnati is a great example, because I would probably get it wrong on my first attempt. We may want to ignore that for now, as it adds a lot of additional dimensions.
When should typeahead kick in? My guess is at 3 characters in, at least, because otherwise suggestions will be quite broad, and I suspect the number of people searching for Ai, Ohio will be relatively slim.
We may need to implement rate limiting, because otherwise we might send a lot of requests.
Also, we probably want to implement debouncing, which will make sure we wait for some delay before we start sending requests.
I'll need to know the format in which typeahead suggestions should be returned, as well as the format which the final search result will be in.

joshuagraber commented 2 months ago

The goal of the typeahead is to treat places as objects and stop trying to match on strings.

@josh-chamberlain gotcha, that makes sense. Sorry, I think the design is a bit misleading to me in this respect, because it appears that the user can select state / county / municipality, then search.

That said, you quickly get into trouble if you let someone search for multiple record types

Hmm, I think I also misread the design here. So rather than checkboxes, these will need to be radio buttons with only one selectable, no? I'll need to make another update to design-system. Will create an issue for that shortly.

How do we handle possible misspellings?

@maxachis would this library be potentially helpful? I know this is more or less Google's strategy as well, although their algorithm is almost certainly proprietary.

Also, we probably want to implement debouncing, which will make sure we wait for some delay before we start sending requests.

Yes, debouncing is a best practice in client engineering for this type of work. I won't bother implementing it though, because we use lodash utils, and their debounce function is well tested and consistently maintained.

When should typeahead kick in?

I think we should just wait for the debounced function to fire a request, rather than set a hardcoded number of characters.

We may need to implement rate limiting, because otherwise we might send a lot of requests.

Unless this endpoint will be available publicly, we shouldn't strictly need rate limiting if we're debouncing properly on the client. But if you feel like doing it, I suppose it's a good performance guardrail anyways.

josh-chamberlain commented 2 months ago

@joshuagraber

the design is a bit misleading

Yes—the design has not yet been updated to reflect that we want to do a typeahead. Sorry about that. Vikram has responded and will be back online starting ~Monday to make edits

multiple record types

Sorry again for the confusion! I think the design is correct, and I was just thinking out loud—we should allow multi-select. Current favorite answer is a url path for /search/state/county/muni/?type={args} so that we have a nice solid URL for geography (most common search facet) with room for flexible args. Hopefully that works...I updated the issue.

misspellings

IMO, typeahead helps here—if someone gets a "no locations found" message, that's a clue that they mistyped. We can get better at this over time.

I think Joshua covered the rest. Thanks for helping think it through, y'all

joshuagraber commented 2 months ago

@josh-chamberlain Ah, gotcha, that makes more sense now, thank you for explaining.

Current favorite answer is a url path for /search/state/county/muni/?type={args} so that we have a nice solid URL for geography (most common search facet) with room for flexible args

I like that. We could just pass an array of record type values: /search/state/county/muni/?types={serialized record_type[]}

IMO, typeahead helps here—if someone gets a "no locations found" message, that's a clue that they mistyped. We can get better at this over time.

True enough, and if we collect a db of search queries by session, it could be an interesting problem to throw at ML.

maxachis commented 1 month ago

Putting together potentially useful articles on the matter:

Implementing Auto Complete with Postgres and Python

maxachis commented 1 month ago

@josh-chamberlain Regarding pulling locality from the agencies table, what column would that be? municipality? name?

josh-chamberlain commented 1 month ago

@maxachis yep, municipality is what we currently call that. switching to locality is generally more inclusive; for now, we can have a front end and back end terminology adjustment, which is not the end of the world, but creates tech/information debt we should still fix: https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/332

maxachis commented 1 month ago

@joshuagraber @josh-chamberlain Just to confirm -- debouncing is done on the frontend through the Iodash library that Joshua mentioned, correct? So no need to do that on the backend?

joshuagraber commented 1 month ago

@maxachis @josh-chamberlain Yes, we'll debounce client-side, but IMO it might not hurt to add some rate limiting on the API side as well (for all endpoints, really) if it's not too big of a lift, just as a protective measure against a surprise hosting bill.

josh-chamberlain commented 1 month ago

rate limiting: #353

Police-Data-Accessibility-Project / data-sources-app

v2: main page/search #249

Front end

API changes