Main search endpoint - Githubissues

maxachis commented 1 month ago

Context

Part of #249

Requirements

[x] accepts a coarse record type and geographic location
[x] we should structure the URL like /search/{state}/{county}/{locality}?category={args}
- for any locality, state and county are inferred; for county, state is inferred and locality is blank; there is no way to search for just a string on locality
- the design lets people pick a category; if they pick "all" we don't need to attach args
- for "all" record types,
[x] expose requests to the front end via API endpoint

maxachis commented 1 month ago

Database Design:

Run a one time script to remove all record_types listed as test from prod database.
Create record_type enum, listing all the permissible values.
Convert the record_type column in the data_sources table to utilize this.

Query Design:

This will either be a dynamic query or several queries conditionally chosen based on which values are present. The latter is faster but more of a pain to develop, while the former is easier to develop but would be slower (hopefully not by much).

    SELECT
        data_sources.airtable_uid,
        data_sources.name AS data_source_name,
        data_sources.description,
        data_sources.record_type,
        data_sources.source_url,
        data_sources.record_format,
        data_sources.coverage_start,
        data_sources.coverage_end,
        data_sources.agency_supplied,
        agencies.name AS agency_name,
        agencies.municipality,
        agencies.state_iso
    FROM
        agency_source_link
    INNER JOIN
        data_sources ON agency_source_link.airtable_uid = data_sources.airtable_uid
    INNER JOIN
        agencies ON agency_source_link.agency_described_linked_uid = agencies.airtable_uid
    INNER JOIN
        state_names ON agencies.state_iso = state_names.state_iso
    WHERE
        state_names.state_name = '{state_name}'
        AND agencies.county_name = '{county_name}' -- if county included. Otherwise this is excluded
        AND agencies.municipality = '{municipality}' -- if municipality included. Otherwise, this is excluded.
        AND data_sources.record_type = '{record_type}' -- if specific record type. Otherwise this is excluded.
        AND data_sources.approval_status = 'approved'
        AND data_sources.url_status not in ('broken', 'none found')

Endpoint Design

Because blank values can be a pain, we could instead use an endpoint structured like /search?state=state&county=county&locality=locality&category=args where different parameters can be excluded as need be. We can return an error if certain critical values are excluded, but we'll otherwise infer a blank value to mean all. Error cases include:

State is empty
County is empty when Locality is not.

Here's a ChatGPT-aided prototype of what the request parsing logic might look like.

@app.route('/search')
def search():
    state = request.args.get('state')
    county = request.args.get('county')
    locality = request.args.get('locality')
    category = request.args.get('category')

Backend Logic

The backend logic will be fairly straightforward -- it will call the query, structure the results in a json format comparable to that of quick-search, and return the results.

maxachis commented 1 month ago

@josh-chamberlain

Lingering Questions

"Can be inferred"

You state that county/state can be inferred if other options are selected. I interpret this to mean that we derive the county/state based on the locality.

However, if my understanding is correct, the same locality may appear in multiple counties/states, and the same county may appear in multiple states.

By that logic, inferring county/state may lead to confusion and errors. Are we sure we want to infer like that?

Are only typeahead suggestions valid?

Each typeahead suggestion, as currently designed in #345 , contains information on the state, county, and locality in its response. Are users expected to select one of the available typeahead suggestions when performing a search, i.e. they can't input their own string and search on that?

josh-chamberlain commented 1 month ago

@maxachis all sounds ok! Couple notes:

query design

Note that record_type is really record_type_category or record_category, as they're grouped here. I love the idea of having an enum defined for record types and their categories, as a source of truth for all our stuff which references them.

can be inferred

I think my phrasing is just out of date—the typeahead exists so that people can only select a non-ambiguous geographical object. If someone types "Paris" we'll show them "Paris, OH" and "Paris, TX" and every other Paris. When they select it, we'll know which one they want. So, we're not inferring it so much as the user is selecting it from a narrowed-down list of options.

only typeahead suggestions

That's right—we don't use a string for this field.

maxachis commented 1 month ago

query design

Note that record_type is really record_type_category or record_category, as they're grouped here. I love the idea of having an enum defined for record types and their categories, as a source of truth for all our stuff which references them.

@josh-chamberlain It might make sense to go one further and have tables within the database which lists the relationship between both. Here's a prototype:

CREATE TABLE record_categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    description TEXT
);

CREATE TABLE record_types (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    category_id INT NOT NULL,
    description TEXT,
    FOREIGN KEY (category_id) REFERENCES record_categories(id)
);

The data_sources table can be adjusted to be a foreign key to record_types. Then we'd incorporate the categories to organize all the record types when performing a search.

maxachis commented 1 month ago

@josh-chamberlain Note that in my current design, I'm creating a record_type_id column in the data_sources table which references the record_types table. The current data_sources.record_types column would remain for the moment, to ensure backwards compatibility with the airtable data source mirror and other code which references the record_type column. But in this design, I would want to eventually remove the record_type column and modify the related code.

josh-chamberlain commented 1 month ago

@maxachis yes, it's a bit messy to allow for backwards compatibility, but I like the approach.

maxachis commented 3 weeks ago

Per my discussions with @joshuagraber , one minor addition to the search functionality is that the search parameters should be case-insensitive. This is a trivial adjustment, but would make it more friendly to third-party users.

Previously, I had been expecting it to only be called by the app directly, so I thought enforcing case was fine. But if we expect to have third party users do it as well, it makes more sense to allow for capitalization variation.

joshuagraber commented 3 weeks ago

@josh-chamberlain One other thing I was thinking about here: the QuickSearchForm will need to be updated in design-system and we'll need a coordinated release of pdap.io to match the updated search endpoint once we go to prod.

josh-chamberlain commented 3 weeks ago

@joshuagraber I think this is a totally new endpoint—the v1 quick search works differently. I was thinking we would just make a new component for the new search, and the other one could be deprecated by omission. Does that sound OK, am I missing something?

josh-chamberlain commented 3 weeks ago

I edited the issue to make it clear that we should also be able to accept string values like v1 search

maxachis commented 3 weeks ago

[ ] as a fallback, can accept a record_type_string and location_string which are more permissive (should work like the v1 quicksearch endpoint)

this is to give us the No results found. Search for "string"? option in the typeahead

@josh-chamberlain This adds an interesting wrinkle, so I want to make sure I have this understood:

Does this No results found result pop up only when no typeahead suggestions are produced?
Are there two No results found results, one for location and one for record type? So a person could have no results found for the location but still be able to specify a record type?
Do we really need a record_type_string, since there are only a limited set of record types, and all sources are designed to be assigned to one of those sets?
Actually, if a result isn't shown by typeahead suggestions (which is itself more permissive), what are the odds it would show up in a more general search? How much more general is the general search, in our envisioning?

josh-chamberlain commented 2 weeks ago

yes!
it's really just for location. I think I suggested it for record type in the name of symmetry and future thinking; "basically just as much work to do both at once and not worry about it later". Just location is fine.
no, we don't.
this is designed to cover edges that I'd swear exist but at this point I'm struggling to find them. I remember the discussion, decision, and design of this feature...I will take another look in the morning and see if I can find the edges that would make this worthwhile but I'm leaning toward removing this requirement and keeping the search simpler/on geographic rails. we like to try to give people a friendly experience but I'm just not sure more complexity is the friendliest thing in this case!

josh-chamberlain commented 2 weeks ago

@maxachis yeah, so I can't find any reason why we should accept a fallback, so I'm going to do the "flop" part of the flip-flop now: this is no longer a requirement because we don't have any other properties to search with a string, aside from maybe agencies.name, which is a stretch. I adjusted the parent issue again.

Police-Data-Accessibility-Project / data-sources-app

Main search endpoint #352

Context

Requirements

Database Design:

Query Design:

Endpoint Design

Backend Logic

Lingering Questions

"Can be inferred"

Are only typeahead suggestions valid?