Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License

saved searches #21

Open josh-chamberlain opened 1 year ago

josh-chamberlain commented 1 year ago

Saved searches are the building blocks of a lot of things! For example, taking actions when the results of a saved search change, or creating a shared source of truth.

This doesn't need to come until we do #14

Requirements

maxachis commented 5 months ago
  • [ ] this can be one table, with properties like id, user_id, timestamp, and query
  • [ ] we should keep track of how many saved searches a given source is part of; sounds like an expensive query?
  • [ ] saved searches must have permalinks, e.g. https://pdap.io/search/UID123

@josh-chamberlain I would propose two modifications:

  1. Add an additional column called permalink for storing the permanent link.
  2. Add an additional table called saved_search_sources which tracks which data sources are associated with each saved search.

One additional question: Because our queries currently include a "What are you searching for" and "From where" option, how should we format the query? Does that all get condensed into a single string, or should we break the query up into query_what and query_where columns? I think the latter will make it easier to search, but there may be components I'm not taking into account.

maxachis commented 5 months ago
  • [ ] we need an endpoint to get saved and recent searches by user

@josh-chamberlain I want to make sure I understand what is meant by "saved and recent", since as previously mentioned, all user searches are considered saved. When we say recent, are we talking about a time frame, or the last x searches? Are there any other instances, aside from when a logged-in user performs a search, where a search is considered saved?

josh-chamberlain commented 5 months ago

@maxachis

  1. the permalink is just the UID, so we should be able to generate it on the fly. Any reason we should store it?
  2. a saved search ≠ its results; if you visit a saved search in a year, chances are the results will be different.

all searches are saved ("logged" in a way), but when we talk about a user's saved searches, we're talking about searches they have manually saved.

"recent" can mean the last, say, 10. we can change our definition of "recent" over time if we want.

the format of the saved query is up for debate, I think—and should be somewhat future-facing. see https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/14 for a more complete idea of all the things filters could contain. rather than a column for each search facet, maybe the query property should just be a JSON object which stores key: value pairs for different properties.
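A minimal sketch of the single-table idea with a JSON `query` property, assuming SQLite purely so the example runs standalone. The table name `saved_searches`, the `what`/`where` keys, and the helper functions are hypothetical; only the column names id, user_id, timestamp, and query come from this thread.

```python
import json
import sqlite3

# Hypothetical schema: column names follow the properties listed in this
# thread (id, user_id, timestamp, query); everything else is an assumption.
SCHEMA = """
CREATE TABLE saved_searches (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id   INTEGER NOT NULL,
    timestamp TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    query     TEXT    NOT NULL  -- JSON object of key: value search facets
);
"""


def save_search(conn, user_id, query):
    """Store the query as a JSON string and return the new row id."""
    cur = conn.execute(
        "INSERT INTO saved_searches (user_id, query) VALUES (?, ?)",
        (user_id, json.dumps(query, sort_keys=True)),
    )
    conn.commit()
    return cur.lastrowid


def recent_searches(conn, user_id, limit=10):
    """Return the user's latest searches, newest first.

    Ordering uses the serial id because CURRENT_TIMESTAMP only has
    one-second resolution and can tie.
    """
    rows = conn.execute(
        "SELECT id, query FROM saved_searches WHERE user_id = ? "
        "ORDER BY id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [(row_id, json.loads(raw)) for row_id, raw in rows]


def permalink(search_id, base="https://pdap.io/search/"):
    """Build the permalink on the fly from the id; nothing extra is stored."""
    return f"{base}{search_id}"


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = save_search(conn, 1, {"what": "arrest records", "where": "Pittsburgh"})
save_search(conn, 1, {"what": "use of force", "where": "Cincinnati"})
```

The `limit=10` default mirrors the "last, say, 10" definition of recent and is easy to change later. This sketch does not distinguish logged searches from manually saved ones; that would need an extra flag or a separate table.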

maxachis commented 5 months ago

@josh-chamberlain

  1. the permalink is just the UID, so we should be able to generate it on the fly. Any reason we should store it?

I think that depends on how the saved queries work. I figured the search UIDs would be used to perform a lookup of the search in a database, retrieve the search parameters, and rerun them. Otherwise, how are we matching the UIDs with the particular search? And from where will the UIDs be generated?

the format of the saved query is up for debate, I think—and should be somewhat future-facing. see https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/14 for a more complete idea of all the things filters could contain. rather than a column for each search facet, maybe the query property should just be a JSON object which stores key: value pairs for different properties.

A JSON object might be the better option if we're intending to simply retrieve the search query and re-execute, rather than perform analyses on the parameters within the query itself (for example, how many people save queries for "Pittsburgh" as opposed to "Cincinnati"). The main challenges would lie in ensuring data integrity and backwards compatibility -- if our search logic changes at some point, we'd want to know if the previously saved searches are still valid (and if not, what to do with them -- perhaps convert them to a new format).
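One possible way to handle the backwards-compatibility concern, sketched under the assumption that each stored JSON query carries a `version` key. The version numbers, the old single-string `q` format, and the migration function are all invented for illustration and are not part of the app.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(query):
    """Hypothetical change: v1 stored one 'q' string, v2 splits it into what/where."""
    what, _, where = query["q"].partition(" in ")
    return {"version": 2, "what": what, "where": where}


# One upgrade step per old version; each step must return the next version.
MIGRATIONS = {1: _v1_to_v2}


def load_saved_query(raw):
    """Parse a stored JSON query and upgrade it to CURRENT_VERSION."""
    query = json.loads(raw)
    version = query.get("version", 1)  # assumption: unversioned rows are v1
    if version > CURRENT_VERSION:
        raise ValueError(f"saved query is newer than this code: {version}")
    while version < CURRENT_VERSION:
        if version not in MIGRATIONS:
            raise ValueError(f"no migration from version {version}")
        query = MIGRATIONS[version](query)
        version = query["version"]
    return query
```

For example, `load_saved_query('{"q": "arrest records in Pittsburgh"}')` would be upgraded to the v2 shape before being re-executed, while a query that cannot be migrated raises an error instead of silently running with stale parameters.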

At some point in the future, MongoDB or similar NoSQL databases might be useful to consider, as those are designed for storing information with flexible schemas, such as JSON. That might be overkill at the moment, especially since our needs for what to do with the queries are minimal, but it may be worth keeping in the back of our minds.

maxachis commented 5 months ago
  1. a saved search ≠ its results; if you visit a saved search in a year, chances are the results will be different.

@josh-chamberlain In which case, I may need clarification on what is meant by

we should keep track of how many saved searches a given source is part of; sounds like an expensive query?

If I'm understanding this correctly, and we are looking to track how often a given data source is returned in queries, we could create a table which increments every time a data source is retrieved in a search/saved search.
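A rough sketch of such a counter table, again using SQLite only so it runs standalone. The table name `source_search_counts`, its columns, and the `record_results` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_search_counts ("
    " source_id TEXT PRIMARY KEY,"
    " times_returned INTEGER NOT NULL DEFAULT 0)"
)


def record_results(conn, source_ids):
    """Increment the counter of every data source returned by one search."""
    # Count each source at most once per search, keeping the original order.
    params = [(source_id,) for source_id in dict.fromkeys(source_ids)]
    # Create missing rows at zero, then bump every returned source by one.
    conn.executemany(
        "INSERT OR IGNORE INTO source_search_counts (source_id, times_returned) "
        "VALUES (?, 0)",
        params,
    )
    conn.executemany(
        "UPDATE source_search_counts SET times_returned = times_returned + 1 "
        "WHERE source_id = ?",
        params,
    )
    conn.commit()


def counts(conn):
    """Return the current counters as a dict of source_id -> times_returned."""
    return dict(
        conn.execute("SELECT source_id, times_returned FROM source_search_counts")
    )
```

Reading a counter is then a primary-key lookup, which avoids the expensive scan over all saved searches, at the cost of one extra write per returned source on every search.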

josh-chamberlain commented 5 months ago

I think that depends on how the saved queries work. I figured the search UIDs would be used to perform a lookup of the search in a database, retrieve the search parameters, and rerun them. Otherwise, how are we matching the UIDs with the particular search? And from where will the UIDs be generated?

yeah, I think we're saying the same thing—I meant we don't need to store the url if we just know to use that route, but I'm splitting hairs 🤷

A JSON object might be the better option if we're intending to simply retrieve the search query and re-execute, rather than perform analyses on the parameters within the query itself (for example, how many people save queries for "Pittsburgh" as opposed to "Cincinnati"). The main challenges would lie in ensuring data integrity and backwards compatibility -- if our search logic changes at some point, we'd want to know if the previously saved searches are still valid (and if not, what to do with them -- perhaps convert them to a new format).

yes, I think logging is a separate concern. Backwards compatibility is definitely a consideration, but if we're keeping logs, I think it's OK to change the format. I made a new issue.

At some point in the future, MongoDB or similar NoSQL databases might be useful to consider, as those are designed for storing information with flexible schemas, such as JSON. That might be overkill at the moment, especially since our needs for what to do with the queries are minimal, but it may be worth keeping in the back of our minds.

Absolutely—we could also use it to dump actual data if we stick some semi-predictable metadata to it and hook it up to elasticsearch.

If I'm understanding this correctly, and we are looking to track how often a given data source is returned in queries, we could create a table which increments every time a data source is retrieved in a search/saved search.

That's a good idea! I think this is solved by #270

maxachis commented 5 months ago

@josh-chamberlain So next question is related to the form of the UUID.

One thing I'll also need clarification on -- does a saved search, when accessed, merely provide a suite of results? Will anyone other than the user themselves know which saved searches are associated with them? Because that will affect how concerned we are about issues like a hacker trying to enumerate through all saved searches.

The way I see it, there are two options, and one option we definitely don't want to do.

🚫Serial IDs 🚫

Serial IDs can be guessed or enumerated through, meaning that an individual could use them to effectively find every single search in our database, or to identify a user associated with a suite of saved searches by cross-referencing with a number of saved searches clearly related to each other by the proximity of their serial ids. Bad.

Simple UUID

This is the easiest solution: We generate a random 128-bit UUID, for which the chance of ever producing the same value twice is negligible. This would look like 550e8400-e29b-41d4-a716-446655440000. Simple, but not exactly easy to share.

Bitly-style short ID

Generate a random short ID like 2KEOXNx. Much easier to share, but it has some issues:

  1. Risk of generating the same ID twice increases -- still generally quite small, but we have to figure out how to handle them if and when they occur, which would become an increasing issue as the number of saved searches increases.
  2. Increased risk of enumeration attacks: i.e., an enterprising hacker could try to generate all possible combinations to get access to all searches (or, alternatively, to overload the system by generating a bunch of saved searches at once). Can be averted with rate limiting and/or limiting the number of saved searches a user can have, but either way, more overhead.
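To make the collision point concrete, here is a sketch of short-ID generation with a retry on collision, plus a birthday-bound estimate of how likely a collision is. The 7-character length, the 62-symbol alphabet, and the in-memory set standing in for a unique database column are assumptions for illustration.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols


def new_short_id(existing, length=7, max_attempts=5):
    """Draw random IDs until one is unused; `existing` stands in for the DB."""
    for _ in range(max_attempts):
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)
            return candidate
    raise RuntimeError("could not find an unused short ID")


def collision_probability(n_ids, length=7):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * 62**length))."""
    space = len(ALPHABET) ** length
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))
```

With these assumptions, `collision_probability(1_000_000)` comes out to roughly 13%, so at that scale at least one retry should be expected at some point. In a real database the uniqueness check would more likely be a UNIQUE constraint with a retry on the failed insert than an in-memory set.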

josh-chamberlain commented 5 months ago

@maxachis serial IDs being guessable doesn't strike me as an issue—there's no secure information there. a search itself is just a query and datetime.

a saved search, when accessed, hits the API with the search parameters associated with the search and shows the results in the front end. Half the point of saved searches with a UID is that users can share them with each other.
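The flow described here (permalink, then stored parameters, then an API call) could be sketched as follows. The `SAVED` dictionary, the `what`/`where` keys, and `API_SEARCH_URL` are placeholders invented for the example; the real API route and parameter names are not specified in this thread.

```python
from urllib.parse import urlencode, urlparse

# Hypothetical stand-in for the saved-search table: id -> stored parameters.
SAVED = {123: {"what": "arrest records", "where": "Pittsburgh"}}

API_SEARCH_URL = "https://api.example.org/search"  # placeholder, not a real endpoint


def resolve_permalink(url, saved=SAVED):
    """Map a permalink such as https://pdap.io/search/123 to an API search URL."""
    path = urlparse(url).path.rstrip("/")
    prefix, _, raw_id = path.rpartition("/")
    if prefix != "/search" or not (raw_id.isascii() and raw_id.isdigit()):
        raise ValueError(f"not a saved-search permalink: {url}")
    params = saved.get(int(raw_id))
    if params is None:
        raise LookupError(f"no saved search with id {raw_id}")
    # Sort the parameters so the generated URL is stable across runs.
    return f"{API_SEARCH_URL}?{urlencode(sorted(params.items()))}"
```

Because the results are fetched again on every visit, two people opening the same permalink a year apart may see different data sources, which matches the point that a saved search is not its results.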

hey, check it out! I searched for data in pittsburgh and found some cool stuff: https://pdap.io/search/123

the database will store a user ID with each search (or a list of search IDs per user, which seems backward to me), so that when a user goes to find their recent searches, they will see the list. We can be selective about which endpoint reveals that info, but either way, you'd need to somehow know the identity of a user by their ID, which we are trying to protect anyway; a user shouldn't even know their own ID.

maxachis commented 5 months ago

@josh-chamberlain In which case, serial ids won't be a problem! I'll utilize serial ids as the solution.

josh-chamberlain commented 5 months ago

@maxachis as I thought might happen, there are some changes to the plan which will make saved searches unimportant until we build advanced search filtering (which is maybe the least launch-critical feature), because we are going to stick to a URL format for search. People will only need to save searches if they make a search which can't be expressed via the URL, which they won't be able to do.

maxachis commented 4 months ago

@josh-chamberlain In that case, I'll close the pull request I made for this just to avoid clutter. We can still use its material (if it's still relevant) when the time comes.