Open josh-chamberlain opened 1 year ago
- [ ] this can be one table, with properties like `id`, `user_id`, `timestamp`, and `query`
- [ ] we should keep track of how many saved searches a given source is part of; sounds like an expensive query?
- [ ] saved searches must have permalinks, e.g. https://pdap.io/search/UID123
@josh-chamberlain I would propose two modifications:
- `permalink`, for storing the permanent link.
- `saved_search_sources`, which tracks which data sources are associated with each saved search.

One additional question: Because our queries currently include a "What are you searching for" and "From where" option, how should we format the query? Does that all get condensed into a single string, or should we break the query up into `query_what` and `query_where` columns? I think the latter will make it easier to search, but there may be components I'm not taking into account.
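To make the proposal concrete, here is a minimal sketch of the two tables, using Python's built-in `sqlite3` as a stand-in for the real database. Everything beyond the names mentioned above (`id`, `user_id`, `timestamp`, `query_what`, `query_where`, `saved_search_sources`) is an assumption, including the `source_id` column, the index, and the helper function.

```python
import sqlite3

# Hypothetical schema: column types, source_id, and the index are assumptions.
SCHEMA = """
CREATE TABLE saved_searches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    query_what TEXT,
    query_where TEXT
);
CREATE TABLE saved_search_sources (
    saved_search_id INTEGER NOT NULL REFERENCES saved_searches(id),
    source_id TEXT NOT NULL,
    PRIMARY KEY (saved_search_id, source_id)
);
CREATE INDEX idx_saved_search_sources_source
    ON saved_search_sources (source_id);
"""


def count_searches_for_source(conn, source_id):
    """Return how many saved searches include the given data source."""
    row = conn.execute(
        "SELECT COUNT(*) FROM saved_search_sources WHERE source_id = ?",
        (source_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO saved_searches (user_id, query_what, query_where) "
    "VALUES (1, 'arrest records', 'Pittsburgh')"
)
conn.execute(
    "INSERT INTO saved_searches (user_id, query_what, query_where) "
    "VALUES (2, 'budgets', 'Pittsburgh')"
)
conn.executemany(
    "INSERT INTO saved_search_sources VALUES (?, ?)",
    [(1, "src-a"), (2, "src-a"), (2, "src-b")],
)
```

With an index on `source_id`, the "how many saved searches is this source part of" count becomes a lookup on the junction table rather than a scan of every saved search.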
- [ ] we need an endpoint to get saved and recent searches by user
@josh-chamberlain I want to make sure I understand what is meant by "saved and recent", since as previously mentioned, all user searches are considered saved. When we say recent, are we talking within a time frame, or the last x searches? Are there any other instances, aside from when a logged-in user performs a search, where a search is considered saved?
@maxachis
all searches are saved ("logged" in a way), but when we talk about a user's saved searches, we're talking about searches they have manually saved.
"recent" can mean the last, say, 10. we can change our definition of "recent" over time if we want.
the format of the saved `query` is up for debate, I think, and should be somewhat future-facing. see https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/14 for a more complete idea of all the things filters could contain. rather than a column for each search facet, maybe the `query` property should just be a JSON object which stores key: value pairs for different properties.
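A rough sketch of how this could fit together, assuming a single `searches` table where every search is logged and a hypothetical `is_saved` flag marks the ones a user saved manually. The table name, the flag, and the function names are illustrative only, and `sqlite3` stands in for the real database.

```python
import json
import sqlite3

RECENT_LIMIT = 10  # "recent" means the last 10 for now; easy to change later

conn = sqlite3.connect(":memory:")
# query holds a JSON object; is_saved is 1 only for manually saved searches.
conn.execute(
    """CREATE TABLE searches (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        timestamp TEXT NOT NULL,
        query TEXT NOT NULL,
        is_saved INTEGER NOT NULL DEFAULT 0
    )"""
)


def log_search(user_id, timestamp, query, is_saved=False):
    """Store one search; the query dict is serialized to JSON."""
    conn.execute(
        "INSERT INTO searches (user_id, timestamp, query, is_saved) "
        "VALUES (?, ?, ?, ?)",
        (user_id, timestamp, json.dumps(query), int(is_saved)),
    )


def recent_searches(user_id, limit=RECENT_LIMIT):
    """Return the user's most recent queries, newest first."""
    rows = conn.execute(
        "SELECT query FROM searches WHERE user_id = ? "
        "ORDER BY timestamp DESC, id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]


def saved_searches(user_id):
    """Return only the queries the user saved manually, newest first."""
    rows = conn.execute(
        "SELECT query FROM searches WHERE user_id = ? AND is_saved = 1 "
        "ORDER BY timestamp DESC, id DESC",
        (user_id,),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]
```

An endpoint for "saved and recent searches by user" could then return both lists in one response.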
@josh-chamberlain
> the permalink is just the UID, so we should be able to generate it on the fly. Any reason we should store it?
I think that depends on how the saved queries work. I figured the search UIDs would be used to perform a lookup of the search in a database, retrieve the search parameters, and rerun them. Otherwise, how are we matching the UIDs with the particular search? And from where will the UIDs be generated?
> the format of the saved query is up for debate, I think, and should be somewhat future-facing. see https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/14 for a more complete idea of all the things filters could contain. rather than a column for each search facet, maybe the query property should just be a JSON object which stores key: value pairs for different properties.
A JSON object might be the better option if we're intending to simply retrieve the search query and re-execute, rather than perform analyses on the parameters within the query itself (for example, how many people save queries for "Pittsburgh" as opposed to "Cincinnati"). The main challenges would lie in ensuring data integrity and backwards compatibility -- if our search logic changes at some point, we'd want to know if the previously saved searches are still valid (and if not, what to do with them -- perhaps convert them to a new format).
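One possible way to handle the backwards-compatibility concern is to stamp each stored query with a format version and upgrade old ones when they are loaded. The sketch below assumes a hypothetical change between versions (a single `location` string becoming a `locations` list); the field names and version numbers are made up for illustration.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(query):
    """Hypothetical migration: v1 had one 'location', v2 has a 'locations' list."""
    upgraded = {k: v for k, v in query.items() if k != "location"}
    upgraded["locations"] = [query["location"]] if "location" in query else []
    upgraded["version"] = 2
    return upgraded


# Maps a stored version to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def load_query(raw):
    """Parse a stored JSON query and upgrade it to the current format."""
    query = json.loads(raw)
    version = query.get("version", 1)  # treat unversioned queries as v1
    while version < CURRENT_VERSION:
        query = MIGRATIONS[version](query)
        version = query["version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported query version: {version}")
    return query
```

Old saved searches stay valid as long as each format change ships with a migration step.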
At some point in the future, MongoDB or similar NoSQL databases might be useful to consider, as those are designed for storing information with flexible schemas, such as JSON. That might be overkill at the moment, especially since our needs for what to do with the queries are minimal, but it may be worth keeping in the back of our minds.
> a saved search ≠ its results; if you visit a saved search in a year, chances are the results will be different.
@josh-chamberlain In which case, I may need clarification on what is meant by:
> we should keep track of how many saved searches a given source is part of; sounds like an expensive query?
If I'm understanding this correctly, and we are looking to track how often a given source is returned in queries, we could create a table with a counter that increments every time a data source is retrieved in a search/saved search.
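A minimal sketch of that counter idea, with `sqlite3` standing in for the real database. The table name `source_search_counts`, its columns, and both function names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE source_search_counts (
        source_id TEXT PRIMARY KEY,
        times_returned INTEGER NOT NULL DEFAULT 0
    )"""
)


def record_results(source_ids):
    """Increment the counter for every data source returned by one search."""
    for source_id in source_ids:
        # Create the row on first sight, then bump the counter.
        conn.execute(
            "INSERT OR IGNORE INTO source_search_counts (source_id) VALUES (?)",
            (source_id,),
        )
        conn.execute(
            "UPDATE source_search_counts "
            "SET times_returned = times_returned + 1 WHERE source_id = ?",
            (source_id,),
        )


def times_returned(source_id):
    """Return how often a source has appeared in search results so far."""
    row = conn.execute(
        "SELECT times_returned FROM source_search_counts WHERE source_id = ?",
        (source_id,),
    ).fetchone()
    return row[0] if row else 0
```

This trades a small write on every search for a cheap read later, instead of recomputing the count across all saved searches.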
> I think that depends on how the saved queries work. I figured the search UIDs would be used to perform a lookup of the search in a database, retrieve the search parameters, and rerun them. Otherwise, how are we matching the UIDs with the particular search? And from where will the UIDs be generated?
yeah, I think we're saying the same thing—I meant we don't need to store the url if we just know to use that route, but I'm splitting hairs 🤷
> A JSON object might be the better option if we're intending to simply retrieve the search query and re-execute, rather than perform analyses on the parameters within the query itself (for example, how many people save queries for "Pittsburgh" as opposed to "Cincinnati"). The main challenges would lie in ensuring data integrity and backwards compatibility -- if our search logic changes at some point, we'd want to know if the previously saved searches are still valid (and if not, what to do with them -- perhaps convert them to a new format).
yes, I think logging is a separate concern. Backwards compatibility is definitely a consideration, but if we're keeping logs, I think it's OK to change the format. I made a new issue.
> At some point in the future, MongoDB or similar NoSQL databases might be useful to consider, as those are designed for storing information with flexible schemas, such as JSON. That might be overkill at the moment, especially since our needs for what to do with the queries are minimal, but it may be worth keeping in the back of our minds.
Absolutely, we could also use it to dump actual data if we stick some semi-predictable metadata to it and hook it up to elasticsearch.
> If I'm understanding this correctly, and we are looking to track how often a given source is returned in queries, we could create a table with a counter that increments every time a data source is retrieved in a search/saved search.
That's a good idea! I think this is solved by #270
@josh-chamberlain So next question is related to the form of the UUID.
One thing I'll also need clarification on -- does a saved search, when accessed, merely provide a suite of results? Will anyone other than the user themselves know which saved searches are associated with them? Because that will affect how concerned we are about issues like a hacker trying to enumerate through all saved searches.
The way I see it, there are two options, and one option we definitely don't want to do.
Serial IDs can be guessed or enumerated through, meaning that an individual could use them to effectively find every single search in our database, or to identify a user associated with a suite of saved searches by cross-referencing with a number of saved searches clearly related to each other by the proximity of their serial ids. Bad.
This is the easiest solution: we generate a random 128-bit UUID, which is so unlikely to ever be generated again that collisions can be ignored in practice. This would look like `550e8400-e29b-41d4-a716-446655440000`. Simple, but not exactly easy to share.
Generate a random short ID like `2KEOXNx`. Much easier to share; however, it has some issues:
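As a rough illustration of the two random-ID options described in this comment, both can be generated with the Python standard library. The function names, the 62-character alphabet, and the default length of 7 are assumptions chosen to match the example IDs, not a decided format.

```python
import secrets
import string
import uuid

# Letters and digits only, so IDs are safe to paste into a URL path.
ALPHABET = string.ascii_letters + string.digits  # 62 characters


def new_uuid():
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


def new_short_id(existing, length=7):
    """Return a random short ID that is not already in `existing`.

    A 7-character ID has 62**7 possible values, far fewer than a UUID,
    so the sketch checks for duplicates and retries.
    """
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            return candidate
```

In a real deployment the duplicate check would more likely be a unique constraint in the database than an in-memory set.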
@maxachis serial IDs being guessable doesn't strike me as an issue—there's no secure information there. a search itself is just a query and datetime.
a saved search, when accessed, hits the API with the search parameters associated with the search and shows the results in the front end. Half the point of saved searches with a UID is that users can share them with each other.
hey, check it out! I searched for data in pittsburgh and found some cool stuff: https://pdap.io/search/123
the database will store a user ID with each search (or a list of search IDs per user, which seems backward to me), so that when a user goes to find their recent searches, they will see the list. We can be selective about which endpoint reveals that info, but either way, you'd need to somehow know the identity of a user by their ID, which we are trying to protect anyway; a user shouldn't even know their own ID.
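The flow described here (permalink, then stored parameters, then a fresh search) might look roughly like the sketch below. The `searches` table, the JSON `query` column, and the `run_search` callback are placeholders for whatever the API ends up using, and `sqlite3` stands in for the real database.

```python
import json
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searches (id INTEGER PRIMARY KEY, user_id INTEGER, query TEXT)"
)

PREFIX = "/search/"


def search_id_from_permalink(url):
    """Extract the serial ID from a permalink like https://pdap.io/search/123."""
    path = urlparse(url).path
    if not path.startswith(PREFIX):
        raise ValueError(f"not a search permalink: {url}")
    return int(path[len(PREFIX):])


def resolve_permalink(url, run_search):
    """Look up the stored parameters and rerun the search with them.

    Only the query is read; the user_id is never exposed to the visitor.
    Returns None when no search has that ID.
    """
    row = conn.execute(
        "SELECT query FROM searches WHERE id = ?",
        (search_id_from_permalink(url),),
    ).fetchone()
    if row is None:
        return None
    return run_search(json.loads(row[0]))
```

Because the results come from rerunning the query, visiting the same permalink later can return different sources.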
@josh-chamberlain In which case, serial ids won't be a problem! I'll utilize serial ids as the solution.
@maxachis as I thought might happen, there are some changes to the plan which will make saved searches unimportant until we build advanced search filtering (which is maybe the least launch-critical feature), because we are going to stick to a URL format for search. People will only need to save searches if they make a search which can't be expressed via the URL, which they won't be able to do.
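To illustrate why a URL format can replace saved searches for now: if every search parameter round-trips through the URL, the URL itself is the shareable record. The base path and the parameter names `what` and `where` below are assumptions for the sketch, not the actual route.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "https://pdap.io/search"  # assumed route, for illustration only


def build_search_url(params):
    """Encode search parameters into a shareable URL (keys sorted for stability)."""
    return f"{BASE}?{urlencode(sorted(params.items()))}"


def parse_search_url(url):
    """Recover the search parameters from a URL built by build_search_url."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; each key is single-valued here.
    return {key: values[0] for key, values in query.items()}
```

Anything that cannot be flattened into query parameters this way, such as richer filters, is where stored searches would become useful again.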
@josh-chamberlain In that case, I'll close the pull request I made for this just to avoid clutter. We can still use its material (if it's still relevant) when the time comes.
Saved searches are the building blocks of a lot of things! For example, taking actions when the results of a saved search change, or creating a shared source of truth.
This doesn't need to come until we do #14