dpalomino opened 8 years ago
I suggest that before we proceed with tackling this feature, we should answer the usual engineering question: build or buy? (Do we want to create our own custom search engine, or can we leverage an existing search engine?)
I recommend using Elasticsearch; it's open source, easy to implement, has a solid REST API, and is powerful, scalable, and reliable. It includes all the features from our requirements and more. Even if we had time allotted to build an engine in-house, the ongoing optimization, scaling, and maintenance would be a job in itself.
Thanks a lot @seav, @amplifi for the input. Sure, if there is something open source that we can reuse and that adapts well to our needs, so much the better.
Taking a look at the Lucene query syntax (which Elasticsearch is based on), it seems to cover our requirements regarding searching text fields, dates, and integers.
Here are some suggestions on implementing the search feature using Elasticsearch (ES) based on my (limited) research and experiments. Please feel free to comment, ask questions, and poke holes. Hopefully we can then refine this and so have a clearer idea of how to implement search for Sprint 11 (and beyond?).
ES is intended to be run as a cluster of nodes with a single node providing the API endpoint. I assume that this is what @amplifi has already provisioned in AWS. Note that the API has no authentication by default (though there's the Shield plugin for that), but to keep things simple, I would suggest implementing an isolated cluster, allowing only the Cadasta platform server to reach the API via routing/firewall restrictions, and allowing only HTTPS access to the API for added security.
This also means that we need three ES clusters, one each for the staging, demo, and production platforms if we want search to be functional on all three. But these 3 clusters can all probably share the same physical/virtual nodes—we just set different IP ports for the API or designate different nodes for the endpoints.
_Note: To make sense of this section, please read the Basic Concepts section of the ES docs._
Based on the requirements, there is no need to have inter-project search capability. So every search request will always be within a single project. Therefore, I suggest that each project have its own individual ES index. And each class of records (location, party, 3 types of relationships) be implemented as an ES type. Having separate indexes for the projects would make search more performant since ES doesn't need to filter irrelevant records from other projects. This also allows us to fine-tune each index's performance attributes based on the amount of data in each project.
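To make this concrete, here's a minimal sketch of per-project index creation, assuming the official `elasticsearch` Python client, a hypothetical `project-<slug>` naming scheme, and one ES type per record class (only a couple of illustrative fields are mapped):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['https://search.internal:9200'])  # hypothetical endpoint

def create_project_index(project_slug):
    """Create one isolated index per project, one ES type per record class."""
    es.indices.create(
        index='project-{}'.format(project_slug),
        body={'mappings': {
            'location': {'properties': {'type': {'type': 'string'}}},
            'party': {'properties': {'name': {'type': 'string'}}},
            'tenure_rel': {'properties': {'tenure_type': {'type': 'string'}}},
        }},
    )
```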
This structure implies that we can't use many Django-ES libraries or wrappers like the popular Haystack plugin. Many such libraries/wrappers assume that the Django website will only need a single index for search. So we would need to implement a custom solution. But we could still use helper libraries like the Python Elasticsearch Client to simplify coding some stuff.
To first populate the ES indexes, I suggest adding a `reindexsearch` management command which drops everything and then reindexes the whole database. I guess we can also allow an optional project slug argument so that only that project's index gets reindexed.
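A rough sketch of what that command could look like (the `index_project()` helper and the import paths are hypothetical):

```python
# search/management/commands/reindexsearch.py
from django.core.management.base import BaseCommand
from organization.models import Project  # actual import path may differ
from search.index import index_project   # hypothetical helper

class Command(BaseCommand):
    help = 'Drop and rebuild the ES search indexes.'

    def add_arguments(self, parser):
        parser.add_argument('project', nargs='?', default=None,
                            help='Optional project slug; reindex only that project.')

    def handle(self, *args, **options):
        projects = Project.objects.all()
        if options['project']:
            projects = projects.filter(slug=options['project'])
        for project in projects:
            index_project(project)  # hypothetical: drop and rebuild one index
```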
To simplify lookup and update of individual records, I suggest that we reuse the records' ID as their document ID in ES too.
When a record is created/updated/deleted, I think we can just add a `post_save` signal that updates the record's document in ES. The big question that I see is whether this operation should be synchronous or not. Note that we are accessing ES through a network, so we potentially have network reliability issues. I am currently leaning towards making the search index update synchronous to make things simpler, but if there's a lightweight async way (fork a sub-process or thread?) then that would be better. Any failures to update the index should be logged so that we can correct things and prevent the index from slowly becoming out of sync. There are some rare race conditions possible with async (two people updating the same record at the same time and the updates reaching ES in the wrong order) since async is not atomic by definition.
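For illustration, a synchronous signal handler could look roughly like this (the `SpatialUnit` import path and the `serialize()` helper are hypothetical):

```python
import logging

from django.db.models.signals import post_save
from django.dispatch import receiver
from elasticsearch import Elasticsearch
from spatial.models import SpatialUnit  # actual import path may differ

logger = logging.getLogger(__name__)
es = Elasticsearch(['https://search.internal:9200'])  # hypothetical endpoint

@receiver(post_save, sender=SpatialUnit)
def update_search_index(sender, instance, **kwargs):
    try:
        es.index(index='project-{}'.format(instance.project.slug),
                 doc_type='location',
                 id=instance.id,            # reuse the record ID as document ID
                 body=serialize(instance))  # hypothetical flat serializer
    except Exception:
        # Log failures so the index doesn't silently drift out of sync.
        logger.exception('Search index update failed for %s', instance.id)
```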
Any query language we expose in the UI needs to be translated into the ES query language, but this should be pretty easy to implement. Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file. I don't think this is a problem.
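As a sketch of that translation, the user's query string could be handed to ES's `query_string` query, with `from`/`size` for paging (index naming as assumed above):

```python
def search(es, project_slug, user_query, page=1, page_size=20):
    """Translate a UI query into an ES request; ES is stateless, so each
    page is an independent query against the live index."""
    return es.search(
        index='project-{}'.format(project_slug),
        body={
            'query': {'query_string': {'query': user_query}},
            'from': (page - 1) * page_size,
            'size': page_size,
        },
    )
```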
For Django itself, I suggest creating a `search` app to contain all search-related code. And because the documents indexed by ES are just JSON documents, we can reuse the existing DRF serializers, though I think they need to be modified to allow a more "flat" structure since that is closer to how ES actually indexes such documents (so no GeoJSON for locations), and JSON attributes should be folded into the top JSON level instead of residing under the `attributes` field.
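A minimal sketch of that flattening step, assuming we start from the existing serializer output (names hypothetical):

```python
def to_search_document(record_data):
    """Flatten serialized record data for indexing: fold the 'attributes'
    dict into the top level, matching how ES indexes flat JSON documents."""
    doc = dict(record_data)
    doc.update(doc.pop('attributes', {}) or {})
    return doc
```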
For the dev VM, basically we need to mock the ES cluster. I guess we can just implement a simple HTTP server that allows us to dynamically program the HTTP response in the tests. This server should be reusable for both unit tests and functional tests.
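One possible shape for that mock server, with a per-test programmable response (a sketch; all names hypothetical):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MockESHandler(BaseHTTPRequestHandler):
    # Tests overwrite this class attribute to program the next response.
    response_body = {'hits': {'total': 0, 'hits': []}}

    def do_GET(self):
        self._reply()

    def do_POST(self):
        self._reply()

    def _reply(self):
        body = json.dumps(self.response_body).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

def start_mock_es(port=9200):
    """Start the mock ES endpoint on a background thread."""
    server = HTTPServer(('localhost', port), MockESHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```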
A few notes:
Infrastructure has already been provisioned and configured for a single cluster with network-, group-, and user-level access restrictions allowing valid traffic from the platform server only.
> But these 3 clusters can all probably share the same physical/virtual nodes—we just set different IP ports for the API or designate different nodes for the endpoints.
This isn't possible under the existing platform architecture; it would violate our environment isolation requirements. It'd also expose us to a scenario where issues causing excessive processing times on our staging indices would impact our demo and production performance. We'll need to evaluate whether search is necessary for all three environments, and if so, implement our ES model for each.
> To first populate the ES indexes, I suggest adding a `reindexsearch` management command which drops everything and then reindexes the whole database. I guess we can also allow an optional project slug argument so that only that project's index gets reindexed.
We'll want to implement indexing off-the-bat with more granularity, especially when it comes to situations where a release version includes a significant migration. With larger data sets, this will be a non-trivial operation and running a system-wide reindexing won't be feasible once live.
This is also where making good use of aliases right from the start will be a huge help to avoid any downtime during reindexing.
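For illustration, alias-based reindexing could look roughly like this: queries always hit the alias, so swapping it atomically avoids downtime (the `populate()` helper, mappings, and version suffixes are hypothetical):

```python
def reindex_with_alias(es, project_slug, old_version, new_version, mappings):
    alias = 'project-{}'.format(project_slug)
    new_index = '{}-v{}'.format(alias, new_version)
    es.indices.create(index=new_index, body={'mappings': mappings})
    populate(es, new_index)  # hypothetical: bulk-load the project's records
    # Atomically point the alias at the fresh index.
    es.indices.update_aliases(body={'actions': [
        {'remove': {'alias': alias,
                    'index': '{}-v{}'.format(alias, old_version)}},
        {'add': {'alias': alias, 'index': new_index}},
    ]})
```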
> I am currently leaning towards making the search index update synchronous to make things simpler, but if there's a lightweight async way (fork a sub-process or thread?) then that would be better.
Index update processes have long execution times by nature; Lucene writes are expensive. Updates take longer than the initial indexing. For our purposes, we should seek to avoid too-frequent updates and prioritize batch updates via the bulk API (which allows ES to leverage its own optimizations to reduce run times, minimizes connection times and request overhead especially with SSL, etc). Given our data sets, we have less of a need for immediate sync than for reliable performance. We should use an asynchronous process for updates; a queue with deduplication (to prevent the rare race condition you mentioned) run via a single background worker or using transactional locking -- this also greatly simplifies error handling and prevents replication issues. Initial runs or re-indexing should use ranges executed periodically to avoid delays from queuing large volumes of data. Combining these two methods to cover the two distinct indexing cases should yield the best performance and let us stagger workload more efficiently.
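A sketch of the batch flush using the client's bulk helper (the queue/dedup layer and the `serialize()`/`record_type()` helpers are assumed to exist elsewhere):

```python
from elasticsearch import helpers

def flush_queue(es, queued_records):
    """Send one deduplicated batch of index operations via the bulk API."""
    actions = ({
        '_op_type': 'index',
        '_index': 'project-{}'.format(r.project.slug),
        '_type': record_type(r),   # hypothetical: 'location', 'party', ...
        '_id': r.id,               # record ID doubles as document ID
        '_source': serialize(r),   # hypothetical flat serializer
    } for r in queued_records)
    helpers.bulk(es, actions)
```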
> Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file. I don't think this is a problem.
For pagination, results are sorted and can be accessed using the 'size' and 'from' parameters, so combining that with bulk updates as above we can basically ensure a user gets a consistent set of results per user query. We should set a conservative limit for max results up front to prevent queries from getting too intensive.
This can all be discussed further in the search call next week :)
@seav how would this work with django-tutelary and the current permissions?
@wonderchook, I am assuming that any permissions on records will only be stored in the platform and not baked into the search index, because permissions are tied to the user and are not an attribute of the records. So when records are returned as search results, we would need to check the user's permission to view each record and then discard records from the results if the user doesn't have view permissions. In addition, I assume that if a user doesn't have the list permission, then the search function would not be available to them.
This would be complicated and inefficient to implement in a user-friendly way though because we can't say upfront that there are a total of N results because computing N based on permissions is inefficient, and paging through the results would be complex. (You can ask Elasticsearch to return results 11–20 but what happens if the 19th result is discarded? Do you make another query to retrieve the 21st result? But what if that is also discarded?)
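Purely to illustrate the awkwardness: a discard-and-refill pager would have to keep re-querying until a page fills up (all helper names hypothetical):

```python
def permitted_page(user, project_slug, query, page, page_size=10):
    results, offset = [], 0
    while len(results) < page * page_size:
        batch = es_hits(project_slug, query, offset, page_size * 2)  # hypothetical
        if not batch:
            break  # ran out of results before filling the page
        results.extend(hit for hit in batch if user_can_view(user, hit))
        offset += len(batch)
    return results[(page - 1) * page_size:page * page_size]
```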
An alternative is to display the results whether the user has permission to view the resulting records or not. But if the user follows the link to see a record's detail, then they get a permission denied error. This is not satisfactory from a privacy point of view though because the search results would necessarily leak information about the records.
I've updated the search wireframes to reflect our discussion about what we can accomplish for phase 1: https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
Summary
@dpalomino Can you review and let me know if this covers everything? I wanted to have a clear set of wireframes for this next sprint's planning meeting. I will add the additional wireframes but need to discuss the phases with you. Thanks!
I've updated the search wireframes based on our latest discussion: https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
I have the phases broken out as such:
Phase 1
Phase 2
Phase 3 (not included in mocks yet)
Additional Options
Thanks a lot @clash99 for the fantastic wireframes! I think that covers everything for phase 1 and the following phases.
@seav @linzjax please let us know if this makes sense to you.
Some things we should discuss before we start implementing the search UI.
I don’t like the split between tabs on top of the search results and “narrow results” checkboxes on the left. They should not be separated because they both address the same goal: Narrowing down the search results.
Wouldn’t it be possible to combine both into the left-hand-side checkbox panel? Initially, you get checkboxes for locations, relationships, parties and resources. When you select a checkbox, the subcategories expand, e.g. Community Boundary and Parcel for locations.
Thanks a lot @oliverroick for the feedback, let's discuss later on during the dev call.
Regarding the "search guidelines", I can draft the texts for this.
I've drafted some search guidelines in this doc: https://docs.google.com/document/d/16RevzILdt8HrWd5TEX9D5R7nQoEef1kf4x104FsUEXU/edit#
I've updated the search wireframes with the following:
https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
Apart from any bugfixing that may come up, I think we can move this issue to the "Released" state.
We need a feature to execute search queries and present the records that match those queries.
Requirements from devwiki
Ability to do search queries and filters, and to present the records (either on a list or a map) that match these queries.
Detailed Feature Description
In a first stage this feature will be implemented only at a project level, so project users will be able to search and filter results for records contained in that project (for now, not at organization or platform level).
Searches could range from very simple single-text queries to multi-field searches filtering by dates, or with `select_one` or `boolean` fields, for instance. We will first follow a simple, intuitive format for the text-based queries, and we will progressively introduce more syntax complexity. In a first step, searches will be done using a search box. In a later stage we will provide an "Advanced Search" page where we can include more logic in the searches and filters, in a more user-friendly way.
Examples of simple search box queries:
We can also create more powerful queries with a bit more syntax complexity, using field names and types (`boolean`, `date`, `integer`, etc.): `gender:male dob>1981-09-05 John`. And so on. Results will be presented both on the map and in a list of locations. For simplicity, results will always be shown ordered by locations and not by parties (although the search queries will apply to fields in any entity type).
The main inconvenience of this syntax is that the user needs to know both the defined syntax and the field names to create the more complex queries (so it would not be very accessible to less technical users). To make these complex searches easier for the user, we will create a new "Advanced search" section from which users can select the fields (from the project data schema) to use for searching and filtering. Requirements and wireframes for this will be developed in a second stage.
This feature should be available for project users, and for public users in the case of public projects. Permissions needed to see search results would be SU.List permissions.
User Stories
- As a project/public user I want to run text-based queries for any text field, including `text`, `select_one` and `select_multiple` types
- As a project/public user I want to run queries for `integer`, `decimal`, `date`, `dateTime`, `time` and `boolean` types

Requirements
Text-based queries
As a project/public user I want to run text-based queries for any text field, including `text`, `select_one` and `select_multiple` types. Searches cover the `select_one` and `select_multiple` types, meaning that the words used in the search query will be matched against the different choice names defined in the `name` column of the `choices` tab for the `select_one` and `select_multiple` types.

Some examples of text-based queries are:
Text-based queries for specific fields
As a project/public user I want to run text-based queries for specific fields in the project schema
Some examples of these are:
Number-based queries for specific fields
As a project/public user I want to run queries for `integer`, `decimal`, `date`, `dateTime` and `time` types.

We will be following ISO 8601 for timestamp types. Summary of date and time formats considered:
When building a query involving timestamps, we need to specify the field name followed by the operator. Supported operators are:
The same operators, following the same notation, will be used for `integer` and `decimal`.

Search box
A search box will be integrated in the project view in a first stage (we will discuss the advanced search feature later on). In a second stage, auto-suggestions will be shown to the user after a minimum of 3-4 characters has been entered, for the records matching the mandatory fields. Again, see the wireframes for details.
Presentation of search results
For details, please refer to the wireframes.
The results will be presented in tabs distinguishing among:
Matched fields will be highlighted.
Based on the active tab, some pre-defined filters will be shown on the right panel (location type, party type, relation type).
Access to the details of each result
Clicking on each of the records shown in the list of results will open the location (i.e., the overview tab, as currently implemented on the platform when clicking on a spatial unit), the relationships tab for relations, and a new party section for checking the party details (see wireframes).
Exporting search results
Implementation details
We will be using Elasticsearch, an open-source search engine built on Lucene, whose query syntax matches the syntax suggested in the user stories pretty well.
Infrastructure
We will have three isolated clusters: staging, demo, and production.
Elasticsearch Data Structure
As search queries will be executed per project, each project will have its own ES index. This will allow fine-tuning per project if needed. Each class of records (location, party, and the 3 types of relationships) will be implemented as an ES type.
Data Flow
We will be implementing a search index batch-update daemon to update the index after record changes. See issue #908 for a very detailed explanation. The reason to do this asynchronously is that it is an expensive operation.
Included here is some of the information discussed in GitHub:
The details about how to recreate the index for one or several projects can be checked in issue #909.
Making Queries
Any query language we expose in the UI needs to be translated into the ES query language, but this should be pretty easy to implement. Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file.
Ancillary Information
Related open issues in GitHub:
External references:
Wireframes
A link to the wireframes can be found here.
Several tasks are included here: