dpalomino opened 8 years ago
I suggest that before we proceed with tackling this feature, we should answer the usual engineering question: build or buy? (Do we want to create our own custom search engine, or can we leverage an existing search engine?)
I recommend using Elasticsearch; it's open source, easy to implement, has a solid REST API, and is powerful, scalable, and reliable. It includes all the features from our requirements and more. Even if we had time allotted to build an engine in-house, the ongoing optimization, scaling, and maintenance would be a job in itself.
Thanks a lot @seav, @amplifi for the input. Sure, if there is something open source that we can reuse and that adapts well to our needs, so much the better.
Taking a look at the Lucene query syntax (which Elasticsearch is based on), it seems to cover our requirements regarding searching text fields, dates, and integers.
Here are some suggestions on implementing the search feature using Elasticsearch (ES) based on my (limited) research and experiments. Please feel free to comment, ask questions, and poke holes. Hopefully we can then refine this and so have a clearer idea of how to implement search for Sprint 11 (and beyond?).
ES is intended to be run as a cluster of nodes with a single node providing the API endpoint. I assume that this is what @amplifi has already provisioned in AWS. Note that the API has no authentication by default (though there's the Shield plugin for that), but to keep things simple, I would suggest implementing an isolated cluster, allowing only the Cadasta platform server to reach the API via routing/firewall restrictions, and allowing only HTTPS access to the API for added security.
This also means that we need three ES clusters, one each for the staging, demo, and production platforms if we want search to be functional on all three. But these 3 clusters can all probably share the same physical/virtual nodes—we just set different IP ports for the API or designate different nodes for the endpoints.
_Note: To make sense of this section, please read the Basic Concepts section of the ES docs._
Based on the requirements, there is no need to have inter-project search capability. So every search request will always be within a single project. Therefore, I suggest that each project have its own individual ES index. And each class of records (location, party, 3 types of relationships) be implemented as an ES type. Having separate indexes for the projects would make search more performant since ES doesn't need to filter irrelevant records from other projects. This also allows us to fine-tune each index's performance attributes based on the amount of data in each project.
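To make this concrete, here's a minimal sketch of per-project index creation, assuming the official `elasticsearch` Python client, a hypothetical `project-<slug>` naming scheme, and one ES type per record class (only a couple of illustrative fields are mapped):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['https://search.internal:9200'])  # hypothetical endpoint

def create_project_index(project_slug):
    """Create one isolated index per project, one ES type per record class."""
    es.indices.create(
        index='project-{}'.format(project_slug),
        body={'mappings': {
            'location': {'properties': {'type': {'type': 'string'}}},
            'party': {'properties': {'name': {'type': 'string'}}},
            'tenure_rel': {'properties': {'tenure_type': {'type': 'string'}}},
        }},
    )
```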
This structure implies that we can't use many Django-ES libraries or wrappers like the popular Haystack plugin. Many such libraries/wrappers assume that the Django website will only need a single index for search. So we would need to implement a custom solution. But we could still use helper libraries like the Python Elasticsearch Client to simplify coding some stuff.
To first populate the ES indexes, I suggest adding a `reindexsearch` management command which drops everything and then reindexes the whole database. I guess we can also allow an optional project slug argument so that only that project's index gets reindexed.
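A rough sketch of what that command could look like (the `index_project()` helper and the import paths are hypothetical):

```python
# search/management/commands/reindexsearch.py
from django.core.management.base import BaseCommand
from organization.models import Project  # actual import path may differ
from search.index import index_project   # hypothetical helper

class Command(BaseCommand):
    help = 'Drop and rebuild the ES search indexes.'

    def add_arguments(self, parser):
        parser.add_argument('project', nargs='?', default=None,
                            help='Optional project slug; reindex only that project.')

    def handle(self, *args, **options):
        projects = Project.objects.all()
        if options['project']:
            projects = projects.filter(slug=options['project'])
        for project in projects:
            index_project(project)  # hypothetical: drop and rebuild one index
```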
To simplify lookup and update of individual records, I suggest that we reuse the records' ID as their document ID in ES too.
When a record is created/updated/deleted, I think we can just add a `post_save` signal that updates the record's document in ES. The big question that I see is whether this operation should be synchronous or not. Note that we are accessing ES through a network, so we potentially have network reliability issues. I am currently leaning towards making the search index update synchronous to make things simpler, but if there's a lightweight async way (fork a sub-process or thread?) then that would be better. Any failures to update the index should be logged so that we can correct things and prevent the index from slowly becoming out of sync. There are some rare race conditions possible with async (two people updating the same record at the same time and the updates reaching ES in the wrong order) since async is not atomic by definition.
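For illustration, a synchronous signal handler could look roughly like this (the `SpatialUnit` import path and the `serialize()` helper are hypothetical):

```python
import logging

from django.db.models.signals import post_save
from django.dispatch import receiver
from elasticsearch import Elasticsearch
from spatial.models import SpatialUnit  # actual import path may differ

logger = logging.getLogger(__name__)
es = Elasticsearch(['https://search.internal:9200'])  # hypothetical endpoint

@receiver(post_save, sender=SpatialUnit)
def update_search_index(sender, instance, **kwargs):
    try:
        es.index(index='project-{}'.format(instance.project.slug),
                 doc_type='location',
                 id=instance.id,            # reuse the record ID as document ID
                 body=serialize(instance))  # hypothetical flat serializer
    except Exception:
        # Log failures so the index doesn't silently drift out of sync.
        logger.exception('Search index update failed for %s', instance.id)
```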
Any query language we expose in the UI needs to be translated into the ES query language, but this should be pretty easy to implement. Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file. I don't think this is a problem.
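As a sketch of that translation, the user's query string could be handed to ES's `query_string` query, with `from`/`size` for paging (index naming as assumed above):

```python
def search(es, project_slug, user_query, page=1, page_size=20):
    """Translate a UI query into an ES request; ES is stateless, so each
    page is an independent query against the live index."""
    return es.search(
        index='project-{}'.format(project_slug),
        body={
            'query': {'query_string': {'query': user_query}},
            'from': (page - 1) * page_size,
            'size': page_size,
        },
    )
```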
For Django itself, I suggest creating a `search` app to contain all search-related code. And because the documents indexed by ES are just JSON documents, we can reuse the existing DRF serializers, though I think they need to be modified to allow a more "flat" structure since that is closer to how ES actually indexes such documents (so no GeoJSON for locations), and JSON attributes should be folded into the top JSON level instead of residing under the `attributes` field.
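A minimal sketch of that flattening step, assuming we start from the existing serializer output (names hypothetical):

```python
def to_search_document(record_data):
    """Flatten serialized record data for indexing: fold the 'attributes'
    dict into the top level, matching how ES indexes flat JSON documents."""
    doc = dict(record_data)
    doc.update(doc.pop('attributes', {}) or {})
    return doc
```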
For the dev VM, basically we need to mock the ES cluster. I guess we can just implement a simple HTTP server that allows us to dynamically program the HTTP response in the tests. This server should be reusable for both unit tests and functional tests.
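One possible shape for that mock server, with a per-test programmable response (a sketch; all names hypothetical):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MockESHandler(BaseHTTPRequestHandler):
    # Tests overwrite this class attribute to program the next response.
    response_body = {'hits': {'total': 0, 'hits': []}}

    def do_GET(self):
        self._reply()

    def do_POST(self):
        self._reply()

    def _reply(self):
        body = json.dumps(self.response_body).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

def start_mock_es(port=9200):
    """Start the mock ES endpoint on a background thread."""
    server = HTTPServer(('localhost', port), MockESHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```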
A few notes:
Infrastructure has already been provisioned and configured for a single cluster with network-, group-, and user-level access restrictions allowing valid traffic from the platform server only.
> But these 3 clusters can all probably share the same physical/virtual nodes—we just set different IP ports for the API or designate different nodes for the endpoints.
This isn't possible under the existing platform architecture; it would violate our environment isolation requirements. It'd also expose us to a scenario where issues causing excessive processing times on our staging indices would impact our demo and production performance. We'll need to evaluate whether search is necessary for all three environments, and if so, implement our ES model for each.
> To first populate the ES indexes, I suggest adding a `reindexsearch` management command which drops everything and then reindexes the whole database. I guess we can also allow an optional project slug argument so that only that project's index gets reindexed.
We'll want to implement indexing off-the-bat with more granularity, especially when it comes to situations where a release version includes a significant migration. With larger data sets, this will be a non-trivial operation and running a system-wide reindexing won't be feasible once live.
This is also where making good use of aliases right from the start will be a huge help to avoid any downtime during reindexing.
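For illustration, alias-based reindexing could look roughly like this: queries always hit the alias, so swapping it atomically avoids downtime (the `populate()` helper, mappings, and version suffixes are hypothetical):

```python
def reindex_with_alias(es, project_slug, old_version, new_version, mappings):
    alias = 'project-{}'.format(project_slug)
    new_index = '{}-v{}'.format(alias, new_version)
    es.indices.create(index=new_index, body={'mappings': mappings})
    populate(es, new_index)  # hypothetical: bulk-load the project's records
    # Atomically point the alias at the fresh index.
    es.indices.update_aliases(body={'actions': [
        {'remove': {'alias': alias,
                    'index': '{}-v{}'.format(alias, old_version)}},
        {'add': {'alias': alias, 'index': new_index}},
    ]})
```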
> I am currently leaning towards making the search index update synchronous to make things simpler, but if there's a lightweight async way (fork a sub-process or thread?) then that would be better.
Index update processes have long execution times by nature; Lucene writes are expensive. Updates take longer than the initial indexing. For our purposes, we should seek to avoid too-frequent updates and prioritize batch updates via the bulk API (which allows ES to leverage its own optimizations to reduce run times, minimizes connection times and request overhead especially with SSL, etc). Given our data sets, we have less of a need for immediate sync than for reliable performance. We should use an asynchronous process for updates; a queue with deduplication (to prevent the rare race condition you mentioned) run via a single background worker or using transactional locking -- this also greatly simplifies error handling and prevents replication issues. Initial runs or re-indexing should use ranges executed periodically to avoid delays from queuing large volumes of data. Combining these two methods to cover the two distinct indexing cases should yield the best performance and let us stagger workload more efficiently.
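A sketch of the batch flush using the client's bulk helper (the queue/dedup layer and the `serialize()`/`record_type()` helpers are assumed to exist elsewhere):

```python
from elasticsearch import helpers

def flush_queue(es, queued_records):
    """Send one deduplicated batch of index operations via the bulk API."""
    actions = ({
        '_op_type': 'index',
        '_index': 'project-{}'.format(r.project.slug),
        '_type': record_type(r),   # hypothetical: 'location', 'party', ...
        '_id': r.id,               # record ID doubles as document ID
        '_source': serialize(r),   # hypothetical flat serializer
    } for r in queued_records)
    helpers.bulk(es, actions)
```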
> Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file. I don't think this is a problem.
For pagination, results are sorted and can be accessed using the 'size' and 'from' parameters, so combining that with bulk updates as above we can basically ensure a user gets a consistent set of results per user query. We should set a conservative limit for max results up front to prevent queries from getting too intensive.
This can all be discussed further in the search call next week :)
@seav how would this work with django-tutelary and the current permissions?
@wonderchook, I am assuming that any permissions on records will only be stored in the platform and not baked into the search index, because permissions are tied to the user and are not an attribute of the records. So when records are returned as search results, we would need to check the user's permission to view each record and then discard records from the results if the user doesn't have view permissions. In addition, I assume that if a user doesn't have the list permission, then the search function would not be available to them.
This would be complicated and inefficient to implement in a user-friendly way though because we can't say upfront that there are a total of N results because computing N based on permissions is inefficient, and paging through the results would be complex. (You can ask Elasticsearch to return results 11–20 but what happens if the 19th result is discarded? Do you make another query to retrieve the 21st result? But what if that is also discarded?)
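Purely to illustrate the awkwardness: a discard-and-refill pager would have to keep re-querying until a page fills up (all helper names hypothetical):

```python
def permitted_page(user, project_slug, query, page, page_size=10):
    results, offset = [], 0
    while len(results) < page * page_size:
        batch = es_hits(project_slug, query, offset, page_size * 2)  # hypothetical
        if not batch:
            break  # ran out of results before filling the page
        results.extend(hit for hit in batch if user_can_view(user, hit))
        offset += len(batch)
    return results[(page - 1) * page_size:page * page_size]
```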
An alternative is to display the results whether the user has permission to view the resulting records or not. But if the user follows the link to see a record's detail, then they get a permission denied error. This is not satisfactory from a privacy point of view though because the search results would necessarily leak information about the records.
I've updated the search wireframes to reflect our discussion about what we can accomplish for phase 1: https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
Summary
@dpalomino Can you review and let me know if this covers everything? I wanted to have a clear set of wireframes for this next sprint's planning meeting. I will add the additional wireframes but need to discuss the phases with you. Thanks!
I've updated the search wireframes based on our latest discussion: https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
I have the phases broken out as such:
Phase 1
Phase 2
Phase 3 (not included in mocks yet)
Additional Options
Thanks a lot @clash99 for the fantastic wireframes! I think that covers everything for phase 1 and the following phases.
@seav @linzjax please let us know if this makes sense to you.
Some things we should discuss before we start implementing the search UI.
I don’t like the split between tabs on top of the search results and “narrow results” checkboxes on the left. They should not be separated because they both address the same goal: Narrowing down the search results.
Wouldn’t it be possible to combine both into the left-hand-side checkbox panel? Initially, you get checkboxes for locations, relationships, parties and resources. When you select a checkbox, the subcategories expand, e.g. Community Boundary and Parcel for locations.
Thanks a lot @oliverroick for the feedback, let's discuss later on during the dev call.
Regarding the "search guidelines", I can draft the texts for this.
I've drafted some search guidelines in this doc: https://docs.google.com/document/d/16RevzILdt8HrWd5TEX9D5R7nQoEef1kf4x104FsUEXU/edit#
I've updated the search wireframes with the following:
https://drive.google.com/open?id=0BzpiEtMtHC3rVFlVelgzNHBxaUk
Apart from any bugfixing that may come up, I think we can move this issue to the "Released" state.
We need a feature to execute search queries and present the records that match those queries.
Requirements from devwiki
Ability to do search queries and filters, and to present the records (either on a list or a map) that match these queries.
Detailed Feature Description
In a first stage this feature will be implemented only at a project level, so project users will be able to search and filter results for records contained in that project (for now, not at organization or platform level).
Searches could range from very simple single-text queries to multi-field searches filtering by dates, or with `select_one` or `boolean` fields, for instance. We will first follow a simple, intuitive format for the text-based queries, and we will progressively introduce more syntax complexity. In a first step, searches will be done using a search box. In a later stage we will provide an "Advanced Search" page where we can include more logic in the searches and filters, in a more user-friendly way.
Examples of simple search box queries:
We can also create more powerful queries with a bit more syntax complexity, using field names and types (`boolean`, `date`, `integer`, etc.): `gender:male dob>1981-09-05 John`. And so on. Results will be presented both on the map and in a list of locations. For simplicity, results will always be shown ordered by locations and not by parties (although the search queries will apply to fields in any entity type).
The main inconvenience of this syntax is that the user needs to know both the defined syntax and the field names to create the more complex queries (so it would not be very accessible to less technical users). To make these complex searches easier for the user, we will create a new "Advanced search" section from which users can select the fields (from the project data schema) to use for searching and filtering. Requirements and wireframes for this will be developed in a second stage.
This feature should be available for project users, and for public users in the case of public projects. Permissions needed to see search results would be SU.List permissions.
User Stories
- As a project/public user I want to run text-based queries for any text field, including `text`, `select_one` and `select_multiple` types
- As a project/public user I want to run queries for `integer`, `decimal`, `date`, `dateTime`, `time` and `boolean` types

Requirements
Text-based queries
As a project/public user I want to run text-based queries for any text field, including `text`, `select_one` and `select_multiple` types. Searches cover the `select_one` and `select_multiple` types, meaning that the words used in the search query will be matched against the different choice names defined in the `name` column of the `choices` tab for the `select_one` and `select_multiple` types.

Some examples of text-based queries are:
Text-based queries for specific fields
As a project/public user I want to run text-based queries for specific fields in the project schema
Some examples of these are:
Number-based queries for specific fields
As a project/public user I want to run queries for `integer`, `decimal`, `date`, `dateTime` and `time` types.

We will be following ISO 8601 for timestamp types. Summary of date and time formats considered:
When building a query involving timestamps, we need to specify the field name followed by the operator. Supported operators are:
The same operators, following the same notation, will be used for `integer` and `decimal`.

Search box
A search box will be integrated in the project view in a first stage (we will discuss the advanced search feature later on). In a second stage, auto-suggestions will be shown to the user after a minimum of 3-4 characters has been entered, for the records matching the mandatory fields. Again, see the wireframes for details.
Presentation of search results
For details, please refer to the wireframes.
The results will be presented in tabs distinguishing among:
Matched fields will be highlighted.
Based on the active tab, some pre-defined filters will be shown on the right panel (location type, party type, relation type).
Access to the details of each result
Clicking on each of the records shown in the list of results will open the location (i.e., the overview tab, as currently implemented on the platform when clicking on a spatial unit), the relationships tab for relations, and a new party section for checking the party details (see wireframes).
Exporting search results
Implementation details
We will be using Elasticsearch, an open-source search engine built on Lucene, whose query syntax matches the syntax suggested in the user stories pretty well.
Infrastructure
We will have three isolated clusters: staging, demo, and production.
Elasticsearch Data Structure
As search queries will be executed per project, each project will have its own ES index. This will allow fine-tuning per project if needed. Each class of records (location, party, and the 3 types of relationships) will be implemented as an ES type.
Data Flow
We will be implementing a search index batch-update daemon to update the index after record changes. See issue #908 for a very detailed explanation. The reason to do this asynchronously is that it is an expensive operation.
Included here is some of the information discussed in GitHub:
The details about how to recreate the index for one or several projects can be checked in issue #909.
Making Queries
Any query language we expose in the UI needs to be translated into the ES query language, but this should be pretty easy to implement. Pagination is also supported by ES, but because ES does not maintain state across API accesses, it is possible that data would be inconsistent when the user pages through the search results (a newly added record could appear in the results) or subsequently exports the results into an Excel file.
Ancillary Information
Related open issues in GitHub:
External references:
Wireframes
A link to the wireframes can be found here.
Several tasks are included here: