Search page - Githubissues

Search support like CKAN Classic in Next Gen frontend.

User Stories

As a User, I want to explore and find datasets on the platform so that I can see what is there and find what I want quickly (or know it does not exist).

As a User, I want to browse all datasets available on the data portal so that I can quickly find what I need.

As a User, I want to have search functionality so that I can type in keywords and get a list of datasets that satisfy my needs.

We need to replicate the CKAN search UI.

Key features

General search query
- Query in the url (people can share the url)
Facets displayed e.g.
- Organizations
- Tags
- Groups
- Formats ...
Facets click supported ...
Sort by Relevance or Name

Acceptance criteria

[ ] Search page like demo.ckan.org
- [ ] Main text search
- [ ] Facets
[ ] Also can support DataHub.io
[ ] Forward-looking (Elasticsearch), i.e. easy to plug in an Elasticsearch setup

Tasks

[x] Research existing CKAN and DataHub setup
[ ] Define the API
- [ ] Research ES query structure
[ ] Wire it up with CKAN
- [ ] Convert CKAN result structure to our structure
- [ ] ckanPackage2DataPackage function
- [ ] normalizeCkan() - small function that converts a CKAN response into a MetaStore-like response
- [ ] get list of top datasets to display (depends on how we sort by default)
- [ ] search box - simply query the API
[ ] Mock some facet data based on CKAN like facet data (also check out ES facet structure)
- [ ] Use it
[ ] Support for form info
- [ ] e.g. relevance
- [ ] clicking on facets
[ ] Update the search template to have missing features
- [ ] facets
- [ ] sort by drop down

Recommendation

search URI API

TODO: map from CKAN to our support

Support q?...

Logic API

api.search(query, context)

// note on query - some ambiguity between this object and how it is serialized "over the wire"
// cf https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html - look at parameters vs the DSL
// here we are a bit more DSL oriented
query = {
  // query string query (either passed directly to CKAN, to MetaStore, or to ES if we were using that). [Future] we'd like to follow ES syntax more closely
  // named q following ES convention for its search API https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
  // worth looking at how this works under the hood ...
  // cf https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
  q: "abc title:jones",

  // filters TODO ...

  // standard
  size: 10,
  from: 10,
  sort: {
    field: direction
  }
}

// any other context to pass e.g. auth token etc
context = {
  token: ...
}
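
Since the query has to live in the URL so that people can share it, a small helper could turn the page's query string into the query object above. This is a minimal sketch, not a settled design: the function name `queryFromUrl`, the defaults (`size` 10, `from` 0) and the `sort=field:direction` URL form are all assumptions made for illustration.

```javascript
// Hypothetical helper: parse the search page's query string into the
// query object described above. Defaults and the sort syntax are assumed.
function queryFromUrl(search) {
  const params = new URLSearchParams(search);
  const query = { q: params.get("q") || "" };

  // Pagination: fall back to the assumed defaults when missing or invalid.
  const size = Number.parseInt(params.get("size"), 10);
  const from = Number.parseInt(params.get("from"), 10);
  query.size = Number.isInteger(size) && size > 0 ? size : 10;
  query.from = Number.isInteger(from) && from >= 0 ? from : 0;

  // Assumed URL form for sorting: sort=field:direction (e.g. sort=name:asc).
  const sort = params.get("sort");
  if (sort) {
    const [field, direction = "asc"] = sort.split(":");
    query.sort = { [field]: direction };
  }
  return query;
}

// Example: the object a controller could hand to api.search(query, context).
const exampleQuery = queryFromUrl("?q=spending&size=20&from=40&sort=name:desc");
```

Keeping this parsing in one place means the template only ever sees the query object, never the raw URL.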

Results:

{
  "results": [
    // list of Data Packages
  ],

  "count": 12,
  "sort": "score desc, metadata_modified desc",
  "facets": {},
  "search_facets": {},
}

Templating

Template gets:

query
results

Page

<input name="q" value="{{query.q}}">
for each filter in filter ...

Sequence diagram of Search flow

Below is a detailed flow of how it is going to work. Assume we're working with a CKAN backend:

sequenceDiagram
    Controller->>Model: Standard Query
    Model-->>Utils: I need to call CKAN/Datahub backend
    Utils-->>Model: CKAN/Datahub Query
    Model-->>CKAN/Datahub: CKAN/Datahub Query
    CKAN/Datahub-->>Model: CKAN/Datahub like result
    Model-->>Utils: I need standard result
    Utils-->>Model: Standard result
    Model->>Controller: Standard result
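
The Model step in the diagram could be sketched roughly as follows. The `backend` object is a hypothetical bundle of the two Utils conversions plus the HTTP call; none of its method names come from the existing code, and passing it in as an argument is only one possible way to keep CKAN and DataHub interchangeable.

```javascript
// Sketch of the Model step: convert the standard query, call the backend,
// convert the backend result back. `backend` and its methods are assumed names.
async function search(query, context, backend) {
  // Utils: standard query -> CKAN/DataHub query
  const backendQuery = backend.toBackendQuery(query);
  // Call the CKAN/DataHub API (context carries e.g. the auth token)
  const backendResult = await backend.fetch(backendQuery, context);
  // Utils: CKAN/DataHub-like result -> standard result
  return backend.toStandardResult(backendResult);
}

// Usage with a stub backend, so the flow can be exercised without a network.
const stubBackend = {
  toBackendQuery: (query) => ({ q: query.q, rows: query.size }),
  fetch: async (backendQuery) => ({
    result: { count: 1, results: [{ name: "demo", matched: backendQuery.q }] },
  }),
  toStandardResult: (raw) => ({
    results: raw.result.results,
    count: raw.result.count,
  }),
};
```

The Controller then only deals with the standard query and the standard result, which is what makes a later Elasticsearch backend easy to plug in.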

Analysis

UI for search

API for search

Query object


// DataHub
{
  q: "...", // match-all query string; searches the following properties: title, datahub.owner, datahub.ownerid, datapackage.readme
  size: 10,
  from: 20
}

// CKAN

{
  q: (string) – the solr query. Optional. Default: "*:*"
  fq: (string) – any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.

  // facets
  facet: (string) – whether to enable faceted results. Default: True.
  facet.mincount: (int) – the minimum count a facet value must have to be included in the results.
  facet.limit: (int) – the maximum number of values the facet fields return. A negative value means unlimited. This can be set instance-wide with the search.facets.limit config option. Default is 50.
  facet.field: (list of strings) – the fields to facet upon. Default empty. If empty, then the returned facet information is empty.

  // standard
  rows: (int) – the number of matching rows to return. There is a hard limit of 1000 datasets per query.
  start: (int) – the offset in the complete result for where the set of returned datasets should begin.
  sort: (string) – sorting of the search results. Optional. Default: 'relevance asc, metadata_modified desc'. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.

  // data catalog specific / business logic
  include_drafts: (bool) – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
  include_private: (bool) – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False.
  use_default_schema: (bool) – use default package schema instead of a custom schema defined with an IDatasetForm plugin (default: False)
}
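
Comparing the two query objects suggests a fairly mechanical mapping from our standard query to CKAN's `package_search` parameters: `size` becomes `rows`, `from` becomes `start`, and the sort object becomes Solr's comma-separated string. A minimal sketch, assuming the standard query shape proposed in this issue (the function name `toCkanParams` is made up here, and facets and filters are left out because they are still TODO):

```javascript
// Hypothetical conversion: standard query -> CKAN package_search parameters.
function toCkanParams(query) {
  // "*:*" is CKAN's documented default when no query string is given.
  const params = { q: query.q || "*:*" };
  if (query.size !== undefined) params.rows = query.size;
  if (query.from !== undefined) params.start = query.from;
  if (query.sort) {
    // { name: "asc" } -> "name asc"; several fields are comma-separated.
    params.sort = Object.entries(query.sort)
      .map(([field, direction]) => `${field} ${direction}`)
      .join(", ");
  }
  return params;
}

// The parameters can then be serialized into the API URL's query string.
const ckanParams = toCkanParams({ q: "spending", size: 10, from: 20, sort: { name: "asc" } });
const ckanQueryString = new URLSearchParams(ckanParams).toString();
```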

Result


// DataHub
"results": [
  ... // list of data packages
],
"summary": {
  "total": 1109,
  "totalBytes": 61093521506
}

// CKAN

"results": [
  ...
],
"count": 1,
"sort": "score desc, metadata_modified desc",
"facets": {},

Situation

Elasticsearch

TODO:

Understand faceting (aka aggregations)
Decide on query structure and how much we parse vs pass down
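
As a starting point for the faceting question: an Elasticsearch terms aggregation returns `buckets` of `{ key, doc_count }`, while CKAN's `search_facets` groups items as `{ name, display_name, count }` under each field. A rough sketch of bridging the two, assuming we keep a CKAN-like facet shape for the template (the function name `facetsFromAggregations` is made up, and using the field name as `title` is an assumption):

```javascript
// Hypothetical conversion: ES terms aggregations -> CKAN-like search_facets.
function facetsFromAggregations(aggregations) {
  const facets = {};
  for (const [field, aggregation] of Object.entries(aggregations || {})) {
    facets[field] = {
      title: field, // assumption: reuse the field name as the facet title
      items: (aggregation.buckets || []).map((bucket) => ({
        name: String(bucket.key),
        display_name: String(bucket.key),
        count: bucket.doc_count,
      })),
    };
  }
  return facets;
}

// Example input shaped like the "aggregations" part of an ES response.
const mockedFacets = facetsFromAggregations({
  tags: { buckets: [{ key: "economy", doc_count: 6 }] },
});
```

The same output shape could also serve as the mocked facet data mentioned in the tasks.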

{
  "query": {
    "match_all": {}
  },
  size: 10,
  from: 10,
  sort: {
    field
  }
}

// query object
{
  q: query string query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax)
  match_all:
  match: ...
}
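
If we mostly pass the query down rather than parse it, the standard query could be translated into an Elasticsearch request body along these lines. This is a sketch under assumptions: `q` goes into a `query_string` query, an empty `q` falls back to `match_all`, and the function name `toElasticsearchBody` is invented for illustration.

```javascript
// Hypothetical conversion: standard query -> Elasticsearch search request body.
function toElasticsearchBody(query) {
  const body = {
    // Pass q straight down as a query string query; match everything if empty.
    query: query.q ? { query_string: { query: query.q } } : { match_all: {} },
  };
  if (query.size !== undefined) body.size = query.size;
  if (query.from !== undefined) body.from = query.from;
  if (query.sort) {
    // { title: "asc" } -> [{ title: "asc" }], the array form ES accepts.
    body.sort = Object.entries(query.sort).map(([field, direction]) => ({
      [field]: direction,
    }));
  }
  return body;
}
```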

Results

// core (without all surrounding metadata)
{
  // results
  "hits" : [
    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "0",
      "_score": 1.3862944,

      // actual document
      "_source" : {
          "user" : "kimchy",
          "date" : "2009-11-15T14:12:12",
          "message" : "trying out Elasticsearch",
          "likes": 0
      }
    }
  ],

  // summary
  "total" : {
    "value": 1,
    "relation": "eq"
  },
  "max_score": 1.3862944,
}

// full version
{
    "timed_out": false,
    "took": 62,
    "_shards":{
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits":{
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.3862944,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "_doc",
                "_id" : "0",
                "_score": 1.3862944,
                "_source" : {
                    "user" : "kimchy",
                    "date" : "2009-11-15T14:12:12",
                    "message" : "trying out Elasticsearch",
                    "likes": 0
                }
            }
        ]
    }
}
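
Going the other way, the full response above could be reduced to our standard result by taking each hit's `_source` and the total. A minimal sketch; the function name `normalizeElasticsearch` is an assumption, and it accepts both the `{ value, relation }` total shown above and a plain numeric total.

```javascript
// Hypothetical conversion: full ES search response -> standard result.
function normalizeElasticsearch(response) {
  const hits = (response && response.hits) || {};
  let count = 0;
  if (typeof hits.total === "number") {
    count = hits.total; // plain numeric total
  } else if (hits.total) {
    count = hits.total.value; // { value, relation } total, as shown above
  }
  return {
    // Keep only the actual documents, dropping _index, _score, etc.
    results: (hits.hits || []).map((hit) => hit._source),
    count,
  };
}
```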

DataHub

What search results does DataHub currently expect? They come from MetaStore, which is a thin wrapper around Elasticsearch.

Frontend code that calls the search API (the current DataHub code is somewhat mixed between the "logic" layer and the "controller"):
- https://github.com/datopian/frontend/blob/master/lib/index.js#L146-L149
- https://github.com/datopian/frontend/blob/master/routes/index.js#L1187-L1194
Search API documented - https://github.com/datopian/metastore#api

const result = api.search(query, token)

// query is
query = {
  q: "...", // match-all query string; searches the following properties: title, datahub.owner, datahub.ownerid, datapackage.readme
  size: 10,
  from: 20
}

// 'token' is passed in the headers so that users can get their own unlisted and private datasets
token = "123" 

// result looks like 

{
  "results": [
    ... // list of data packages
  ],
  "summary": {
    "total": 1109,
    "totalBytes": 61093521506
  }
}
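
Mapping this MetaStore response onto the standard result proposed earlier is mostly a rename of `summary.total` to `count`. A small sketch, assuming that standard result shape; the function name `normalizeDataHub` is made up, and the facet fields are left empty because the response above carries no facet data.

```javascript
// Hypothetical conversion: MetaStore (DataHub) response -> standard result.
function normalizeDataHub(response) {
  return {
    results: response.results || [],
    count: response.summary ? response.summary.total : 0,
    // The MetaStore response shown above has no facets.
    facets: {},
    search_facets: {},
  };
}
```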

CKAN

Routes

/search
/search?q=...&...

https://docs.ckan.org/en/2.8/api/index.html#ckan.logic.action.get.package_search

code: https://github.com/ckan/ckan/blob/master/ckan/logic/action/get.py#L1699
https://demo.ckan.org/api/3/action/package_search?q=spending
https://demo.ckan.org/api/3/action/resource_search?query=name:District%20Names
Search query is a q parameter in the URL

package_search(context, data_dict)

// Query is in data_dict and specified here
// https://docs.ckan.org/en/2.8/api/index.html#ckan.logic.action.get.package_search
{
  q: (string) – the solr query. Optional. Default: "*:*"
  fq: (string) – any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.

  // facets
  facet: (string) – whether to enable faceted results. Default: True.
  facet.mincount: (int) – the minimum count a facet value must have to be included in the results.
  facet.limit: (int) – the maximum number of values the facet fields return. A negative value means unlimited. This can be set instance-wide with the search.facets.limit config option. Default is 50.
  facet.field: (list of strings) – the fields to facet upon. Default empty. If empty, then the returned facet information is empty.

  // standard
  rows: (int) – the number of matching rows to return. There is a hard limit of 1000 datasets per query.
  start: (int) – the offset in the complete result for where the set of returned datasets should begin.
  sort: (string) – sorting of the search results. Optional. Default: 'relevance asc, metadata_modified desc'. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.

  // data catalog specific / business logic
  include_drafts: (bool) – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
  include_private: (bool) – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False.
  use_default_schema: (bool) – use default package schema instead of a custom schema defined with an IDatasetForm plugin (default: False)
}

// Result

{
  "result": { // real result 
    "count": 1,
    "sort": "score desc, metadata_modified desc",
    "facets": {},
    "results": [
      ...
    ]
  },

  // std api wrapper stuff
  "help": "https://demo.ckan.org/api/3/action/help_show?name=package_search",
  "success": true,
}
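
The `normalizeCkan()` and `ckanPackage2DataPackage` tasks could start from something like the sketch below. It unwraps the standard API envelope and maps each CKAN package to a Data Package. The field mapping (`notes` to `description`, resource `url` to `path`) is a deliberately minimal assumption, not an agreed specification, and the error handling is illustrative only.

```javascript
// Assumed minimal mapping from a CKAN package dict to a Data Package.
function ckanPackage2DataPackage(ckanPackage) {
  return {
    name: ckanPackage.name,
    title: ckanPackage.title,
    description: ckanPackage.notes,
    resources: (ckanPackage.resources || []).map((resource) => ({
      name: resource.name,
      path: resource.url,
      format: resource.format,
    })),
  };
}

// Unwrap the CKAN API envelope and return the standard result.
function normalizeCkan(response) {
  if (!response || !response.success) {
    throw new Error("CKAN search request failed");
  }
  const { count, sort, facets, search_facets, results } = response.result;
  return {
    results: (results || []).map(ckanPackage2DataPackage),
    count,
    sort,
    facets: facets || {},
    search_facets: search_facets || {},
  };
}
```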

datopian / frontend-v2

Search page #135