datopian / frontend-v2

CKAN / Data Portal frontend as microservice in pure Javascript (Node).
http://tech.datopian.com/frontend/
MIT License

Search page #135

Open rufuspollock opened 4 years ago

rufuspollock commented 4 years ago

Search support like CKAN Classic in Next Gen frontend.

User Stories

As a User I want to explore and find datasets on the platform so that I can see what is there and find what I want quickly (or know it does not exist)

As a User, I want to browse all datasets available on the data portal so that I can quickly find what I need.

As a User, I want search functionality so that I can type in keywords and get a list of datasets that satisfy my needs.

We need to replicate the CKAN search UI.

Key features

Acceptance criteria

Tasks

Recommendation

search URI API

TODO: map from CKAN to our support

Support q?...

Logic API

api.search(query, context)

// note on query - some ambiguity between this object and how it is serialized "over the wire"
// cf https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html - look at parameters vs the DSL
// here we are a bit more DSL oriented
query = {
  // query string query (passed directly to CKAN, to MetaStore, or to ES if we were using that). [Future] we'd like to follow ES syntax more
  // named q following ES convention for its search API https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
  // worth looking at how this works under the hood ...
  // cf https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
  q: "abc title:jones",

  // filters TODO ...

  // standard
  size: 10,
  from: 10,
  sort: {
    field: direction // placeholder: field name mapped to sort direction
  }
}

// any other context to pass e.g. auth token etc
context = {
  token: ...
}
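The ambiguity noted above, between the query object and how it is serialized "over the wire", could be handled with a pair of small helpers. This is a minimal sketch, not the project's implementation: `queryToSearchParams` and `searchParamsToQuery` are hypothetical names, and serializing `sort` as comma-separated `field:direction` pairs is an assumption made for illustration.

```javascript
// Hypothetical helpers: convert the query object to and from a URI query string.
// Assumption: sort is serialized as comma-separated "field:direction" pairs.
function queryToSearchParams(query) {
  const params = new URLSearchParams()
  if (query.q) params.set('q', query.q)
  if (query.size !== undefined) params.set('size', String(query.size))
  if (query.from !== undefined) params.set('from', String(query.from))
  const sortPairs = Object.entries(query.sort || {})
  if (sortPairs.length > 0) {
    params.set('sort', sortPairs.map(([field, dir]) => `${field}:${dir}`).join(','))
  }
  return params.toString()
}

function searchParamsToQuery(queryString) {
  const params = new URLSearchParams(queryString)
  const query = {}
  if (params.has('q')) query.q = params.get('q')
  if (params.has('size')) query.size = Number(params.get('size'))
  if (params.has('from')) query.from = Number(params.get('from'))
  if (params.has('sort')) {
    query.sort = {}
    for (const pair of params.get('sort').split(',')) {
      const [field, dir] = pair.split(':')
      // Assumption: a missing direction defaults to ascending
      query.sort[field] = dir || 'asc'
    }
  }
  return query
}
```

Keeping both directions next to each other makes it easy to check that a query survives a round trip through the URL unchanged.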

Results:

{
  "results": [
    // list of Data Packages
  ],

  "count": 12,
  "sort": "score desc, metadata_modified desc",
  "facets": {},
  "search_facets": {}
}

Templating

Template gets:

Page

<input name="q" value="{{query.q}}">
for each filter in filter ...

Sequence diagram of Search flow

Below is a very detailed flow of how it is going to work. Assume we're working with a CKAN backend:

sequenceDiagram
    Controller->>Model: Standard Query
    Model-->>Utils: I need to call CKAN/Datahub backend
    Utils-->>Model: CKAN/Datahub Query
    Model-->>CKAN/Datahub: CKAN/Datahub Query
    CKAN/Datahub-->>Model: CKAN/Datahub like result
    Model-->>Utils: I need standard result
    Utils-->>Model: Standard result
    Model->>Controller: Standard result
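The flow in the diagram can be sketched in code roughly as follows. This is an illustration under assumptions, not the actual implementation: `createModel` and `handleSearch` are hypothetical names, and the converters and the backend call are injected as plain functions so that the sketch does not depend on any real CKAN or DataHub client.

```javascript
// Hypothetical sketch of the Controller -> Model -> Utils -> backend flow.
// Converters and the backend call are injected, so none of these names are real APIs.
function createModel({ toBackendQuery, callBackend, toStandardResult }) {
  return {
    async search(query, context) {
      // Utils: standard query -> CKAN/Datahub query
      const backendQuery = toBackendQuery(query)
      // Backend: returns a CKAN/Datahub-like result
      const backendResult = await callBackend(backendQuery, context)
      // Utils: CKAN/Datahub-like result -> standard result
      return toStandardResult(backendResult)
    }
  }
}

// Controller: only ever sees standard queries and standard results.
async function handleSearch(model, query, context) {
  return model.search(query, context)
}
```

Because the backend call is injected, the same model can be exercised against a stub in tests and against a real backend in production.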

Analysis

UI for search

API for search

Query object


// DataHub
{
  q: "...", // match-all query string; searches the following properties: title, datahub.owner, datahub.ownerid, datapackage.readme
  size: 10,
  from: 20
}

// CKAN

{
  q: (string) – the solr query. Optional. Default: "*:*"
  fq: (string) – any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.

  // facets
  facet: (string) – whether to enable faceted results. Default: True.
  facet.mincount: (int) – the minimum counts for facet fields should be included in the results.
  facet.limit: (int) – the maximum number of values the facet fields return. A negative value means unlimited. This can be set instance-wide with the search.facets.limit config option. Default is 50.
  facet.field: (list of strings) – the fields to facet upon. Default empty. If empty, then the returned facet information is empty.

  // standard
  rows: (int) – the number of matching rows to return. There is a hard limit of 1000 datasets per query.
  start: (int) – the offset in the complete result for where the set of returned datasets should begin.
  sort: (string) – sorting of the search results. Optional. Default: 'relevance asc, metadata_modified desc'. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.

  // data catalog specific / business logic
  include_drafts: (bool) – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
  include_private: (bool) – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False.
  use_default_schema: (bool) – use default package schema instead of a custom schema defined with an IDatasetForm plugin (default: False)
}
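Comparing the two query objects suggests a direct mapping from the ES-style fields to CKAN's: `size` to `rows`, `from` to `start`, and the `sort` object to CKAN's comma-separated sort string. A minimal sketch, assuming that mapping; `toCkanQuery` is a hypothetical name, and filters and facets are left out because they are still marked TODO above.

```javascript
// Hypothetical mapping from the standard query object to package_search parameters.
function toCkanQuery(query) {
  const ckanQuery = {}
  if (query.q) ckanQuery.q = query.q
  if (query.size !== undefined) ckanQuery.rows = query.size
  if (query.from !== undefined) ckanQuery.start = query.from
  const sortPairs = Object.entries(query.sort || {})
  if (sortPairs.length > 0) {
    // CKAN takes a comma-separated string of "field direction" pairs
    ckanQuery.sort = sortPairs.map(([field, dir]) => `${field} ${dir}`).join(', ')
  }
  return ckanQuery
}
```

Fields that are not set on the standard query are omitted, so CKAN's own defaults apply.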

Result


// DataHub
"results": [
  ... // list of data packages
],
"summary": {
  "total": 1109,
  "totalBytes": 61093521506
}

// CKAN

"results": [
  ...
],
"count": 1,
"sort": "score desc, metadata_modified desc",
"facets": {},

Situation

Elasticsearch

TODO:

{
  "query": {
    "match_all": {}
  },
  size: 10,
  from: 10,
  sort: {
    field
  }
}

// query object
{
  q: query string query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax)
  match_all:
  match: ...
}
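If Elasticsearch were used directly, the query object could be turned into a search request body along these lines. A sketch only: `toElasticsearchBody` is a hypothetical name, and the choice to fall back to `match_all` when `q` is empty is an assumption.

```javascript
// Hypothetical builder: standard query object -> Elasticsearch search request body.
function toElasticsearchBody(query) {
  const body = {
    // Assumption: fall back to match_all when no query string is given
    query: query.q
      ? { query_string: { query: query.q } }
      : { match_all: {} }
  }
  if (query.size !== undefined) body.size = query.size
  if (query.from !== undefined) body.from = query.from
  const sortPairs = Object.entries(query.sort || {})
  if (sortPairs.length > 0) {
    // Elasticsearch accepts sort as an array of { field: direction } objects
    body.sort = sortPairs.map(([field, dir]) => ({ [field]: dir }))
  }
  return body
}
```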

Results

// core (without all surrounding metadata)
{
  // results
  "hits" : [
    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "0",
      "_score": 1.3862944,

      // actual document
      "_source" : {
          "user" : "kimchy",
          "date" : "2009-11-15T14:12:12",
          "message" : "trying out Elasticsearch",
          "likes": 0
      }
    }
  ],

  // summary
  "total" : {
    "value": 1,
    "relation": "eq"
  },
  "max_score": 1.3862944
}

// full version
{
    "timed_out": false,
    "took": 62,
    "_shards":{
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits":{
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.3862944,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "_doc",
                "_id" : "0",
                "_score": 1.3862944,
                "_source" : {
                    "user" : "kimchy",
                    "date" : "2009-11-15T14:12:12",
                    "message" : "trying out Elasticsearch",
                    "likes": 0
                }
            }
        ]
    }
}
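Going from the full Elasticsearch response above to the simpler result shape used elsewhere in this issue mostly means unwrapping `hits`. A minimal sketch, assuming the standard result is `{ results, count }`; `fromElasticsearchResponse` is a hypothetical name. It also tolerates `hits.total` being a plain number, as older Elasticsearch versions returned.

```javascript
// Hypothetical converter: full Elasticsearch response -> { results, count }.
function fromElasticsearchResponse(response) {
  const total = response.hits.total
  return {
    // Keep only the actual documents, dropping _index, _id, _score, etc.
    results: response.hits.hits.map((hit) => hit._source),
    // total is an object ({ value, relation }) in the response shown above,
    // but a plain number in older Elasticsearch versions
    count: typeof total === 'number' ? total : total.value
  }
}
```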

DataHub

What search results does DataHub currently expect? They come from MetaStore, which is a thin wrapper around Elasticsearch.

const result = api.search(query, token)

// query is
query = {
  q: "...", // match-all query string; searches the following properties: title, datahub.owner, datahub.ownerid, datapackage.readme
  size: 10,
  from: 20
}

// 'token' is passed in the headers so that users can get their own unlisted and private datasets
token = "123"

// result looks like 

{
  "results": [
    ... // list of data packages
  ],
  "summary": {
    "total": 1109,
    "totalBytes": 61093521506
  }
}

CKAN

Routes

/search
/search?q=...&...

https://docs.ckan.org/en/2.8/api/index.html#ckan.logic.action.get.package_search

package_search(context, data_dict)

// Query is in data_dict and specified here
// https://docs.ckan.org/en/2.8/api/index.html#ckan.logic.action.get.package_search
{
  q: (string) – the solr query. Optional. Default: "*:*"
  fq: (string) – any filter queries to apply. Note: +site_id:{ckan_site_id} is added to this string prior to the query being executed.

  // facets
  facet: (string) – whether to enable faceted results. Default: True.
  facet.mincount: (int) – the minimum counts for facet fields should be included in the results.
  facet.limit: (int) – the maximum number of values the facet fields return. A negative value means unlimited. This can be set instance-wide with the search.facets.limit config option. Default is 50.
  facet.field: (list of strings) – the fields to facet upon. Default empty. If empty, then the returned facet information is empty.

  // standard
  rows: (int) – the number of matching rows to return. There is a hard limit of 1000 datasets per query.
  start: (int) – the offset in the complete result for where the set of returned datasets should begin.
  sort: (string) – sorting of the search results. Optional. Default: 'relevance asc, metadata_modified desc'. As per the solr documentation, this is a comma-separated string of field names and sort-orderings.

  // data catalog specific / business logic
  include_drafts: (bool) – if True, draft datasets will be included in the results. A user will only be returned their own draft datasets, and a sysadmin will be returned all draft datasets. Optional, the default is False.
  include_private: (bool) – if True, private datasets will be included in the results. Only private datasets from the user’s organizations will be returned and sysadmins will be returned all private datasets. Optional, the default is False.
  use_default_schema: (bool) – use default package schema instead of a custom schema defined with an IDatasetForm plugin (default: False)
}

// ## Result

{
  "result": { // real result 
    "count": 1,
    "sort": "score desc, metadata_modified desc",
    "facets": {},
    "results": [
      ...
    ]
  },

  // std api wrapper stuff
  "help": "https://demo.ckan.org/api/3/action/help_show?name=package_search",
  "success": true
}
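Since the real result sits inside CKAN's standard API wrapper, a client would typically unwrap it before any further conversion. A small sketch of that step; `unwrapCkanResponse` is a hypothetical name, and turning a `success: false` response into a thrown error is an assumption about how failures should be surfaced.

```javascript
// Hypothetical helper: strip the standard CKAN API wrapper and return the real result.
function unwrapCkanResponse(response) {
  if (!response.success) {
    // CKAN reports failures with success: false and an error object
    const message = response.error ? JSON.stringify(response.error) : 'unknown error'
    throw new Error(`CKAN action failed: ${message}`)
  }
  return response.result
}
```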
rufuspollock commented 4 years ago

@anuveyatsu can you update this, e.g. check off the items that were implemented, and leave a detailed comment here with the status? If it is all done we can close as fixed and link to the docs issue at https://gitlab.com/datopian/tech/tech.datopian.com/issues/5