datopian / data-api

Next generation Data API for data management systems including CKAN.
https://tech.datopian.com/data-api/
MIT License

Basic API wrapper around Hasura with endpoints and CKAN datastore API #7

Closed rufuspollock closed 3 years ago

rufuspollock commented 4 years ago

Create a basic wrapper service around Hasura in NodeJS that provides CKAN datastore API style features.

Acceptance

We have an API like this, where dapi-id is probably the table or view name ...

Acceptance criteria

Tasks

GraphQL endpoint

datastore_search as in the Datastore extension

This endpoint's response format is incompatible with GraphQL, so it will need its own implementation. Also, some features might be tricky to implement, e.g. "Setting the plain flag to false enables the entire PostgreSQL full text search query language" involves setting up Postgres indexes.

Other activities

Return format: CSV, dict, JSON, ... Options for return format selection:

Analysis

CKAN - New API Differences and Analysis

Here we document the differences between this new API and the CKAN version.

By default, all parameters and results of the new API are JSON. This differs from CKAN, where the GET call parameters are non-standard and follow an ad-hoc syntax.

General Differences CKAN vs New API

The new API is entirely based on a GraphQL querying system, which changes the ways in which we can make queries.

CKAN targets a PostgreSQL DB; with a GraphQL layer in between, the queries that can be implemented are quite different.

GraphQL queries need to be explicit, down to the fields ("columns" in a traditional DB), which means knowing the database and "table" schema before making the query in order to ask for the needed fields. This means that, for every query, the wrapper needs to know the table schema in advance.
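
For instance, the table schema can be discovered at runtime with a standard GraphQL introspection query. A minimal sketch, using the test_table name from the example at the end of this analysis:

```graphql
# Standard GraphQL introspection: list the fields ("columns") of the
# test_table type so a later data query can name them explicitly.
query {
  __type(name: "test_table") {
    fields {
      name
      type {
        name
      }
    }
  }
}
```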

GET Parameters

This section discusses each CKAN parameter and its implementation (or not) in the new API.

Parameters:

resource_id (string) – id or alias of the resource to be searched against. Mandatory parameter, implemented

filters (dictionary) – matching conditions to select, e.g. {“key1”: “a”, “key2”: “b”} (optional). Optional parameter; use the q query instead in the New API.

q (string or dictionary) – full text query. If it’s a string, it’ll search on all fields on each row. If it’s a dictionary as {“key1”: “a”, “key2”: “b”}, it’ll search on each specific field (optional)

This field is different in the new API; the main difference is that the new API only receives JSON. The current New API implementation is equivalent to filters.
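
A sketch of how a dictionary-style q such as {“key1”: “a”, “key2”: “b”} could translate to a Hasura filter (table and field names are illustrative, not part of the current implementation):

```graphql
# Each key/value pair of the q dictionary becomes an _eq condition
# inside Hasura's where argument.
query {
  test_table(where: { key1: { _eq: "a" }, key2: { _eq: "b" } }) {
    _id
    key1
    key2
  }
}
```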

distinct (bool) – return only distinct rows (optional, default: false)

This parameter is vastly different from the previous CKAN implementation. In GraphQL there is no notion of a row (graphs simply do not have rows), which means that the distinct option does not mean the same thing in the new API, and also that we can implement a new idea.

Due to these differences, and to make it evident that the API is not the same, the new implementation is called distinct_on and takes a list of fields to test for differences, so that it can check for distinctness on each field.

The New API also implements (for backwards compatibility) a boolean value; in that case it will query the graph schema and ask for distinct values on every field of the schema, which is equivalent to the CKAN distinct implementation.
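
A minimal sketch of what the distinct_on variant could look like against Hasura (field names are illustrative):

```graphql
# Hasura's distinct_on mirrors PostgreSQL's DISTINCT ON: rows are
# deduplicated on the listed columns only.
query {
  test_table(distinct_on: [a, b]) {
    _id
    a
    b
  }
}
```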

plain (bool) – treat as plain text query (optional, default: true)

language (string) – language of the full text query (optional, default: english)

As discussed in this GitHub issue, full text search will not be implemented.

limit (int) – maximum number of rows to return (optional, default: 100, unless set in the site’s configuration ckan.datastore.search.rows_default, upper limit: 32000 unless set in site’s configuration ckan.datastore.search.rows_max)

There is no difference in the implementation of this parameter.

offset (int) – offset this number of rows (optional)

This parameter, which implies pagination of the response, has not been completely implemented in the new API. The reason is that in GraphQL there are two different ways of implementing pagination (offset-based and cursor-based), and this needs to be analyzed more in depth.
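
For reference, the offset-based variant is straightforward in Hasura (a sketch; cursor-based pagination would instead combine order_by with a where condition on a key):

```graphql
# Offset-based pagination: skip the first 200 rows and return up to
# the next 100, matching CKAN's limit/offset semantics.
query {
  test_table(limit: 100, offset: 200) {
    _id
    a
  }
}
```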

fields (list or comma separated string) – fields to return (optional, default: all fields in original order)

List of fields to return to the caller. The only difference is that in the New API the input is a list (JSON).

sort (string) – comma separated field names with ordering e.g.: “fieldname1, fieldname2 desc”

Not included yet in the New API. The only difference is that it will be implemented with a JSON input instead of comma-separated field names. The input parameter will have the following format:

{ fieldname1: order, fieldname2: order} where order = [asc|desc]
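
Against Hasura this would map naturally onto the order_by argument (a sketch; field names are illustrative):

```graphql
# Sort ascending on fieldname1, then descending on fieldname2,
# the equivalent of CKAN's "fieldname1, fieldname2 desc".
query {
  test_table(order_by: { fieldname1: asc, fieldname2: desc }) {
    _id
    fieldname1
    fieldname2
  }
}
```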

include_total (bool) – True to return total matching record count (optional, default: true)

Not implemented in the New API. This is CPU intensive and needs either an extra query or post-processing of the result to count the number of resulting elements.
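
If it were implemented, the extra query could use Hasura's aggregate fields. A sketch, assuming the resource is exposed as test_table:

```graphql
# Hasura aggregate query: counts the matching rows. This is the
# extra round trip that include_total would require.
query {
  test_table_aggregate {
    aggregate {
      count
    }
  }
}
```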

total_estimation_threshold (int or None) – If “include_total” is True and “total_estimation_threshold” is not None and the estimated total (matching record count) is above the “total_estimation_threshold” then this datastore_search will return an estimate of the total, rather than a precise one. This is often good enough, and saves computationally expensive row counting for larger results (e.g. >100000 rows). The estimated total comes from the PostgreSQL table statistics, generated when Express Loader or DataPusher finishes a load, or by autovacuum. NB Currently estimation can’t be done if the user specifies ‘filters’ or ‘distinct’ options. (optional, default: None)

This is not feasible with the current GraphQL DB except for the global statistics on the schema. To be able to estimate the count, we would need some other aggregation statistics on the different kinds of queries.

records_format (controlled list) – the format for the records return value: ‘objects’ (default) list of {fieldname1: value1, …} dicts, ‘lists’ list of [value1, value2, …] lists, ‘csv’ string containing comma-separated values with no header, ‘tsv’ string containing tab-separated values with no header

The result format in the New API is JSON; in the future, if needed, other return formats can be implemented.

Results:

The result returned to the caller is a JSON containing the following:

```
{
    schema: {JSON schema definition},
    data: [JSON list of elements]
}
```

The following list shows the CKAN return elements and the implementation in the New API

Response Size

Depending on the response size, the data transfer can have adverse network effects or overload the server connections. This is why asynchronous and streaming response data will be needed.

The fact that the current implementation responds in JSON by default adds a data overhead that is simply not there in formats like CSV, which do not repeat the field names for every response element.

Another option for large responses is to return an SFTP URL (or another secure file transfer protocol) instead. This URL will point to a file with the response data once the operation is complete. This way the connection can be freed and the client can poll for the file.

This solution has two extra advantages:

Security

Security needs to be implemented; one option is JWT, which allows for signed requests in the GET query.

JWT has the advantage of already being compatible with the current JSON parameter implementation in the New API.

Example 1

Input

/?resource_id=test_table

The test table is a table like this:

| _id | a | b | c |
| --- | --- | --- | --- |

Incoming query in structured format

```
{
  resource_id: test_table
}
```

Intermediate structure "JSON GraphQL"

```
{
  table: test_table
  fields: [...]
}
```

Final GraphQL query to send to Hasura

```graphql
{
  test_table {
    _id
    a
    b
    c
  }
}
```

Returns from GraphQL
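
A standard GraphQL response wraps the rows in a data envelope keyed by the table name (a sketch; the row values are illustrative):

```json
{
  "data": {
    "test_table": [
      { "_id": 1, "a": "a1", "b": "b1", "c": "c1" }
    ]
  }
}
```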

Returns to user
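
Following the result shape described in the Analysis above, the wrapper could then answer with something like this (a sketch; the schema body and row values are illustrative):

```json
{
  "schema": { "fields": ["_id", "a", "b", "c"] },
  "data": [
    { "_id": 1, "a": "a1", "b": "b1", "c": "c1" }
  ]
}
```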

EvgeniiaVak commented 3 years ago

@rufuspollock Reimplementing datastore_search does not seem to be a simple task:

> some features might be tricky to implement, e.g. "Setting the plain flag to false enables the entire PostgreSQL full text search query language" involves setting up Postgres indexes.

Do we want to do that?

rufuspollock commented 3 years ago

@EvgeniiaVak I would say the approach here is to support the API as much as is reasonable and note where we deviate. Here I think the full text indexing was a mistake anyway, so not supporting it is a good idea ...

I'd really be focusing on the basic part of datastore_search, which is the ability to do a free query q or filter by field. Frankly, I would be doing the minimum here and flagging where we therefore deviate from the pre-existing setup.

leomrocha commented 3 years ago

@rufuspollock @EvgeniiaVak there is another reason which makes full text search an issue in most databases (apart from the search performance and indexing part). The full text search algorithms behind the scenes are mostly designed for the English language (PostgreSQL being one example); even if there are a few algorithms available for other languages (such as French), non-English support is not good. IMHO full text search should only be done if absolutely needed, and with a proper backend (Solr, Elasticsearch).

EvgeniiaVak commented 3 years ago

@leomrocha so we decided not to implement the q request parameter at all (because it uses the full-text search functionality), but maybe that's not quite right, because judging by the name datastore_search it is meant for finding a resource. And also Rufus' comment was:

> I'd really be focusing on the basic part of datastore_search, which is the ability to do a free query q or filter by field.

So we really need this q; let's just implement some simpler search, e.g. with https://lunrjs.com/ over just dataset names for now.

EvgeniiaVak commented 3 years ago

@leomrocha Although it's a bit strange that the current datastore_search returns both the resource description and the data at the same endpoint. As a client app, I would first find the needed resource and only then request its data by id (both of these actions are supported right now), but I would expect getting the data to be a different endpoint 🤔

or by using ilike from hasura https://hasura.io/blog/full-text-search-with-hasura-graphql-api-postgres/
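
For reference, the ilike approach from that post boils down to a filter like this (a sketch; the searched field is illustrative):

```graphql
# Hasura's _ilike operator: case-insensitive substring match, a
# lightweight stand-in for full text search.
query {
  test_table(where: { b: { _ilike: "%search term%" } }) {
    _id
    b
  }
}
```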

leomrocha commented 3 years ago

@EvgeniiaVak

> @leomrocha Although it's a bit strange that the current datastore_search returns both the resource description and the data at the same endpoint.

current as in current CKAN or current as the code in this repo?

> As a client app, I would first find the needed resource and only then request its data by id (both of these actions are supported right now), but I would expect getting the data to be a different endpoint

It makes sense

> or by using ilike from hasura https://hasura.io/blog/full-text-search-with-hasura-graphql-api-postgres/

I don't see the relation with the resource description here. This is full text search, if I understand correctly.

leomrocha commented 3 years ago

@rufuspollock

With @EvgeniiaVak we are discussing how to implement the endpoints in a clear way. We are thinking of the following endpoints that might be of use, but we don't have full text search. If we understand correctly, the q search would be a free search or a filter search:

/datastore_search/?resource_id
/datastore_search/help -> by default if the query is not valid
/list_resources

For the filter search we'll change the format to JSON in the request query, which will be converted internally to an SQL or GraphQL query (most probably we'll start with GraphQL and do SQL only if needed).

/datastore_search/?resource_id=...&q=....

@rufuspollock for the filter case we would filter by one single resource, but is a more general filtering across several resources needed instead?

rufuspollock commented 3 years ago

AGREED next actions @leomrocha

leomrocha commented 3 years ago

@rufuspollock Is the analysis OK for you, or do you also want the examples here in the analysis? As a note, there are examples in the README.md file with screenshots of the responses from a local server.

If that's OK with you, I'll take the Examples section out of this issue's Analysis section.

leomrocha commented 3 years ago

@EvgeniiaVak @rufuspollock Should we close this issue now? CI should be issue #4, and the Analysis section needs @rufuspollock's OK.

EvgeniiaVak commented 3 years ago

FIXED. See https://github.com/datopian/data-api/pull/13 and https://github.com/datopian/data-api/pull/16