datopian / data-api

Next generation Data API for data management systems including CKAN.
https://tech.datopian.com/data-api/
MIT License

Basic API wrapper around Hasura with endpoints and CKAN datastore API #7

Closed rufuspollock closed 3 years ago

rufuspollock commented 4 years ago

Create a basic wrapper service around Hasura in NodeJS that provides CKAN datastore API style features.

Acceptance

We have an API like this, where dapi-id is probably the table or view name ...

Acceptance criteria

Tasks

GraphQL endpoint

datastore_search as in the Datastore extension

This endpoint's response format is incompatible with GraphQL, so it will need its own implementation. Also, some features might be tricky to implement, e.g. "Setting the plain flag to false enables the entire PostgreSQL full text search query language" involves setting up Postgres indexes.

Other activities

Return format: CSV, dict, JSON, ... Options for return format selection:

Analysis

CKAN - New API Differences and Analysis

Here we document the differences between this new API and the CKAN version.

By default, all parameters and results of the new API are JSON. This differs from CKAN, where the GET call parameters are non-standard and follow an ad-hoc syntax.

General Differences CKAN vs New API

The new API is entirely based on a GraphQL querying system, which changes the ways in which we can make queries.

CKAN targets a PostgreSQL DB; with a GraphQL layer in between, the queries that can be implemented are quite different.

GraphQL queries need to be explicit, down to the fields ("columns" in a traditional DB), which means knowing the database and "table" schema before making the query in order to ask for the needed fields. This means that, for every query, the wrapper needs to know the table schema in advance.
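
For instance, the table schema can be discovered at runtime with a standard GraphQL introspection query. A minimal sketch, using the test_table name from the example at the end of this analysis:

```graphql
# Standard GraphQL introspection: list the fields ("columns") of the
# test_table type so a later data query can name them explicitly.
query {
  __type(name: "test_table") {
    fields {
      name
      type {
        name
      }
    }
  }
}
```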

GET Parameters

This section discusses each CKAN parameter and its implementation (or not) in the new API.

Parameters:

resource_id (string) – id or alias of the resource to be searched against. Mandatory parameter, implemented

filters (dictionary) – matching conditions to select, e.g. {“key1”: “a”, “key2”: “b”} (optional). Optional parameter; use the q query instead in the New API.

q (string or dictionary) – full text query. If it’s a string, it’ll search on all fields on each row. If it’s a dictionary as {“key1”: “a”, “key2”: “b”}, it’ll search on each specific field (optional)

This field is different in the new API; the main difference is that the new API only receives JSON. The current New API implementation is equivalent to filters.
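
A sketch of how a dictionary-style q such as {“key1”: “a”, “key2”: “b”} could translate to a Hasura filter (table and field names are illustrative, not part of the current implementation):

```graphql
# Each key/value pair of the q dictionary becomes an _eq condition
# inside Hasura's where argument.
query {
  test_table(where: { key1: { _eq: "a" }, key2: { _eq: "b" } }) {
    _id
    key1
    key2
  }
}
```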

distinct (bool) – return only distinct rows (optional, default: false)

This parameter is vastly different from the previous CKAN implementation. In GraphQL there is no notion of a row (graphs simply do not have rows), which means that the distinct option does not mean the same thing in the new API, and also that we can implement a new idea.

Due to these differences, and to make it evident that the API is not the same, the new implementation is called distinct_on and takes a list of fields to test for differences, so that it can check for distinctness on each field.

The New API also implements (for backwards compatibility) a boolean value; in that case it will query the graph schema and ask for distinct values on every field of the schema, which is equivalent to the CKAN distinct implementation.
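
A minimal sketch of what the distinct_on variant could look like against Hasura (field names are illustrative):

```graphql
# Hasura's distinct_on mirrors PostgreSQL's DISTINCT ON: rows are
# deduplicated on the listed columns only.
query {
  test_table(distinct_on: [a, b]) {
    _id
    a
    b
  }
}
```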

plain (bool) – treat as plain text query (optional, default: true)

language (string) – language of the full text query (optional, default: english)

As discussed in this GitHub issue, full text search will not be implemented.

limit (int) – maximum number of rows to return (optional, default: 100, unless set in the site’s configuration ckan.datastore.search.rows_default, upper limit: 32000 unless set in site’s configuration ckan.datastore.search.rows_max)

There is no difference in the implementation of this parameter.

offset (int) – offset this number of rows (optional)

This parameter, which implies pagination of the response, has not been completely implemented in the new API. The reason is that in GraphQL there are two different ways of implementing pagination (offset-based and cursor-based), and this needs to be analyzed more in depth.
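
For reference, the offset-based variant is straightforward in Hasura (a sketch; cursor-based pagination would instead combine order_by with a where condition on a key):

```graphql
# Offset-based pagination: skip the first 200 rows and return up to
# the next 100, matching CKAN's limit/offset semantics.
query {
  test_table(limit: 100, offset: 200) {
    _id
    a
  }
}
```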

fields (list or comma separated string) – fields to return (optional, default: all fields in original order)

List of fields to return to the caller. The only difference is that in the New API the input is a list (JSON).

sort (string) – comma separated field names with ordering e.g.: “fieldname1, fieldname2 desc”

Not included yet in the New API. The only difference is that it will be implemented with a JSON input instead of comma-separated field names. The input parameter will have the following format:

{ fieldname1: order, fieldname2: order} where order = [asc|desc]
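
Against Hasura this would map naturally onto the order_by argument (a sketch; field names are illustrative):

```graphql
# Sort ascending on fieldname1, then descending on fieldname2,
# the equivalent of CKAN's "fieldname1, fieldname2 desc".
query {
  test_table(order_by: { fieldname1: asc, fieldname2: desc }) {
    _id
    fieldname1
    fieldname2
  }
}
```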

include_total (bool) – True to return total matching record count (optional, default: true)

Not implemented in the New API. This is CPU intensive and needs either an extra query or post-processing of the result to count the number of resulting elements.
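
If it were implemented, the extra query could use Hasura's aggregate fields. A sketch, assuming the resource is exposed as test_table:

```graphql
# Hasura aggregate query: counts the matching rows. This is the
# extra round trip that include_total would require.
query {
  test_table_aggregate {
    aggregate {
      count
    }
  }
}
```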

total_estimation_threshold (int or None) – If “include_total” is True and “total_estimation_threshold” is not None and the estimated total (matching record count) is above the “total_estimation_threshold” then this datastore_search will return an estimate of the total, rather than a precise one. This is often good enough, and saves computationally expensive row counting for larger results (e.g. >100000 rows). The estimated total comes from the PostgreSQL table statistics, generated when Express Loader or DataPusher finishes a load, or by autovacuum. NB Currently estimation can’t be done if the user specifies ‘filters’ or ‘distinct’ options. (optional, default: None)

This is not feasible with the current GraphQL DB except for the global statistics on the schema. To be able to estimate the count, we would need some other aggregation statistics on the different kinds of queries.

records_format (controlled list) – the format for the records return value: ‘objects’ (default) list of {fieldname1: value1, …} dicts, ‘lists’ list of [value1, value2, …] lists, ‘csv’ string containing comma-separated values with no header, ‘tsv’ string containing tab-separated values with no header

The result format in the New API is JSON; in the future, if needed, other return formats can be implemented.

Results:

The result returned to the caller is a JSON containing the following:

```
{
    schema: {JSON schema definition},
    data: [JSON list of elements]
}
```

The following list shows the CKAN return elements and the implementation in the New API

Response Size

Depending on the response size, the data transfer can have adverse network effects or overload the server connections. This is why asynchronous and streaming response data will be needed.

The fact that the current implementation responds in JSON by default adds a data overhead that is simply not there in formats like CSV, which do not repeat the field names for every response element.

Another option for large responses is to return an SFTP URL (or another secure file transfer protocol) instead. This URL will point to a file with the response data once the operation is complete. This way the connection can be freed and the client can poll for the file.

This solution has two extra advantages:

Security

Security needs to be implemented; one option is JWT, which allows for signed requests in the GET query.

JWT has the advantage of already being compatible with the current JSON parameter implementation in the New API.

Example 1

Input

/?resource_id=test_table

The test table is a table like this:

| _id | a | b | c |
| --- | --- | --- | --- |

Incoming query in structured format

```
{
  resource_id: test_table
}
```

Intermediate structure "JSON GraphQL"

```
{
  table: test_table
  fields: [...]
}
```

Final GraphQL query to send to Hasura

```graphql
{
  test_table {
    _id
    a
    b
    c
  }
}
```

Returns from GraphQL
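
A standard GraphQL response wraps the rows in a data envelope keyed by the table name (a sketch; the row values are illustrative):

```json
{
  "data": {
    "test_table": [
      { "_id": 1, "a": "a1", "b": "b1", "c": "c1" }
    ]
  }
}
```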

Returns to user
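
Following the result shape described in the Analysis above, the wrapper could then answer with something like this (a sketch; the schema body and row values are illustrative):

```json
{
  "schema": { "fields": ["_id", "a", "b", "c"] },
  "data": [
    { "_id": 1, "a": "a1", "b": "b1", "c": "c1" }
  ]
}
```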

EvgeniiaVak commented 3 years ago

@rufuspollock Reimplementing datastore_search does not seem to be a simple task:

> some features might be tricky to implement, e.g. "Setting the plain flag to false enables the entire PostgreSQL full text search query language" involves setting up Postgres indexes.

Do we want to do that?

rufuspollock commented 3 years ago

@EvgeniiaVak I would say the approach here is to support the API as much as is reasonable and note where we deviate. Here I think the full text indexing was a mistake anyway, so not supporting it is a good idea ...

I'd really be focusing on the basic part of datastore_search, which is the ability to do a free query q or filter by field. Frankly, I would be doing the minimum here and flagging where we therefore deviate from the pre-existing setup.

leomrocha commented 3 years ago

@rufuspollock @EvgeniiaVak there is another reason which makes full text search an issue in most databases (apart from the search performance and indexing part). The full text search algorithms behind the scenes are mostly designed for the English language (PostgreSQL being one example); even if there are a few algorithms available for other languages (such as French), non-English support is not good. IMHO full text search should only be done if absolutely needed, and with a proper backend (Solr, Elasticsearch).

EvgeniiaVak commented 3 years ago

@leomrocha so we decided not to implement the q request parameter at all (because it uses the full-text search functionality), but maybe that's not quite right, because judging by the name datastore_search it is meant for finding a resource. And also Rufus' comment was:

> I'd really be focusing on the basic part of datastore_search, which is the ability to do a free query q or filter by field.

So we really need this q; let's just implement some simpler search, e.g. with https://lunrjs.com/ over just dataset names for now.

EvgeniiaVak commented 3 years ago

@leomrocha Although it's a bit strange that the current datastore_search returns both the resource description and the data at the same endpoint. As a client app, I would first find the needed resource and only then request its data by id (both of these actions are supported right now), but I would expect getting the data to be a different endpoint 🤔

or by using ilike from hasura https://hasura.io/blog/full-text-search-with-hasura-graphql-api-postgres/
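
For reference, the ilike approach from that post boils down to a filter like this (a sketch; the searched field is illustrative):

```graphql
# Hasura's _ilike operator: case-insensitive substring match, a
# lightweight stand-in for full text search.
query {
  test_table(where: { b: { _ilike: "%search term%" } }) {
    _id
    b
  }
}
```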

leomrocha commented 3 years ago

@EvgeniiaVak

> @leomrocha Although it's a bit strange that the current datastore_search returns both the resource description and the data at the same endpoint.

current as in current CKAN or current as the code in this repo?

> As a client app, I would first find the needed resource and only then request its data by id (both of these actions are supported right now), but I would expect getting the data to be a different endpoint

It makes sense

> or by using ilike from hasura https://hasura.io/blog/full-text-search-with-hasura-graphql-api-postgres/

I don't see the relation with the resource description here. This is full text search, if I understand correctly.

leomrocha commented 3 years ago

@rufuspollock

With @EvgeniiaVak we are discussing how to implement the endpoints in a clear way. We are thinking of the following endpoints that might be of use, but we don't have full text search. If we understand correctly, the q search would be a free search or a filter search:

/datastore_search/?resource_id
/datastore_search/help -> by default if the query is not valid
/list_resources

For the filter search we'll change the format to JSON in the request query, which will be converted internally to an SQL or GraphQL query (most probably we'll start with GraphQL and do SQL only if needed).

/datastore_search/?resource_id=...&q=....

@rufuspollock for the filter case we would filter by one single resource, but is a more general filtering across several resources needed instead?

rufuspollock commented 3 years ago

AGREED next actions @leomrocha

leomrocha commented 3 years ago

@rufuspollock Is the analysis OK for you, or do you also want the examples here in the analysis? As a note, there are examples in the README.md file with screenshots of the responses from a local server.

If that's OK with you, I'll take the Examples section out of this issue's Analysis section.

leomrocha commented 3 years ago

@EvgeniiaVak @rufuspollock Should we close this issue now? CI should be issue #4, and the Analysis section needs @rufuspollock's OK.

EvgeniiaVak commented 3 years ago

FIXED. See https://github.com/datopian/data-api/pull/13 and https://github.com/datopian/data-api/pull/16