Translate the SQL data model API to Cassandra (CQL)

tmvoe commented 4 years ago

The following text is from Integrate Cassandra Support into Cenzontle

In order to implement the new storage type cassandra the following steps suggest themselves.

Cassandra Setup / simple migration

First, setup a "sandbox" environment with Docker using the latest stable Cassandra Docker image and add it to our sandbox setup. Write a simple Cassandra Query Language script setup_cassandra_db.js to setup the Cassandra database for your needs, i.e. create a KEYSPACE default and all TABLEs required for your data models. Add an additional table CREATE TABLE db_migrated ( migrated_at timeuuid PRIMARY KEY ) which will hold only a single row if and only if the setup_cassandra_db.cql has been executed, i.e. the last thing the script does is insert a row into the db_migrated table and thus mark the DB as ready for usage. The first thing the setup_cassandra_db.js script does is to check whether the database has been migrated already. If and only if the table db_migrated does not exists and does not contain a valid row, the setup procedure will be executed. Extend the shell script used to start up the backend GraphQL-server to (i) wait for the Cassandra server to be available and (ii) run the migration-script.

Information about how to obtain automatically generated timeuuids can be found here.

Indices and searches

Note that in the context of Cenzontle it is highly recommendable to create indices on all table columns in order to enable exhaustive searches.

Introduce the new storage type `cassandra`

Copy `SQL` data model templates

Make two new storage types available: cassandra data models and cassandra-adapter distributed data model (DDM) adapters. Copy the code-generator templates that are responsible for generating the data model layer modules for SQL data models. Use the document in which the SQL statements are defined that implement the respective data model layer functions and translate them to CQL which is very alike to SQL.

Allow filtering to authorized users

Introduce a new authorization check in all read operations that execute database searches. Currently, only read will be checked on a given data model name or adapter name. Add search to the checks. If the user has search authorization append ALLOW FILTERING to the generated CQL queries.

Introduce a `cassandra` storage-handler

Currently, Cenzontle's models_index.js script initializes the data models, including their connection to relational databases through Sequelize.

Where to store the new `cassandra` models?

Create a new directory models-cassandra in which Node.js modules for data models of the new storage type cassandra will be stored.

Initialize `cassandra` data models and adapters

Provide a JSON configuration file in ./config/cassandra.json that has all data needed by the Node.js Cassandra Driver. Adjust models_index.js to read in this config, initialize the Driver and set it in each cassandra data model module class as a constant property storageHandler. See this post for how to define class constants in ECMA Script 6. Basically use something like the following snippet

// FILE ./models-cassandra/person.js

// The Person data model has storage type `cassandra`
class Person {
  constructor(/* ... */) {
    // ... set the properties ...
  }

  get storageHandler() {
    // defined below by `Object.defineProperty`
    return Person.storageHandler
  }
}

module.exports.getAndConnectDataModelClass( cassandraDriver ) {
  return Object.defineProperty(Person, 'storageHandler', {
      value: cassandraDriver,
      writable : false, // cannot be changed in the future
      enumerable : true,
      configurable : false
  }) 
}

// FILE models_index.js

const cassandraDriver = /* initialize a new Cassandra Driver instance */
// constant Person will be a ECMAScript Class
const Person = require(
    './models-cassandra/person.js'
  ).getAndConnectDataModelClass(cassandraDriver)

The above has the advantage that both on the Person class level as well on the level of Person records (instances) the storage handler is available:

let person = new Person(/* ... */)
person.storageHandler // -> instance of Cassandra-Driver
Person.storageHandler // -> the very same instance of Cassandra-Driver

tmvoe commented 4 years ago

Keyspace default is not available in Cassandra, using sciencedb.

tmvoe commented 4 years ago

Important restrictions of Cassandra compared to SQL (adapted from StackOverflow while removing cases that are no longer valid):

There are no joins or subqueries
There are no transactions
Applying a WHERE condition on a column needs an index on that column - this condition seems to be not enforced if filtering is allowed
There are no OR or NOT logical operators, only AND
Updating a table requires a WHERE clause and the WHERE clause must be based on a (list of) primary key(s)
AND clauses must not use parentheses

The fourth and (to a lesser extent) fifth points seem to be the most severe.

tmvoe commented 4 years ago

Cassandra offers a variant of a secondary index that might offer better performance / more functionality (e.g. the LIKE operator): https://github.com/apache/cassandra/blob/trunk/doc/SASI.md

tmvoe commented 4 years ago

Current state of the issue

As of right now, the following points have been completed:

Indices and searches
Allow filtering to authorized users
Introduce a cassandra storage handler

The point Where to store the new cassandra models? was changed insofar as the directory structure of the models is now different: The cassandra models can now be found in models/cassandra.

Regarding Copy `SQL` data model templates

The templates themselves have been copied, but the transformation is still in progress. Right now there is still a problem with readAllCursor in the DDM case, but that should be tentatively resolved at the end of today. After that, some functionality (e.g. deletion in a loop) still needs to be tested.

Regarding Cassandra setup / simple migration

The keyspace (in this case sciencedb since default seems to not be allowed as a name) and the tables (including db_migrated) are created. To create the keyspace, the following steps were taken:

The docker entrypoint sh file was copied from the Cassandra container and modified according to https://stackoverflow.com/a/42698847
A cql file with the necessary information for creating the keyspace (for now named cassandra-keyspace.cql) is copied by docker-compose to the docker-container in the initialization directory and executed by the aforementioned change in the entrypoint sh.

The tables are created by migration files (see the migration template)

Regarding _Initialize `cassandra` data models and adapters

The storageHandler is created by models/index.js. Unfortunately, this handler is also required to test the connection to Cassandra during the startup of the servers, leading to repeated messages about the models being loaded. Apparently this is only a cosmetic problem, so this still happens. What has not yet been done is providing configuration of the Cassandra driver via a JSON file, so that existing Cassandra servers can not yet be accessed.

Additional information

A bug that was only discovered today seems to require a change of the schema in case of DDM files. The problem is as follows:

In case of DDMs the function readAllCursor (see above) needs to collect records from different cursors and provide a list of entries formed from them. In case of SQL, a custom ordering can be provided, but Cassandra doesn't allow for that unless someone is willing to use up a lot of additional storage space by duplicating the tables for each column that ordering should be possible for. Instead, Cassandra provides its own sorting going by the token (a hash function) of the ID value. To provide proper pagination, the full list of records needs to be ordered in the same way (otherwise, a sorting cursor would have no meaning). The token itself can only be generated by Cassandra itself. For the local adapter, this is no problem, since the token can be queried alongside the model attributes. But the remote adapter can only send a query according to the GraphQL schema. So the token needs to be added to the schema. Unfortunately, the token now looks like just another attribute of the schema, and can also be accessed by the end user (e.g. in GraphiQL). It should be determined if it is possible to avoid this problem.

Zendro-dev / graphql-server