Krymnos / IDP

University project

Try a NoSQL DB for saving both dependencies and context together #14

Closed Krymnos closed 6 years ago

Krymnos commented 6 years ago

compare issue #13

darshan0071990 commented 6 years ago

Keeping the data structure in mind, along with the need for a high data-ingestion rate and complex queries, I researched multiple NoSQL databases and found Cassandra to be the best fit for our scenario. I will implement the database and would like to know how the pipeline is going to communicate with the DB.

darshan0071990 commented 6 years ago

Cassandra is an open-source NoSQL data store intended for clusters of servers (i.e., nodes). It employs a peer-to-peer design, can handle massive data sizes, and scales out to large clusters. Cassandra offers continuous availability, high scalability and performance, strong security, and operational simplicity. Its architecture is decentralized: any node can perform any operation. In terms of the CAP theorem, it provides AP (Availability and Partition tolerance). It has excellent single-row read performance, as long as eventual-consistency semantics are sufficient for our data model, which consists of [sensor_id + dependencies + context]. It is well suited to single-row queries, or to selecting multiple rows based on a column-value index.

Data Model. Row-oriented Internal Operations: Data operations such as reading, writing, compaction, and partitioning are performed at the row level, i.e., the data-item level. Cassandra encodes all row-level operations internally in a unified form called a Row Mutation.

Cassandra Hierarchy Organization: Rows are grouped into column families (i.e., tables) such that the rows within a column family are identified by primary keys (in our case, a hash of [sensor_id + timestamp]). A keyspace (or schema) is a logical grouping of column families, specifiable by a user. In our case, the tables are grouped under a keyspace named "tracingDB".
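To make the hierarchy concrete, here is a minimal in-memory sketch in Python of keyspace, column family, and rows keyed by a hash of sensor_id and timestamp. It only illustrates the nesting and is not how Cassandra stores data. The SHA-1 hash, the ":" separator, the `row_key` and `insert` helpers, and the sample values are all assumptions for illustration; the thread only says "hash". The table name `idp` is borrowed from the table definition later in this thread.

```python
import hashlib

def row_key(sensor_id: str, timestamp: int) -> str:
    """Derive a row key by hashing sensor_id and timestamp together."""
    # SHA-1 and the ':' separator are assumptions; the thread only says "hash".
    return hashlib.sha1(f"{sensor_id}:{timestamp}".encode("utf-8")).hexdigest()

# keyspace -> column family (table) -> primary key -> row (column name -> value)
keyspaces = {"tracingDB": {"idp": {}}}

def insert(keyspace: str, table: str, sensor_id: str, timestamp: int,
           dependencies: list, context: dict) -> str:
    """Store one row under its hashed primary key and return that key."""
    key = row_key(sensor_id, timestamp)
    keyspaces[keyspace][table][key] = {
        "sensor_id": sensor_id,
        "timestamp": timestamp,
        "dependencies": dependencies,
        "context": context,
    }
    return key

# Hypothetical sample row: one sensor reading with one dependency.
key = insert("tracingDB", "idp", "sensor-1", 1500000000,
             ["sensor-0"], {"location": "lab"})
print(key, keyspaces["tracingDB"]["idp"][key]["dependencies"])
```

Because the key is a pure function of sensor_id and timestamp, writing the same pair twice overwrites the row instead of duplicating it, which matches the upsert behaviour one would expect from a key-addressed store.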

Cassandra Query Language (CQL): Cassandra provides a SQL-like (but not SQL-complete) query language called CQL [9]. CQL commands include data definition queries (e.g., create table), data manipulation queries (e.g., insert and select for rows), and basic authentication queries to create database users and grant them permissions.

Partitioning vs. clustering keys: Within a column family, the primary key of each row is divided into two parts. The first part is the partitioning key, which Cassandra uses to place the row at a server in the cluster. Replica servers for that key are then assigned clockwise, starting from that point in a virtual ring. The second part of the primary key is the clustering key, which Cassandra uses to cluster the non-key columns into a single storage block associated with the partitioning key.

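-- Example table: (k_part_one, k_part_two) is the composite partition key;
-- k_clust_one, k_clust_two, and k_clust_three are the clustering columns.
CREATE TABLE idp (
    k_part_one text,
    k_part_two int,
    k_clust_one text,
    k_clust_two int,
    k_clust_three uuid,
    data text,
    PRIMARY KEY ((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);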

The partition key is responsible for distributing data across your nodes. The clustering key is responsible for sorting data within a partition. In a table with a single-field key, the primary key is equivalent to the partition key. A composite (compound) key is simply a key made up of multiple columns.
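The division of labour between the two key parts can be sketched with a toy ring in Python. This is a simplified model, not Cassandra's implementation: the MD5-based `token` function, the one-token-per-node `Ring` class, the node names, and the sample rows are all assumptions for illustration, and the sample clustering keys use two columns rather than three to keep it short.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a string to a ring position (MD5 here is an assumption, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=1):
        self.replication_factor = replication_factor
        # One token per node, sorted so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key) -> list:
        """First node at or after the key's token, then the next nodes clockwise."""
        key_token = token("|".join(map(str, partition_key)))
        start = bisect.bisect_left(self.tokens, key_token) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)

# Rows as (partition key, clustering key, data), loosely mirroring the idp table.
rows = [
    (("sensor-1", 2018), ("temp", 2), "b"),
    (("sensor-1", 2018), ("temp", 1), "a"),
    (("sensor-2", 2018), ("hum", 1), "c"),
]

partitions = {}
for part_key, clust_key, data in rows:
    partitions.setdefault(part_key, []).append((clust_key, data))

for part_key, cells in partitions.items():
    cells.sort()  # the clustering key decides the order inside a partition
    print(part_key, "->", ring.replicas(part_key), cells)
```

All rows sharing a partition key land on the same replicas, so a query that fixes the partition key touches one replica set and reads its rows in clustering order.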

Key characteristics:
· High availability
· Incremental scalability
· Eventually consistent
· Trade-offs between consistency and latency
· Minimal administration
· No SPOF (single point of failure): all nodes are the same in Cassandra
· AP on CAP

Good for:
· Simple setup, maintenance code
· Fast random read/write
· Flexible parsing/wide column requirement
· No need for multiple secondary indexes

Not good for:
· Secondary indexes
· Relational data
· Transactional operations (rollback, commit)
· Primary and financial records
· Stringent authorization requirements on data
· Dynamic queries/searching on column data
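The consistency/latency trade-off listed above can be illustrated with the usual replica-overlap rule: if the number of replicas that must acknowledge a read plus the number that must acknowledge a write exceeds the replication factor, every read overlaps the latest write. A minimal Python sketch follows; the function names and the replication factor of 3 are assumptions for illustration and are not taken from our setup.

```python
def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return read_acks + write_acks > replication_factor

def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

n = 3  # hypothetical replication factor
one_one = reads_see_latest_write(n, write_acks=1, read_acks=1)
quorum_quorum = reads_see_latest_write(n, write_acks=quorum(n), read_acks=quorum(n))
print("one/one overlap:", one_one)
print("quorum/quorum overlap:", quorum_quorum)
```

Waiting for fewer acknowledgements lowers latency but allows stale reads; waiting for a quorum on both sides costs latency but guarantees overlap.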

[Cassandra vs HBase vs MongoDB](https://www.linkedin.com/pulse/real-comparison-nosql-databases-hbase-cassandra-mongodb-sahu/) [Apache Cassandra](http://cassandra.apache.org/)

darshan0071990 commented 6 years ago

Benchmark (Partial)

Benchmarked Environment:

  1. Cassandra 3.0.4

  2. Docker environment with 2 GB of memory allocated.

  3. System under test: a. Core i5, 5th generation; b. 8 GB RAM.

  4. Records inserted: 999999.

  5. Cassandra configuration: replication strategy = SimpleStrategy, replication factor = 1.

  6. Bulk import from CSV into the database: overall processing time = 227.710 s, processed rows = 999999, average rate = 4402 rows/s.

  7. Retrieval of 1000 records: 0.869 ms.
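As a sanity check on the bulk-import figures, the throughput can be recomputed from the row count and the total time. The short Python sketch below uses only the two numbers reported above; the variable names are illustrative.

```python
rows = 999_999      # processed rows, as reported
seconds = 227.710   # overall processing time, as reported

rows_per_second = rows / seconds
ms_per_row = seconds / rows * 1000

print(f"{rows_per_second:.0f} rows/s, {ms_per_row:.3f} ms per row")
```

The plain ratio comes out slightly below the reported average of 4402 rows/s. The gap is well under one percent and presumably comes from how the import tool averages its rate, which is an assumption on my part.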

darshan0071990 commented 6 years ago

https://app.zenhub.com/files/108867849/b77de79f-2a36-4ed9-becd-e3fc2e0f4f7c/download

https://docs.google.com/document/d/1dRDRi-on9WhYKdwnFC5NHVtAuwQCRz7RiFaHqWbPa-g/edit?usp=sharing