benchflow / data-transformers

Spark scripts utilised to transform data to the BenchFlow internal formats

Cassandra database design #6

Open VincenzoFerme opened 9 years ago

VincenzoFerme commented 9 years ago

Discuss the design of the Cassandra database, similar to what has been done for Minio in #4

Cerfoglg commented 9 years ago

The database schema in Marco Argenti's thesis (figure 5.1) is a good place to start. In that, we have the following tables and columns:

These can be easily written in Cassandra as column families. However, Cassandra is a schemaless, distributed database, which raises a few problems. Let's assume that we want to look up environment data with a certain trial ID. If we were to query Cassandra with something like SELECT * FROM EnvironmentData WHERE TrialID = ***, we may end up having to read several nodes of our Cassandra cluster, which has a significant impact on performance. To counter this, it's a good idea to create inverted indexes mapping important values, such as TrialID, to their associated data (e.g., EnvironmentData). This way we avoid queries that touch more than one node. The inverted indexes we should use are:
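
The idea can be sketched in plain Python (a conceptual illustration only, not BenchFlow code): the inverted index maps a lookup value such as a trial ID back to the row keys that hold its data, so a read can be served from one place instead of scanning the whole cluster.

```python
# Conceptual sketch of an inverted index: map each trial ID to the
# row keys of the data that belongs to it.
def build_inverted_index(rows):
    """rows: iterable of (row_key, trial_id) pairs."""
    index = {}
    for row_key, trial_id in rows:
        index.setdefault(trial_id, []).append(row_key)
    return index

rows = [("env-1", "a6_3"), ("env-2", "a6_3"), ("env-3", "a7_1")]
index = build_inverted_index(rows)
print(index["a6_3"])  # row keys holding environment data for trial a6_3
```

In Cassandra this would materialise as an extra column family keyed by TrialID, written alongside the main table.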

Attached is a quick sketch showing Marco's database schema, plus the additional indexes on the side.

[attachment: sketch of the database schema with the additional indexes]

VincenzoFerme commented 9 years ago

@Cerfoglg thank you for the description.

Some notes and TODOs:

  1. Reference schema - Table: Experiment, Field: TrialID: it is a composite key of ExperimentID and ReplicationNum. It is useful so that we can directly identify data related to a given replication of the same experiment (e.g., run number 3 of experiment a6)
  2. If we use the suggested inverted indexes we can remove some redundant data from some tables (e.g., does it make sense to have the TrialID in EnvironmentData and Process?)
  3. What are the PRIMARY and what are the SECONDARY indexes?
  4. Give also a look at the following reference: https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes

We should also consider when it makes sense to add Materialized View to access data in a more performant way: http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views

Cerfoglg commented 9 years ago

Actually, looking at materialised views, if I'm understanding it correctly, we may just want to use those instead of creating explicit inverted indexes, as they help solve the same issue. Opinions?

VincenzoFerme commented 9 years ago

@Cerfoglg it seems so. Have a look at the following three references, in order to come up with the right approach according to our requirements:

Write down the PROs and CONs of both approaches. First, clearly write down the requirements we have for writing and reading data.

Cerfoglg commented 9 years ago

Attaching here the code to create the Cassandra database. To use this, run cqlsh with option -f to pass this file.

CREATE KEYSPACE benchflow
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE benchflow;

CREATE TABLE environment_data (
  environment_data_id uuid,
  container_properties_id uuid,
  read_time text,
  cpu_percpu_usage list<bigint>,
  cpu_total_usage bigint,
  cpu_percent_usage float,
  cpu_throttling_data map<text, bigint>,
  memory_usage bigint,
  memory_max_usage bigint,
  network_interfaces map<text, uuid>,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), environment_data_id, container_properties_id)
);

CREATE TABLE network_interface_data (
  network_interface_data_id uuid,
  network_rx_dropped bigint,
  network_rx_bytes bigint,
  network_rx_errors bigint,
  network_tx_packets bigint,
  network_tx_dropped bigint,
  network_rx_packets bigint,
  network_tx_errors bigint,
  network_tx_bytes bigint,
  PRIMARY KEY (network_interface_data_id)
);

CREATE TABLE experiment (
  experiment_id text,
  replication_num int,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id))
);

CREATE TABLE process (
  process_instance_id uuid,
  source_process_instance_id text,
  process_definition_id text,
  start_time timestamp,
  duration bigint,
  end_time timestamp,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), process_instance_id)
);

CREATE TABLE construct (
  construct_instance_id uuid,
  source_construct_instance_id text,
  construct_type text,
  construct_name text,
  start_time timestamp,
  duration bigint,
  end_time timestamp,
  process_instance_id uuid,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), construct_instance_id, process_instance_id)
);

CREATE TABLE container_properties (
  container_properties_id uuid,
  container_id text,
  experiment_id text,
  trial_id text,
  host_id uuid,
  environment map<text, text>,
  image text,
  labels list<text>,
  links list<text>,
  log_driver text,
  u_limits map<text, int>,
  volume_driver text,
  volumes_from list<text>,
  cpu_shares int,
  cpu_set_cpus text,
  cpu_set_mems text,
  cpu_quota int,
  cpu_period int,
  blkio_weight int,
  mem_limit text,
  mem_swap_limit text,
  mem_reservation_limit text,
  mem_kernel_limit text,
  memory_swappiness int,
  oom_kill_disable boolean,
  privileged boolean,
  read_only boolean,
  restart text, 
  user text,
  name text,
  network text,
  restart_policy text,
  PRIMARY KEY ((trial_id, experiment_id), container_properties_id, host_id)
);

CREATE TABLE host_properties (
  host_id uuid,
  cpu_cfs_period boolean,
  cpu_cfs_quota boolean,
  debug boolean,
  discovery_backend text,
  docker_root_dir text,
  driver text,
  driver_status list<text>,
  execution_driver text,
  experimental_build boolean,
  http_proxy text,
  https_proxy text,
  ipv4_forwarding boolean,
  index_server_address text,
  init_path text,
  init_sha1 text,
  kernel_version text,
  labels list<text>,
  mem_total bigint,
  memory_limit boolean,
  n_cpu int,
  n_events_listener int,
  n_fd int,
  n_goroutines int,
  name text,
  no_proxy text,
  oom_kill_disable boolean,
  operating_system text,
  swap_limit boolean,
  system_time text,
  server_version text,
  docker_version text,
  docker_os text,
  docker_kernel_version text,
  docker_go_version text,
  docker_git_commit text,
  docker_arch text,
  docker_api_version text,
  docker_experimental boolean,
  alias text,
  external_ip text,
  internal_ip text,
  hostname text, 
  purpose text,
  PRIMARY KEY(host_id)
);
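
Reads should restrict both columns of the composite partition key so they stay on a single partition. A sketch of how a transformer could build such a query (the helper name is hypothetical; %s placeholders follow the Python cassandra-driver parameter style):

```python
# Hypothetical query helper: both partition-key columns (trial_id,
# experiment_id) are restricted, keeping the read on one partition.
def select_by_trial(table, trial_id, experiment_id):
    query = ("SELECT * FROM {} WHERE trial_id = %s AND experiment_id = %s"
             .format(table))
    return query, (trial_id, experiment_id)

query, params = select_by_trial("environment_data", "a6_3", "a6")
print(query)
```

In real code the query would be passed to `session.execute(query, params)` (ideally as a prepared statement) rather than interpolating values into the string.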

Definitions:

Setup:

Cerfoglg commented 8 years ago

@VincenzoFerme I updated the schema a bit. For environmentData I'm using uuid type for ID, because I'm now generating uuid values within the stats transformer script, making sure rows receive unique IDs.

I'm leaving the materialized view definitions, though as of right now I've only used Cassandra 2.2.3 for testing, as Cassandra 3 doesn't currently work with the pyspark cassandra connector.

VincenzoFerme commented 8 years ago

@Cerfoglg Looking at the schema you provided in https://github.com/benchflow/data-transformers/issues/6#issuecomment-160972532, I have the following questions:

  1. In point 1) of https://github.com/benchflow/data-transformers/issues/6#issuecomment-158329810 I state that "TrialID is a composite key of ExperimentID and ReplicationNum. It is useful so that we can directly identify data related to a given replication of the same experiment (e.g., run number 3 of experiment a6).". To do so, we should define a way (read: a function library shared in the common repository) to compute the TrialID given the ExperimentID and ReplicationNum as input, because you need this computed value to store the data in Cassandra. You can refer to Marco's solution for this.
  2. Change the name of ProcessModelID in Process, to ProcessDefinitionID
  3. Change the name of ConstructID to ConstructInstanceID and SourceConstructID to SourceConstructInstanceID in Construct
  4. In the Process and Construct tables, how do you compute ProcessInstanceID and ConstructID respectively?
  5. In Construct, if the ConstructID is unique, why do you define PRIMARY KEY (ConstructID, ProcessInstanceID)? The same applies to the Process table and its primary key. Is it related to the following: http://stackoverflow.com/a/24953331? If yes, why this choice of key? The same discussion applies to the EnvironmentData table
  6. Why is the primary key of the experiment table PRIMARY KEY (ExperimentID, TrialID) and not just TrialID?
  7. EnvironmentData, remove LineNumber int
  8. The EnvironmentData should contain the following information, as described on the Docker Stats API documentation:

    • read
    • networks: for every interface we should store all the provided metrics
    • memory_stats: usage, max_usage
    • cpu_stats: throttling_data, plus the percentage with respect to the total CPU usage (total_usage) and the per-CPU (percpu_usage) usage of the container. You can get some hints on the actual meaning of the data and what they are about at the following link: https://www.datadoghq.com/blog/how-to-collect-docker-metrics/

    The names of the fields in our schema should be meaningful, to simply map each value to the actual metric: e.g., memory_stats -> usage should be called memory_usage

  9. EnvironmentData, EnvironmentID reference: this should reference a table where we store the information about the environment in which the data have been collected. Think of it as the metadata describing a Docker container's properties (e.g., provided CPU, provided RAM, network configuration and so on). For now, let's add a table for this reference to point to, containing the following data (some of them are optional), as described in the Docker Compose file reference:

    and the following ones described on the Docker Run reference:

  10. Materialized views, without the actual common data we are going to use for the analysers, might not be useful. Let's keep this in mind when defining the analysers.

Cerfoglg commented 8 years ago

@VincenzoFerme Alright, I altered the file to fix things based on your suggestions, but there are still a few things left hanging:

  1. If the TrialID is something like, say, a7_3 for ReplicationNumber 3 and ExperimentID a7, then computing it from ReplicationNumber and ExperimentID should be pretty straightforward, right? I'm not sure I understood this correctly.

  2. If they aren't given, then they have to be computed on the fly when storing to Cassandra because, as I've mentioned, Cassandra has no AUTO_INCREMENT option to automatically generate IDs like MySQL does. Right now I use uuids for unique row IDs.

  3. Shouldn't the Experiment table have primary key ExperimentID? TrialID is a composite of ExperimentID and ReplicationID, so it refers to specific runs of an experiment as well.

Also, note that in Cassandra when you write PRIMARY KEY (x, y, z), it doesn't mean that x, y and z together form the primary key: it means that x is the partition key, and y and z are clustering (column ordering) keys. To define a composite partition key in Cassandra you write PRIMARY KEY ((x, y, z)). So in those tables I only had one partition key ^^
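
A minimal sketch of the shared helper discussed in point 1, assuming trial_id is the plain concatenation of ExperimentID and ReplicationNum (the function name is hypothetical), together with the client-side uuid generation that replaces AUTO_INCREMENT:

```python
import uuid

# Hypothetical shared helper: trial_id = "<experiment_id>_<replication_num>"
def compute_trial_id(experiment_id, replication_num):
    return "{}_{}".format(experiment_id, replication_num)

# Cassandra has no AUTO_INCREMENT, so unique row IDs are generated
# client-side in the transformer scripts.
row_id = uuid.uuid4()

print(compute_trial_id("a7", 3))  # a7_3
```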

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the detailed answer and for changing the schema. Some remaining points to discuss about:

  1. You are right, we can keep it simple on Cassandra. The previous solution, using MySQL, had to implement an MD5 hash of the composite key to improve performance. We can start from what you propose (ExperimentID_ReplicationNum), but then we might want to take the MD5 hash of this string to speed up data retrieval. It depends on how Cassandra's performance compares for hash vs. string retrieval, and on the internal optimizations Cassandra applies. Maybe it already uses hashes internally, even if we define strings.

  2. How was it in Marco's thesis? I guess it was an AUTO_INCREMENT. If this is the case, just use uuid for these as well.

  3. Because you can have multiple repetitions of the same ExperimentID, and the PRIMARY KEY must be unique.


  1. For PRIMARY KEYs / Column Ordering (or Clustering) Keys: ok, thank you for the explanation. Can you please write down, for each definition, why we have those columns?

  1. EnvironmentData:

    • how can I store the information for every interface in your definition?
    • cpu_stats: how can I store the information for each CPU? How do you compute the percentage with respect to the total usage?
  2. DockerContainerProperties:

    • Change the name of DockerContainerProperties to ContainerProperties
    • Where is the EnvironmentID we use as a reference in EnvironmentData?
    • What is the datatype of cpu_shares?
    • Why is privileged a text? Shouldn't it be a boolean? The same applies to read_only. Just cross-check the other fields as well.
    • I guess you missed the runtime constraints on the resources. The ones missing from (x)

In general, for all the schema: We should be consistent about the naming of the fields: the underscore_case notation is fine.

Cerfoglg commented 8 years ago

@VincenzoFerme

  1. I think Cassandra does it on its own, so we shouldn't have to worry about hashing ourselves.

  2. He was probably using an auto increment.

  3. I was trying to order by relevant IDs, like Env data is ordered by EnvironmentID and TrialID, Experiment by ExperimentID, Process by TrialID, like so.

  4. Yeah you're right, I'll change that to use TrialID.

  5. The network data is all provided in the Docker stats just like that (rx for receive, tx for transmit). As for CPUs, we discussed this early in the semester: with just the Docker Stats API we can't calculate CPU percentages. The stats only give you a total usage, and then an array representing per-core usage, without telling you which core is which, or what their max would be. I'm afraid we may have to use these data as they are, because the alternative could end up being extremely complex to implement.

Anyways, will update the definitions with these observations.

VincenzoFerme commented 8 years ago

@Cerfoglg

  1. Assign the uuid type to the field, if you are using uuid to guarantee uniqueness.

  2. Ok. Since it matters for performance, just consider that we are going to access the data mostly by Trial_ID and Experiment_ID. This means you can repeat these fields in all the tables (e.g., in Constructs) for performance purposes, and also set up the schema so that retrieval of the data using these two IDs is fast, if this is the purpose of the ordering. We should order by trial_id first because it is more selective.

  3. Why are the last fields of Container_Properties lowercase? Be consistent with naming all over the tables.

  4. DockerContainerProperties:

    • what is the difference between Container_ID and Environment_ID?
    • change the name of Cpu_set to Cpu_Set_Cpus if it corresponds to this value defined by Docker
    • where do you define kernel-memory?

VincenzoFerme commented 8 years ago

@Cerfoglg

8.1 I'm going to answer you in detail in the following.

First, how can I store the information the API returns when there is more than one network interface, in your schema? Example from the Docker documentation:

...
"networks": {
             "eth0": {
                 "rx_bytes": 5338,
                 "rx_dropped": 0,
                 "rx_errors": 0,
                 "rx_packets": 36,
                 "tx_bytes": 648,
                 "tx_dropped": 0,
                 "tx_errors": 0,
                 "tx_packets": 8
             },
             "eth5": {
                 "rx_bytes": 4641,
                 "rx_dropped": 0,
                 "rx_errors": 0,
                 "rx_packets": 26,
                 "tx_bytes": 690,
                 "tx_dropped": 0,
                 "tx_errors": 0,
                 "tx_packets": 9
             }
     }
...

How can I store the following information returned by the Docker API, when I have more than one core for percpu_usage:

...
"cpu_stats" : {
        "cpu_usage" : {
           "percpu_usage" : [
              16970827,
              1839451,
              7107380,
              10571290
           ],
           ...
           "total_usage" : 36488948,
           ...
        },
        "system_cpu_usage" : 20091722000000000,
        "throttling_data" : {}
     }
...
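
One way to fit these nested structures into the schema above (a sketch, assuming `stats` is the parsed Docker Stats JSON): store each interface as its own row in network_interface_data keyed by a client-generated uuid, keep a map from interface name to that uuid in environment_data (the map<text, uuid> column), and store percpu_usage directly as a list<bigint>.

```python
import uuid

# stats: the parsed Docker Stats JSON, trimmed to the fields used here.
stats = {
    "networks": {
        "eth0": {"rx_bytes": 5338, "tx_bytes": 648},
        "eth5": {"rx_bytes": 4641, "tx_bytes": 690},
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [16970827, 1839451, 7107380, 10571290],
            "total_usage": 36488948,
        }
    },
}

network_interfaces = {}  # -> map<text, uuid> column in environment_data
interface_rows = []      # -> one row each for the network_interface_data table
for name, counters in stats["networks"].items():
    iface_id = uuid.uuid4()
    network_interfaces[name] = iface_id
    interface_rows.append(dict(network_interface_data_id=iface_id, **counters))

# Stored as-is in the list<bigint> column; per-core identity is not preserved
# by the API, so the list order is all we have.
cpu_percpu_usage = stats["cpu_stats"]["cpu_usage"]["percpu_usage"]
```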
VincenzoFerme commented 8 years ago

@Cerfoglg

8.2 The Docker Stats API also returns a field named precpu_stats. In this reference you can find a Python implementation relying on the precpu_stats data to compute the CPU usage. The referenced method also solves the problem of normalising the CPU percentage by the number of cores. If it works as expected we should obtain a number in the range [0,100].
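
For reference, the usual delta-based calculation can be sketched as follows (field names are from the Stats API; leaving out the per-core multiplier used elsewhere keeps the result in [0,100], since system_cpu_usage aggregates all cores). The sample values are illustrative, not real measurements.

```python
def cpu_percent(stats):
    # Deltas between the current sample (cpu_stats) and the previous one
    # (precpu_stats), both returned in a single Stats API response.
    cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
                 - stats["precpu_stats"]["cpu_usage"]["total_usage"])
    system_delta = (stats["cpu_stats"]["system_cpu_usage"]
                    - stats["precpu_stats"]["system_cpu_usage"])
    if system_delta <= 0:
        return 0.0
    # system_cpu_usage covers all cores, so the ratio stays within [0,1].
    return 100.0 * cpu_delta / system_delta

sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 38488948},
                  "system_cpu_usage": 20091742000000000},
    "precpu_stats": {"cpu_usage": {"total_usage": 36488948},
                     "system_cpu_usage": 20091722000000000},
}
print(cpu_percent(sample))  # 0.01
```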

VincenzoFerme commented 8 years ago

@Cerfoglg add also a table about Host_Properties, mapped to the Container_Properties table through the current Environment_ID. Then rename it to Host_ID. The table should contain the data obtained from the following two APIs:

Cerfoglg commented 8 years ago

@VincenzoFerme Made changes based on your comments: added the new table for host properties, plus the network interface table for the data related to network interfaces per environment data, and I'm now using lists for percpu and network interfaces in environment data.

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the additions. Some notes:

Cerfoglg commented 8 years ago

@VincenzoFerme Modified the schema again based on your comments, except for these points:

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the update. Some remaining points:

Cerfoglg commented 8 years ago

@VincenzoFerme Fixed the points above, plus I added experiment_id and trial_id to all tables that didn't have it, and used them for column ordering (see PRIMARY KEY definitions)

Cerfoglg commented 8 years ago

@VincenzoFerme One last thing, rearranged the clustering order in the PRIMARY KEY definitions

Cerfoglg commented 8 years ago

@VincenzoFerme I think what we also need is replication_num in tables like process, because if we are trying to analyse a certain rep number, we should probably have it available.

In fact, do we even need trial_id, if we know that it's just the concatenation of experiment_id and replication_id? And do we need the Experiment table for that matter?

VincenzoFerme commented 8 years ago

@Cerfoglg trial_id replaces replication_num. We can access the data of a certain replication of an experiment by directly using the trial_id.

We need the experiment table for storing the existence of experiment entities, with certain assigned ids.

Cerfoglg commented 8 years ago

@VincenzoFerme Fixed that. Also using trial_id and experiment_id for partition keys.

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the effort. I made the following changes:

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)

VincenzoFerme commented 8 years ago

@Cerfoglg something we should evaluate to enhance the performance of our schema definition:

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)

Cerfoglg commented 8 years ago

@VincenzoFerme

VincenzoFerme commented 8 years ago

@Cerfoglg