benchflow / data-transformers

Spark scripts utilised to transform data to the BenchFlow internal formats

Cassandra database design #6

Open VincenzoFerme opened 9 years ago

VincenzoFerme commented 9 years ago

Discuss the design of the Cassandra database, similar to what has been done for Minio in #4

Cerfoglg commented 9 years ago

The database schema in Marco Argenti's thesis (figure 5.1) is a good place to start. In that, we have the following tables and columns:

These can be easily written in Cassandra as column families. However, Cassandra is a schemaless, distributed database, which raises a few problems. Let's assume that we want to look up environment data with a certain trial ID. If we were to query Cassandra with something like SELECT * FROM EnvironmentData WHERE TrialID = ***, we may end up having to read several nodes of our Cassandra cluster, which has a significant impact on performance. To counter this, it's a good idea to create inverted indexes mapping important values, such as TrialID, to their associated data (e.g., EnvironmentData). This way we avoid queries that touch more than one node. The inverted indexes we should use are:
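
The idea can be sketched in plain Python (a conceptual illustration only, not BenchFlow code): the inverted index maps a lookup value such as a trial ID back to the row keys that hold its data, so a read can be served from one place instead of scanning the whole cluster.

```python
# Conceptual sketch of an inverted index: map each trial ID to the
# row keys of the data that belongs to it.
def build_inverted_index(rows):
    """rows: iterable of (row_key, trial_id) pairs."""
    index = {}
    for row_key, trial_id in rows:
        index.setdefault(trial_id, []).append(row_key)
    return index

rows = [("env-1", "a6_3"), ("env-2", "a6_3"), ("env-3", "a7_1")]
index = build_inverted_index(rows)
print(index["a6_3"])  # row keys holding environment data for trial a6_3
```

In Cassandra this would materialise as an extra column family keyed by TrialID, written alongside the main table.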

Attached is a quick sketch showing Marco's database schema, plus the additional indexes on the side.

[attachment: sketch of the database schema with the additional indexes]

VincenzoFerme commented 9 years ago

@Cerfoglg thank you for the description.

Some notes and TODOs:

  1. Reference schema - Table: Experiment, Field: TrialID: it is a composite key of ExperimentID and ReplicationNum. It is useful so that we can directly identify data related to a given replication of the same experiment (e.g., run number 3 of experiment a6)
  2. If we use the suggested inverted indexes we can remove some redundant data from some tables (e.g., does it make sense to have the TrialID in EnvironmentData and Process?)
  3. What are the PRIMARY and what are the SECONDARY indexes?
  4. Give also a look at the following reference: https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes

We should also consider when it makes sense to add Materialized View to access data in a more performant way: http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views

Cerfoglg commented 9 years ago

Actually, looking at materialised views, if I'm understanding it correctly, we may just want to use those instead of creating explicit inverted indexes, as they help solve the same issue. Opinions?

VincenzoFerme commented 9 years ago

@Cerfoglg it seems so. Have a look at the following three references, in order to come up with the right approach according to our requirements:

Write down the PROs and CONs of both approaches. First, clearly write down the requirements we have for writing and reading data.

Cerfoglg commented 9 years ago

Attaching here the code to create the Cassandra database. To use this, run cqlsh with option -f to pass this file.

CREATE KEYSPACE benchflow
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE benchflow;

CREATE TABLE environment_data (
  environment_data_id uuid,
  container_properties_id uuid,
  read_time text,
  cpu_percpu_usage list<bigint>,
  cpu_total_usage bigint,
  cpu_percent_usage float,
  cpu_throttling_data map<text, bigint>,
  memory_usage bigint,
  memory_max_usage bigint,
  network_interfaces map<text, uuid>,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), environment_data_id, container_properties_id)
);

CREATE TABLE network_interface_data (
  network_interface_data_id uuid,
  network_rx_dropped bigint,
  network_rx_bytes bigint,
  network_rx_errors bigint,
  network_tx_packets bigint,
  network_tx_dropped bigint,
  network_rx_packets bigint,
  network_tx_errors bigint,
  network_tx_bytes bigint,
  PRIMARY KEY (network_interface_data_id)
);

CREATE TABLE experiment (
  experiment_id text,
  replication_num int,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id))
);

CREATE TABLE process (
  process_instance_id uuid,
  source_process_instance_id text,
  process_definition_id text,
  start_time timestamp,
  duration bigint,
  end_time timestamp,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), process_instance_id)
);

CREATE TABLE construct (
  construct_instance_id uuid,
  source_construct_instance_id text,
  construct_type text,
  construct_name text,
  start_time timestamp,
  duration bigint,
  end_time timestamp,
  process_instance_id uuid,
  experiment_id text,
  trial_id text,
  PRIMARY KEY ((trial_id, experiment_id), construct_instance_id, process_instance_id)
);

CREATE TABLE container_properties (
  container_properties_id uuid,
  container_id text,
  experiment_id text,
  trial_id text,
  host_id uuid,
  environment map<text, text>,
  image text,
  labels list<text>,
  links list<text>,
  log_driver text,
  u_limits map<text, int>,
  volume_driver text,
  volumes_from list<text>,
  cpu_shares int,
  cpu_set_cpus text,
  cpu_set_mems text,
  cpu_quota int,
  cpu_period int,
  blkio_weight int,
  mem_limit text,
  mem_swap_limit text,
  mem_reservation_limit text,
  mem_kernel_limit text,
  memory_swappiness int,
  oom_kill_disable boolean,
  privileged boolean,
  read_only boolean,
  restart text, 
  user text,
  name text,
  network text,
  restart_policy text,
  PRIMARY KEY ((trial_id, experiment_id), container_properties_id, host_id)
);

CREATE TABLE host_properties (
  host_id uuid,
  cpu_cfs_period boolean,
  cpu_cfs_quota boolean,
  debug boolean,
  discovery_backend text,
  docker_root_dir text,
  driver text,
  driver_status list<text>,
  execution_driver text,
  experimental_build boolean,
  http_proxy text,
  https_proxy text,
  ipv4_forwarding boolean,
  index_server_address text,
  init_path text,
  init_sha1 text,
  kernel_version text,
  labels list<text>,
  mem_total bigint,
  memory_limit boolean,
  n_cpu int,
  n_events_listener int,
  n_fd int,
  n_goroutines int,
  name text,
  no_proxy text,
  oom_kill_disable boolean,
  operating_system text,
  swap_limit boolean,
  system_time text,
  server_version text,
  docker_version text,
  docker_os text,
  docker_kernel_version text,
  docker_go_version text,
  docker_git_commit text,
  docker_arch text,
  docker_api_version text,
  docker_experimental boolean,
  alias text,
  external_ip text,
  internal_ip text,
  hostname text, 
  purpose text,
  PRIMARY KEY(host_id)
);
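
Reads should restrict both columns of the composite partition key so they stay on a single partition. A sketch of how a transformer could build such a query (the helper name is hypothetical; %s placeholders follow the Python cassandra-driver parameter style):

```python
# Hypothetical query helper: both partition-key columns (trial_id,
# experiment_id) are restricted, keeping the read on one partition.
def select_by_trial(table, trial_id, experiment_id):
    query = ("SELECT * FROM {} WHERE trial_id = %s AND experiment_id = %s"
             .format(table))
    return query, (trial_id, experiment_id)

query, params = select_by_trial("environment_data", "a6_3", "a6")
print(query)
```

In real code the query would be passed to `session.execute(query, params)` (ideally as a prepared statement) rather than interpolating values into the string.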

Definitions:

Setup:

Cerfoglg commented 8 years ago

@VincenzoFerme I updated the schema a bit. For environmentData I'm using uuid type for ID, because I'm now generating uuid values within the stats transformer script, making sure rows receive unique IDs.

I'm leaving the materialized view definitions, though as of right now I've only used Cassandra 2.2.3 for testing, as Cassandra 3 doesn't currently work with the pyspark cassandra connector.

VincenzoFerme commented 8 years ago

@Cerfoglg Looking at the schema you provided in https://github.com/benchflow/data-transformers/issues/6#issuecomment-160972532, I have the following questions:

  1. In point 1) of https://github.com/benchflow/data-transformers/issues/6#issuecomment-158329810 I state that "TrialID is a composite key of ExperimentID and ReplicationNum. It is useful so that we can directly identify data related to a given replication of the same experiment (e.g., run number 3 of experiment a6).". To do so, we should define a way (read: a function library shared in the common repository) to compute the TrialID given the ExperimentID and ReplicationNum as input, because you need this computed value to store the data in Cassandra. You can refer to Marco's solution for this.
  2. Change the name of ProcessModelID in Process, to ProcessDefinitionID
  3. Change the name of ConstructID to ConstructInstanceID and SourceConstructID to SourceConstructInstanceID in Construct
  4. In the Process and Construct tables, how do you compute ProcessInstanceID and ConstructID respectively?
  5. In Construct, if the ConstructID is unique, why do you define PRIMARY KEY (ConstructID, ProcessInstanceID)? The same applies to the Process table and its primary key. Is it related to the following: http://stackoverflow.com/a/24953331? If yes, why this choice of key? The same discussion applies to the EnvironmentData table
  6. Why is the primary key of the experiment table PRIMARY KEY (ExperimentID, TrialID) and not just TrialID?
  7. EnvironmentData, remove LineNumber int
  8. The EnvironmentData should contain the following information, as described on the Docker Stats API documentation:

    • read
    • networks: for every interface we should store all the provided metrics
    • memory_stats: usage, max_usage
    • cpu_stats: throttling_data, plus the percentage with respect to the total CPU usage (total_usage) and the per-CPU (percpu_usage) usage of the container. You can get some hints on the actual meaning of the data and what they are about at the following link: https://www.datadoghq.com/blog/how-to-collect-docker-metrics/

    The names of the fields in our schema should be meaningful, to simply map each value to the actual metric: e.g., memory_stats -> usage should be called memory_usage

  9. EnvironmentData, EnvironmentID reference: this should reference a table where we store the information about the environment in which the data have been collected. Think of it as the metadata describing a Docker container's properties (e.g., provided CPU, provided RAM, network configuration and so on). For now, let's add a table for this reference to point to, containing the following data (some of them are optional), as described in the Docker Compose file reference:

    and the following ones described on the Docker Run reference:

  10. Materialized views, without the actual common data we are going to use for the analysers, might not be useful. Let's keep this in mind when defining the analysers.

Cerfoglg commented 8 years ago

@VincenzoFerme Alright, I altered the file to fix things based on your suggestions, but there are still a few things left hanging:

  1. If the TrialID is something like, say, a7_3 for ReplicationNumber 3 and ExperimentID a7, then computing it from ReplicationNumber and ExperimentID should be pretty straightforward, right? I'm not sure I understood this correctly.

  2. If they aren't given, then they have to be computed on the fly when storing to Cassandra because, as I've mentioned, Cassandra has no AUTO_INCREMENT option to automatically generate IDs like MySQL does. Right now I use uuids for unique row IDs.

  3. Shouldn't the Experiment table have primary key ExperimentID? TrialID is a composite of ExperimentID and ReplicationID, so it refers to specific runs of an experiment as well.

Also, note that in Cassandra when you write PRIMARY KEY (x, y, z), it doesn't mean that x, y and z together form the primary key: it means that x is the partition key, and y and z are clustering (column ordering) keys. To define a composite partition key in Cassandra you write PRIMARY KEY ((x, y, z)). So in those tables I only had one partition key ^^
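
A minimal sketch of the shared helper discussed in point 1, assuming trial_id is the plain concatenation of ExperimentID and ReplicationNum (the function name is hypothetical), together with the client-side uuid generation that replaces AUTO_INCREMENT:

```python
import uuid

# Hypothetical shared helper: trial_id = "<experiment_id>_<replication_num>"
def compute_trial_id(experiment_id, replication_num):
    return "{}_{}".format(experiment_id, replication_num)

# Cassandra has no AUTO_INCREMENT, so unique row IDs are generated
# client-side in the transformer scripts.
row_id = uuid.uuid4()

print(compute_trial_id("a7", 3))  # a7_3
```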

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the detailed answer and for changing the schema. Some remaining points to discuss about:

  1. You are right, we can keep it simple on Cassandra. The previous solution, using MySQL, had to implement an MD5 hash of the composite key to improve performance. We can start from what you propose (ExperimentID_ReplicationNum), but then we might want to take the MD5 hash of this string to speed up data retrieval. It depends on how Cassandra's performance compares for hash vs. string retrieval, and on the internal optimizations Cassandra applies. Maybe it already uses hashes internally, even if we define strings.

  2. How was it in Marco's thesis? I guess it was an AUTO_INCREMENT. If this is the case, just use uuid for these as well.

  3. Because you can have multiple repetitions of the same ExperimentID, and the PRIMARY KEY must be unique.


  1. For PRIMARY KEYs / Column Ordering (or Clustering) Keys: ok, thank you for the explanation. Can you please write down, for each definition, why we have those columns?

  1. EnvironmentData:

    • how can I store the information for every interface in your definition?
    • cpu_stats: how can I store the information for each CPU? How do you compute the percentage with respect to the total usage?
  2. DockerContainerProperties:

    • Change the name of DockerContainerProperties to ContainerProperties
    • Where is the EnvironmentID we use as a reference in EnvironmentData?
    • What is the datatype of cpu_shares?
    • Why is privileged a text? Shouldn't it be a boolean? The same applies to read_only. Just cross-check the other fields as well.
    • I guess you missed the runtime constraints on the resources. The ones missing from (x)

In general, for all the schema: We should be consistent about the naming of the fields: the underscore_case notation is fine.

Cerfoglg commented 8 years ago

@VincenzoFerme

  1. I think Cassandra does it on its own, so we shouldn't have to worry about hashing ourselves.

  2. He was probably using an auto increment.

  3. I was trying to order by relevant IDs, like Env data is ordered by EnvironmentID and TrialID, Experiment by ExperimentID, Process by TrialID, like so.

  4. Yeah you're right, I'll change that to use TrialID.

  5. The network data is all provided in the Docker stats just like that (rx for receive, tx for transmit). As for CPUs, we discussed this early in the semester: with just the Docker Stats API we can't calculate CPU percentages. The stats only give you a total usage, and then an array representing per-core usage, without telling you which core is which, or what their max would be. I'm afraid we may have to use these data as they are, because the alternative could end up being extremely complex to implement.

Anyways, will update the definitions with these observations.

VincenzoFerme commented 8 years ago

@Cerfoglg

  1. Assign the uuid type to the field, if you are using uuid to guarantee uniqueness.

  2. Ok. Since it matters for performance, just consider that we are going to access the data mostly by Trial_ID and Experiment_ID. This means you can repeat these fields in all the tables (e.g., in Constructs) for performance purposes, and also set up the schema so that retrieval of the data using these two IDs is fast, if this is the purpose of the ordering. We should order by trial_id first because it is more selective.

  3. Why are the last fields of Container_Properties lowercase? Be consistent with naming all over the tables.

  4. DockerContainerProperties:

    • what is the difference between Container_ID and Environment_ID?
    • change the name of Cpu_set to Cpu_Set_Cpus if it corresponds to this value defined by Docker
    • where do you define kernel-memory?

VincenzoFerme commented 8 years ago

@Cerfoglg

8.1 I'm going to answer you in detail in the following.

First, how can I store the information the API returns when there is more than one network interface, in your schema? Example from the Docker documentation:

...
"networks": {
             "eth0": {
                 "rx_bytes": 5338,
                 "rx_dropped": 0,
                 "rx_errors": 0,
                 "rx_packets": 36,
                 "tx_bytes": 648,
                 "tx_dropped": 0,
                 "tx_errors": 0,
                 "tx_packets": 8
             },
             "eth5": {
                 "rx_bytes": 4641,
                 "rx_dropped": 0,
                 "rx_errors": 0,
                 "rx_packets": 26,
                 "tx_bytes": 690,
                 "tx_dropped": 0,
                 "tx_errors": 0,
                 "tx_packets": 9
             }
     }
...

How can I store the following information returned by the Docker API, when I have more than one core for percpu_usage:

...
"cpu_stats" : {
        "cpu_usage" : {
           "percpu_usage" : [
              16970827,
              1839451,
              7107380,
              10571290
           ],
           ...
           "total_usage" : 36488948,
           ...
        },
        "system_cpu_usage" : 20091722000000000,
        "throttling_data" : {}
     }
...
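
One way to fit these nested structures into the schema above (a sketch, assuming `stats` is the parsed Docker Stats JSON): store each interface as its own row in network_interface_data keyed by a client-generated uuid, keep a map from interface name to that uuid in environment_data (the map<text, uuid> column), and store percpu_usage directly as a list<bigint>.

```python
import uuid

# stats: the parsed Docker Stats JSON, trimmed to the fields used here.
stats = {
    "networks": {
        "eth0": {"rx_bytes": 5338, "tx_bytes": 648},
        "eth5": {"rx_bytes": 4641, "tx_bytes": 690},
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [16970827, 1839451, 7107380, 10571290],
            "total_usage": 36488948,
        }
    },
}

network_interfaces = {}  # -> map<text, uuid> column in environment_data
interface_rows = []      # -> one row each for the network_interface_data table
for name, counters in stats["networks"].items():
    iface_id = uuid.uuid4()
    network_interfaces[name] = iface_id
    interface_rows.append(dict(network_interface_data_id=iface_id, **counters))

# Stored as-is in the list<bigint> column; per-core identity is not preserved
# by the API, so the list order is all we have.
cpu_percpu_usage = stats["cpu_stats"]["cpu_usage"]["percpu_usage"]
```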
VincenzoFerme commented 8 years ago

@Cerfoglg

8.2 The Docker Stats API also returns a field named precpu_stats. In this reference you can find a Python implementation relying on the precpu_stats data to compute the CPU usage. The referenced method also solves the problem of normalising the CPU percentage by the number of cores. If it works as expected we should obtain a number in the range [0,100].
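
For reference, the usual delta-based calculation can be sketched as follows (field names are from the Stats API; leaving out the per-core multiplier used elsewhere keeps the result in [0,100], since system_cpu_usage aggregates all cores). The sample values are illustrative, not real measurements.

```python
def cpu_percent(stats):
    # Deltas between the current sample (cpu_stats) and the previous one
    # (precpu_stats), both returned in a single Stats API response.
    cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
                 - stats["precpu_stats"]["cpu_usage"]["total_usage"])
    system_delta = (stats["cpu_stats"]["system_cpu_usage"]
                    - stats["precpu_stats"]["system_cpu_usage"])
    if system_delta <= 0:
        return 0.0
    # system_cpu_usage covers all cores, so the ratio stays within [0,1].
    return 100.0 * cpu_delta / system_delta

sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 38488948},
                  "system_cpu_usage": 20091742000000000},
    "precpu_stats": {"cpu_usage": {"total_usage": 36488948},
                     "system_cpu_usage": 20091722000000000},
}
print(cpu_percent(sample))  # 0.01
```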

VincenzoFerme commented 8 years ago

@Cerfoglg add also a table about Host_Properties, mapped to the Container_Properties table through the current Environment_ID. Then rename it to Host_ID. The table should contain the data obtained from the following two APIs:

Cerfoglg commented 8 years ago

@VincenzoFerme Made changes based on your comments: added the new table for host properties, plus the network interface table for the data related to network interfaces per environment data, and I'm now using lists for percpu and network interfaces in environment data.

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the additions. Some notes:

Cerfoglg commented 8 years ago

@VincenzoFerme Modified the schema again based on your comments, except for these points:

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the update. Some remaining points:

Cerfoglg commented 8 years ago

@VincenzoFerme Fixed the points above, plus I added experiment_id and trial_id to all tables that didn't have it, and used them for column ordering (see PRIMARY KEY definitions)

Cerfoglg commented 8 years ago

@VincenzoFerme One last thing, rearranged the clustering order in the PRIMARY KEY definitions

Cerfoglg commented 8 years ago

@VincenzoFerme I think what we also need is replication_num in tables like process, because if we are trying to analyse a certain rep number, we should probably have it available.

In fact, do we even need trial_id, if we know that it's just the concatenation of experiment_id and replication_id? And do we need the Experiment table for that matter?

VincenzoFerme commented 8 years ago

@Cerfoglg trial_id replaces replication_num. We can access the data of a certain replication of an experiment by directly using the trial_id.

We need the experiment table for storing the existence of experiment entities, with certain assigned ids.

Cerfoglg commented 8 years ago

@VincenzoFerme Fixed that. Also using trial_id and experiment_id for partition keys.

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the effort. I made the following changes:

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)

VincenzoFerme commented 8 years ago

@Cerfoglg something we should evaluate to enhance the performance of our schema definition:

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)

Cerfoglg commented 8 years ago

@VincenzoFerme

VincenzoFerme commented 8 years ago

@Cerfoglg