VincenzoFerme opened this issue 9 years ago
The database schema in Marco Argenti's thesis (figure 5.1) is a good place to start. In that, we have the following tables and columns:
These can be easily written in Cassandra as column families. However, Cassandra is a schemaless, distributed database, which raises a few problems. Let's assume that we want to look for environment data with a certain trial ID. If we were to query Cassandra with something like SELECT * FROM EnvironmentData WHERE TrialID = ***, we may end up having to read several nodes of our Cassandra cluster, which has a significant impact on performance. To counter this, it's a good idea to create inverted indexes mapping important values, such as TrialID, to their associated data (e.g., EnvironmentData). This way, queries can be served by a single node. The inverted indexes we should use are:
Attached is a quick sketch showing Marco's database schema, plus the additional indexes on the side.
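As a toy illustration of the inverted-index idea in plain Python (the dictionary contents and names are invented for the example):

```python
# Toy illustration of the inverted-index idea: map each TrialID to the IDs of
# its EnvironmentData rows, so a lookup by TrialID touches a single structure
# instead of scanning every row.
environment_data = {
    "e1": {"trial_id": "a7_1", "cpu_total_usage": 100},
    "e2": {"trial_id": "a7_1", "cpu_total_usage": 120},
    "e3": {"trial_id": "a7_2", "cpu_total_usage": 90},
}

inverted = {}
for row_id, row in environment_data.items():
    inverted.setdefault(row["trial_id"], []).append(row_id)
```

In Cassandra the same effect is achieved by a dedicated column family (or, as discussed below, a materialized view) keyed by the value being looked up.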
@Cerfoglg thank you for the description.
Some notes and TODOs:
We should also consider when it makes sense to add Materialized Views to access data in a more performant way: http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views
Actually, looking at materialised views, if I'm understanding it correctly, we may just want to use those instead of creating explicit inverted indexes, as they help solve the same issue. Opinions?
@Cerfoglg it seems so. Take a look at the following three references, in order to come up with the right approach according to our requirements:
Write down the pros and cons of both approaches. First, clearly write down the requirements we have for writing and reading data.
Attaching here the code to create the Cassandra database. To use it, run `cqlsh` with the `-f` option to pass this file.
CREATE KEYSPACE benchflow
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE benchflow;
CREATE TABLE environment_data (
environment_data_id uuid,
container_properties_id uuid,
read_time text,
cpu_percpu_usage list<bigint>,
cpu_total_usage bigint,
cpu_percent_usage float,
cpu_throttling_data map<text, bigint>,
memory_usage bigint,
memory_max_usage bigint,
network_interfaces map<text, uuid>,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), environment_data_id, container_properties_id)
);
CREATE TABLE network_interface_data (
network_interface_data_id uuid,
network_rx_dropped bigint,
network_rx_bytes bigint,
network_rx_errors bigint,
network_tx_packets bigint,
network_tx_dropped bigint,
network_rx_packets bigint,
network_tx_errors bigint,
network_tx_bytes bigint,
PRIMARY KEY (network_interface_data_id)
);
CREATE TABLE experiment (
experiment_id text,
replication_num int,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id))
);
CREATE TABLE process (
process_instance_id uuid,
source_process_instance_id text,
process_definition_id text,
start_time timestamp,
duration bigint,
end_time timestamp,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), process_instance_id)
);
CREATE TABLE construct (
construct_instance_id uuid,
source_construct_instance_id text,
construct_type text,
construct_name text,
start_time timestamp,
duration bigint,
end_time timestamp,
process_instance_id uuid,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), construct_instance_id, process_instance_id)
);
CREATE TABLE container_properties (
container_properties_id uuid,
container_id text,
experiment_id text,
trial_id text,
host_id uuid,
environment map<text, text>,
image text,
labels list<text>,
links list<text>,
log_driver text,
u_limits map<text, int>,
volume_driver text,
volumes_from list<text>,
cpu_shares int,
cpu_set_cpus text,
cpu_set_mems text,
cpu_quota int,
cpu_period int,
blkio_weight int,
mem_limit text,
mem_swap_limit text,
mem_reservation_limit text,
mem_kernel_limit text,
memory_swappiness int,
oom_kill_disable boolean,
privileged boolean,
read_only boolean,
restart text,
user text,
name text,
network text,
restart_policy text,
PRIMARY KEY ((trial_id, experiment_id), container_properties_id, host_id)
);
CREATE TABLE host_properties (
host_id uuid,
cpu_cfs_period boolean,
cpu_cfs_quota boolean,
debug boolean,
discovery_backend text,
docker_root_dir text,
driver text,
driver_status list<text>,
execution_driver text,
experimental_build boolean,
http_proxy text,
https_proxy text,
ipv4_forwarding boolean,
index_server_address text,
init_path text,
init_sha1 text,
kernel_version text,
labels list<text>,
mem_total bigint,
memory_limit boolean,
n_cpu int,
n_events_listener int,
n_fd int,
n_goroutines int,
name text,
no_proxy text,
oom_kill_disable boolean,
operating_system text,
swap_limit boolean,
system_time text,
server_version text,
docker_version text,
docker_os text,
docker_kernel_version text,
docker_go_version text,
docker_git_commit text,
docker_arch text,
docker_api_version text,
docker_experimental boolean,
alias text,
external_ip text,
internal_ip text,
hostname text,
purpose text,
PRIMARY KEY(host_id)
);
Definitions:
Setup:
@VincenzoFerme I updated the schema a bit. For environmentData I'm using uuid type for ID, because I'm now generating uuid values within the stats transformer script, making sure rows receive unique IDs.
I'm leaving the materialized view definitions, though as of right now I've only used Cassandra 2.2.3 for testing, as Cassandra 3 doesn't currently work with the pyspark cassandra connector.
@Cerfoglg Looking at the schema you provided in https://github.com/benchflow/data-transformers/issues/6#issuecomment-160972532, I have the following questions:
The EnvironmentData should contain the following information, as described on the Docker Stats API documentation:
The names of the fields in our schema should be meaningful, to simply map the value to the actual metric: e.g., memory_stats -> usage should be called memory_usage
EnvironmentData, EnvironmentID reference: this should reference a table where we store the information about the environment in which the data have been collected. Think of it as the metadata describing a Docker container's properties (e.g., provided CPU, provided RAM, network configuration and so on). For now, let's add a table to which this reference points, containing the following data (some of them are optional), as described in the Docker Compose file reference:
and the following ones described on the Docker Run reference:
@VincenzoFerme Alright, I altered the file to fix things based on your suggestions, but there are still a few things left hanging:
If the TrialID is something like, say, TrialID a7_3 for ReplicationNumber 3 and ExperimentID a7, then computing it from ReplicationNumber and ExperimentID should be pretty straightforward, right? I'm not sure I understood this correctly.
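A minimal sketch of that computation, assuming the `a7_3` convention above (the helper name is hypothetical):

```python
def make_trial_id(experiment_id: str, replication_num: int) -> str:
    # TrialID as the concatenation of ExperimentID and ReplicationNumber
    return f"{experiment_id}_{replication_num}"
```

With this, `make_trial_id("a7", 3)` yields `"a7_3"`.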
If they aren't given, then they have to be computed on the fly when storing to Cassandra, because, as I've mentioned, Cassandra has no AUTO_INCREMENT option to automatically create IDs like MySQL does, for example. Right now I use uuids for unique row IDs.
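For example, client-side ID generation with Python's standard library (as presumably done in the stats transformer script):

```python
import uuid

# Client-side unique row ID: Cassandra has no AUTO_INCREMENT, so a random
# (version 4) UUID is generated per row before insertion instead.
environment_data_id = uuid.uuid4()
```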
Shouldn't the Experiment table have primary key ExperimentID? TrialID is a composite of ExperimentID and ReplicationID, so it refers to specific runs of an experiment as well.
Also, note that in Cassandra when you write PRIMARY KEY (x, y, z), it doesn't mean that x, y and z together form a composite partition key; it means that x is the partition key, and y and z are clustering (column ordering) keys. To write a composite partition key in Cassandra you write PRIMARY KEY ((x, y, z)). So in those tables I only had one partition key ^^
@Cerfoglg thank you for the detailed answer and for changing the schema. Some remaining points to discuss about:
You are right, we can keep it simple on Cassandra. The previous solution, using MySQL, had to compute an MD5 hash of the composite key to improve performance. We can start from what you propose, ExperimentID_TrialID, but then we might want to use the MD5 hash of this string to speed up data retrieval. It depends on how Cassandra's performance compares for hash vs string retrieval and on the internal optimizations Cassandra does. Maybe it already uses hashes internally, even if we define strings.
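If we did decide to hash the composite key ourselves, a sketch with Python's `hashlib` could look like this (the helper name is invented; the schema currently stores the plain strings). Note that Cassandra's default Murmur3 partitioner already hashes the partition key when placing rows, which supports the point that explicit hashing may be unnecessary:

```python
import hashlib

def hashed_partition_key(experiment_id: str, trial_id: str) -> str:
    # MD5 of the composite key, mirroring the earlier MySQL-based approach
    return hashlib.md5(f"{experiment_id}_{trial_id}".encode("utf-8")).hexdigest()
```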
How was it done in Marco's thesis? I guess it was an AUTO_INCREMENT. If this is the case, just use uuid for these as well.
Because you can have repetitions of the same ExperimentID, and the PRIMARY KEY should be unique.
EnvironmentData:
DockerContainerProperties:
In general, for all the schema: We should be consistent about the naming of the fields: the underscore_case notation is fine.
@VincenzoFerme
I think Cassandra does it on its own, so we shouldn't have to worry about hashing ourselves.
He was probably using an auto increment.
I was trying to order by relevant IDs: Env data is ordered by EnvironmentID and TrialID, Experiment by ExperimentID, Process by TrialID, and so on.
Yeah you're right, I'll change that to use TrialID.
The network data is all provided in the Docker stats just like that (rx for receive, tx for transmit). As for CPUs, we have discussed this early in the semester: with just the Docker Stats API we can't calculate CPU percentages. The stats only give you a total usage, and then an array representing per-core usage, without telling you which core is which, or what their max would be. I'm afraid we may have to use these data as they are, because the alternative could end up being extremely complex to implement.
Anyway, I will update the definitions with these observations.
@Cerfoglg
Assign the uuid type to the field if you are using uuids to guarantee uniqueness.
Ok. Since it matters for performance, just consider that we are going to access the data mostly by Trial_ID and Experiment_ID. This means you can have repetition of these fields in all the tables (e.g., in Constructs) for performance purposes, and you can also set up the schema so that retrieving the data using these two IDs is fast, if that is the scope of the ordering. We should order by trial_id first, because it is more selective.
Why are the last fields of Container_Properties lowercase? Be consistent with the naming all over the tables.
DockerContainerProperties:
@Cerfoglg
8.1 I'm going to answer you in detail in the following.
First, how can I store in your schema the information the API returns when you have more than one network interface? Example from the Docker documentation:
...
"networks": {
"eth0": {
"rx_bytes": 5338,
"rx_dropped": 0,
"rx_errors": 0,
"rx_packets": 36,
"tx_bytes": 648,
"tx_dropped": 0,
"tx_errors": 0,
"tx_packets": 8
},
"eth5": {
"rx_bytes": 4641,
"rx_dropped": 0,
"rx_errors": 0,
"rx_packets": 26,
"tx_bytes": 690,
"tx_dropped": 0,
"tx_errors": 0,
"tx_packets": 9
}
}
...
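One way to flatten such a payload into per-interface rows plus the `map<text, uuid>` column later adopted in `environment_data` (a sketch; variable names are illustrative and the sample is abridged):

```python
import uuid

# Sample payload shaped like the Docker stats "networks" object above
stats = {
    "networks": {
        "eth0": {"rx_bytes": 5338, "tx_bytes": 648},
        "eth5": {"rx_bytes": 4641, "tx_bytes": 690},
    }
}

interface_rows = {}  # rows destined for the per-interface table
name_to_id = {}      # candidate value for a map<text, uuid> column
for name, counters in stats["networks"].items():
    row_id = uuid.uuid4()         # one generated row ID per interface
    interface_rows[row_id] = counters
    name_to_id[name] = row_id
```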
How can I store the following information returned by the Docker API, when I have more than one core for percpu_usage:
...
"cpu_stats" : {
"cpu_usage" : {
"percpu_usage" : [
16970827,
1839451,
7107380,
10571290
],
...
"total_usage" : 36488948,
...
},
"system_cpu_usage" : 20091722000000000,
"throttling_data" : {}
}
...
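The per-core array can be stored directly as the `cpu_percpu_usage list<bigint>` column; a small sketch using the sample values above:

```python
# Sample shaped like the cpu_stats object above
stats = {
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [16970827, 1839451, 7107380, 10571290],
            "total_usage": 36488948,
        }
    }
}

# Stored as-is in a list<bigint> column: the cores are unnamed, but the
# per-core order of the array is preserved by Cassandra's list type.
cpu_percpu_usage = list(stats["cpu_stats"]["cpu_usage"]["percpu_usage"])
```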
@Cerfoglg
8.2 The Docker Stats API also returns a field named `precpu_stats`. On this reference you can find a Python implementation relying on the `precpu_stats` data to compute the CPU usage. The referenced method also solves the problem of normalising the CPU percentage by the number of cores. If it works as expected, we should obtain a number in the range [0,100].
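A sketch of the normalisation idea, comparing the current sample against the previous one (this is an assumption-laden sketch following the Docker stats JSON layout shown above, not the exact implementation from the reference):

```python
def cpu_percent(stats: dict) -> float:
    """Normalised CPU usage in [0, 100] from one Docker stats sample.

    Compares cpu_stats against precpu_stats (the previous sample). Since
    system_cpu_usage already aggregates all cores, the delta ratio is a
    core-normalised fraction of total capacity.
    """
    cur, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cur["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cur["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0:
        return 0.0  # no elapsed system time between samples
    return min(100.0, 100.0 * cpu_delta / system_delta)

# Synthetic sample: the container consumed half of the system CPU time delta
sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 600}, "system_cpu_usage": 2000},
    "precpu_stats": {"cpu_usage": {"total_usage": 100}, "system_cpu_usage": 1000},
}
```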
@VincenzoFerme Made changes based on your comments: added the new table for host properties, plus the network interface table for the data related to network interfaces per environment data, and using lists for percpu and network interfaces in environment data.
@Cerfoglg thank you for the additions. Some notes:

Host Properties:
- remove the `containers`, `images` and `registry_config` related data, as written in my previous comment.
- `driver_status` seems that it should be `[[""]]`
- add the `docker_` prefix to the properties retrieved using the following API: https://docs.docker.com/engine/reference/api/docker_remote_api_v1.22/#show-the-docker-version-information
- add `alias` type text, `external_ip` type text, `internal_ip` type text, `hostname` type text, `purpose` type text
- `Environment_ID`: rename it to `Host_ID`

`environment_data`:
- `throttling_data` should be an object: https://github.com/docker/docker/blob/master/api/types/stats.go#L8
- `network_interfaces list<text>` has type list
- why in `network_interface` do you have `environment_data_id uuid`?

`network_interface`:
- `name` as key?

@VincenzoFerme Modified the schema again based on your comments, except for these points:
- `throttling_data` I made into a map, so I can have the fields of the throttling data mapped to their values for convenience, and because it's better than making an entire new table.
- the `network_interfaces` field in `environment_data` is also now a map from the name of the interface to the uuid of the row in `network_interface` containing the data for that interface. This should streamline things a bit.

@Cerfoglg thank you for the update. Some remaining points:
- `throttling_data` and `network_interfaces`: ok
- `environment_id` in `environment_data`? Shouldn't it be `container_id` now? Or do you have the mapping in `Container_Properties` using the `environment` field? If this is the case, why this name, and why does it not reference the `id` of type uuid in `environment_data`?
- rename `Container_Properties` to `container_properties`, and its field `trial_ID` to `trial_id`
@VincenzoFerme Fixed the points above, plus I added experiment_id and trial_id to all tables that didn't have it, and used them for column ordering (see PRIMARY KEY definitions)
@VincenzoFerme One last thing: I rearranged the clustering order in the PRIMARY KEY definitions
@VincenzoFerme I think we also need replication_num in tables like process, because if we are trying to analyse a certain replication number, we should probably have it available.
In fact, do we even need trial_id, if we know that it's just the concatenation of experiment_id and replication_id? And do we need the Experiment table for that matter?
@Cerfoglg `trial_id` replaces `replication_num`. We can access the data of a certain replication of an experiment by directly using the `trial_id`.

We need the `experiment` table for storing the existence of experiment entities, with certain assigned ids.
@VincenzoFerme Fixed that. Also using trial_id and experiment_id for partition keys.
@Cerfoglg thank you for the effort. I made the following changes:

- `container_properties`: `links` type converted to list
- `container_properties`: removed the following properties, because they are part of the `RegistryConfig` we don't want to store, as per https://github.com/benchflow/data-transformers/issues/6#issuecomment-163598484: `docker_io_mirrors text`, `docker_io_name text`, `docker_io_official boolean`, `docker_io_secure boolean`
- `container_properties`: `mem_limit`, `mem_swap_limit`, `mem_reservation_limit`, `mem_kernel_limit` type converted to text, given that Docker allows you to specify the unit
- `container_properties`: `volumes_from` type converted to list
- `container_properties`: `u_limits` type converted to map<text, int>. The original structure is nested (e.g., the `soft` field under `nofile`); the idea of using a map is that you can convert the previous structure to the following key: `nofile_soft`. @Cerfoglg is there a better way to map this type of structure in Cassandra?
- `container_properties`: `labels` type converted to list
- `network_interface`: type converted to bigint for all the fields but the id
- `process` and `construct`: `duration` type converted to bigint
- `construct`: `process_instance_id` type converted to uuid
- `container_properties`: `host_id` type converted to uuid
- `container_properties`: added an additional `id` in order to not rely on the `container_id` returned by Docker, because it can be the same for different containers run on different hosts. I also updated all the references in all the other tables.
- `network_interface`: renamed `id` to `network_interface_data_id`
- `network_interface`: renamed to `network_interface_data`
- `environment_data`: added `environment_data_id`; it was missing an id. I also added it to the PRIMARY KEY
- `host_properties`: we don't need `experiment_id` and `trial_id`, since the rows represent physical hosts on which we can execute more experiments.

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)
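For the `u_limits` point above, a possible flattening of Docker's ulimit entries into a `map<text, int>` with keys like `nofile_soft` (the helper name and the exact input layout are assumptions):

```python
def flatten_ulimits(ulimits: list) -> dict:
    """Flatten entries like {"Name": "nofile", "Soft": 1024, "Hard": 2048}
    into a flat dict suitable for a Cassandra map<text, int> column."""
    flat = {}
    for u in ulimits:
        name = u["Name"].lower()
        flat[f"{name}_soft"] = u["Soft"]
        flat[f"{name}_hard"] = u["Hard"]
    return flat
```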
@Cerfoglg something we should evaluate to enhance the performance of our schema definition:

- `DESC` order on `trial_id` and/or `experiment_id`, and for which table. Reference: http://www.planetcassandra.org/blog/we-shall-have-order/

If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because out of date)
@VincenzoFerme
@Cerfoglg
Discuss the design of the Cassandra database, similar to what has been done for Minio in #4