ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)

Determine a benchmarking strategy for RACK #338

Open davearcher opened 3 years ago

davearcher commented 3 years ago

start

davearcher commented 3 years ago

Benchmarking Approach (Draft) for RACK

This document proposes a strategy for benchmarking the RACK data curation suite. Comments are welcome!

Commonplace Approaches to Benchmarking

Database benchmarking has a long history in industry, most notably the TPC family of benchmark suites. Common to those benchmarking approaches is that they exercise a system's key functions against a representative schema and query set, covering all the tooling touched in commonplace workflows.

In the case of RACK, we similarly need to benchmark key functions, using a representative schema and query clause set, including all tooling touched in commonplace workflows.

How to Benchmark RACK

At present, there seem to be 3 RACK workflows: data ingestion, data curation (checking cardinalities and types), and data query. So, it makes sense to have a benchmark suite for each of these flows. Later on, as we consider entity resolution for example, we'll need to either consider it separately for benchmarking, or include it in the data curation flow.

In deciding how to reason about results of our benchmarking, we need to take into account that repetitive operations such as queries will dominate the user's perception of performance and efficiency. So, we may want to emphasize query performance as most important, with ingest and curation important but secondary to query.

We'll need to educate ourselves and our audience on how to interpret benchmark results. Absolute comparison from release to release can measure whether our directions for enhancement are helpful. Relative comparison against the current state of the art can point out the value of the ARCOS program and thus be useful to the Program Manager. The shape of benchmark curves can assist in identifying inefficiencies in our approach with respect to scale. Unfortunately, it's doubtful that benchmark results will give a particular user a useful snapshot view of how RACK will perform on their data, but that's not the point of benchmarking...

What Measures to Use in Assessing RACK Performance

Typical benchmarks attempt to measure both impact to users of the system and resource utilization of the system. The most familiar benchmarks are user impact results, typically measured in "wall clock time". Resource utilization measurements aim to assess efficiency rather than user impact: how many CPU cycles are used (either in total or in terms of an average rate) during the benchmark, how much dynamic storage is used (typically the maximum number of bytes of RAM used) during the benchmark, and how much persistent storage is used to store data between uses (typically the number of bytes of mass storage used).
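As a concrete illustration, here is a minimal Python sketch (Linux-specific) of one way these four quantities could be collected when the benchmarked operation can be driven as a child process; the command and the persistent-storage directory are placeholders, and a real deployment would instrument the relevant server rather than a local driver.

```python
# Minimal sketch (Python, Linux): collect the four measures above for an
# operation that can be driven as a child process.  The command and the
# data directory are placeholders.
import os
import resource
import subprocess
import time

def measure(cmd, persistent_dir):
    """Run `cmd` and report wall clock time, CPU time and peak RSS of the
    child, plus bytes of persistent storage under `persistent_dir` afterwards."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall_s = time.perf_counter() - start

    # Accumulated over all finished children of this driver process, so in
    # practice each benchmark run would get its own driver process.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_s = usage.ru_utime + usage.ru_stime      # user + system CPU seconds
    peak_rss_kib = usage.ru_maxrss               # peak resident set size (KiB on Linux)

    stored_bytes = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(persistent_dir)
        for name in names
    )
    return {"wall_s": wall_s, "cpu_s": cpu_s,
            "peak_rss_kib": peak_rss_kib, "stored_bytes": stored_bytes}
```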

Typically, in a client-server distributed system, benchmark measurements are made on the relevant server, because for many applications the server is where the bulk of computation occurs, and because client system characteristics may vary dramatically. Thus we ignore client-side processing, for example in submission of queries or presentation of query results.

So, we should have a strategy that measures:

- wall clock time for each benchmarked operation (user impact),
- CPU time consumed, in total or as an average rate, during the benchmark (efficiency),
- peak dynamic storage (RAM) used during the benchmark (efficiency), and
- persistent storage used to hold data between uses (efficiency).

Because these measures may not scale linearly as the amount of data stored or queried grows, it makes sense to have a set of benchmarks that measure the above characteristics at a variety of input data sizes. The independent variable - size of the input data - is a proxy for a more complex formula that represents its complexity as well, but there are no good measures (AFAIK) for that complexity. Thus we should use overall number of bytes of "raw" input data (all the bytes input via CSV files) as the independent variable for measuring in terms of scale of data.
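For example, a trivial helper along these lines could compute the independent variable for each benchmark scale; the directory names are hypothetical.

```python
# Sketch, assuming each benchmark scale is a directory of generated CSV
# files: the independent variable is simply the total raw byte count.
from pathlib import Path

def raw_input_bytes(csv_dir: str) -> int:
    """Total size of all CSV files under `csv_dir`, in bytes."""
    return sum(p.stat().st_size for p in Path(csv_dir).rglob("*.csv"))

# Hypothetical benchmark scales, keyed by raw input size for reporting.
SCALE_DIRS = ["bench/data_small", "bench/data_medium", "bench/data_large"]
for d in SCALE_DIRS:
    print(d, raw_input_bytes(d), "bytes of raw CSV input")
```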

Because these measures (or at least the first 3 of 4) may not scale linearly as the complexity of schema or queries grows, we need to decide what to do about measuring in terms of those independent variables as well. With regard to complexity of schema, it seems natural to normalize that variable out of the analysis by making all measurements using the current production release schema supported by RACK in its core ontology. With regard to complexity of queries, it seems natural to construct a query benchmark set that is a mix of queries that represent known use cases. With these two factors thus normalized, the remaining independent variable for benchmark measurements is the size of raw input data, described above.

Where to Measure RACK

Some RACK workflows occur once, with results used many times. The flow that builds node groups and CSV templates for ingestion is one example. These flows are relatively immaterial in characterizing RACK performance. "On-line" flows do affect performance with each use, however. So, our benchmarking strategy should measure only the on-line flows: data ingestion, data curation, and data query.

Measuring Data Ingestion

Unfortunately, we have as yet no idea which parts of our schema will be used in what proportion (if at all). So, we take a performance-pessimistic approach.

We measure data ingestion starting with pre-generated node groups (CSV templates), thus focusing on the on-line flow. To generate the raw input data (our independent variable for measurements), we enhance the ASSIST tool to generate the needed CSV files, since that tool already has an intimate understanding of the schema in each release version. This approach allows us to enhance our benchmarks as our schema evolves. (Note, however, that if we do take advantage of enhancing the baseline data as our production release schema evolves, we will compromise to some degree our ability to compare performance longitudinally.)

Using ASSIST as an automated way to generate data has the additional advantage that it can automatically accommodate the right "order" of ingestion. Also, note that by ingesting things in order, our benchmark can measure the impact of the necessary "lookups" required for ingesting data with foreign key relationships.

ASSIST will take as input in this setting the core schema of RACK, and a list of benchmark database sizes to create. ASSIST will synthesize data for each attribute type, and synthesize relationship instances for each entity class. The risk here is that our measurements will be pessimistic relative to actual use, where we expect not all entity instances to make use of all relationships or attributes. As we gain insight into actual use of RACK by TA1 performers, we may be able to enhance our benchmark suite to more closely represent "typical" occupancy of attributes (for example, string lengths, and which fields are often left NULL) and relationships (for example, which linkages are often unpopulated). However, we should begin with a "worst case" suite.
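To make the shape of that worst-case data concrete, here is an illustrative Python sketch of the kind of output we would ask ASSIST to produce. The two-class schema, file names, and column choices are invented for the example; the real generator would be driven by the actual core ontology.

```python
# Illustrative sketch only (ASSIST itself would do this, driven by the core
# schema): generate "worst case" CSVs in which every column of every class
# is populated, and emit classes in dependency order so that any row a
# foreign-key-style column refers to already exists when it is ingested.
import csv

# Hypothetical miniature schema: class -> (columns, columns referencing another class)
SCHEMA = {
    "SYSTEM":      (["identifier", "description"], {}),
    "REQUIREMENT": (["identifier", "description", "governs_identifier"],
                    {"governs_identifier": "SYSTEM"}),
}
INGEST_ORDER = ["SYSTEM", "REQUIREMENT"]   # referenced classes first

def generate(rows_per_class: int, out_dir: str = ".") -> None:
    for cls in INGEST_ORDER:
        columns, refs = SCHEMA[cls]
        with open(f"{out_dir}/{cls}.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(columns)
            for i in range(rows_per_class):
                row = []
                for col in columns:
                    if col in refs:
                        # point at an existing instance of the referenced class
                        row.append(f"{refs[col]}_{i % rows_per_class}")
                    elif col == "identifier":
                        row.append(f"{cls}_{i}")
                    else:
                        row.append(f"synthetic {col} {i}")
                writer.writerow(row)

generate(rows_per_class=1000)
```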

Measuring Data Curation

This portion of benchmarking is perhaps the simplest of the three. The idea here is to run the ASSIST tool across each benchmark data set, and measure CPU time, memory, and wall clock time consumed while it runs. Unfortunately, our choice to allow flexibility in the data model means that the curation process will likely be the slowest of all as well, because of the extensive scans required to find things that exist in the database but outside the data model.
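A sketch of what that benchmark loop might look like, where "assist-check" stands in for the real ASSIST command line and the data set paths are the hypothetical ones used above:

```python
# Sketch of the curation benchmark loop.  "assist-check" is a placeholder
# for the real ASSIST invocation, which may differ.
import resource
import subprocess
import time

for data_dir in ["bench/data_small", "bench/data_medium", "bench/data_large"]:
    start = time.perf_counter()
    subprocess.run(["assist-check", data_dir], check=True)   # placeholder command line
    wall = time.perf_counter() - start
    # Note: RUSAGE_CHILDREN accumulates over all finished children, so in
    # practice each data set would be measured from a fresh driver process.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print(f"{data_dir}: wall={wall:.1f}s "
          f"cpu={usage.ru_utime + usage.ru_stime:.1f}s peak_rss={usage.ru_maxrss}KiB")
```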

Measuring Data Query

Unfortunately, we have no idea what queries will be commonly used by TA3 performers. So, we begin with simple queries, and over time enhance our benchmark query set as TA3 performers contribute queries that they consider representative of their workloads. We will likely adapt those queries somewhat to be amenable to our randomly generated synthetic data, while keeping the computational and access complexity the same as in the original contributed queries.

To measure query performance, we use the same loaded databases as generated for the ingest and curation benchmarks. We assume that query results go to one or more files in the filesystem, to avoid impact of visualization workload that may be shared between client and server. We begin by manually constructing a set of queries that span the schema, including both simple and complex (path-following / recursive) queries. We review this set with TA3 performers to get their alignment that this set serves as a reasonable starting point. Note that the benchmark query set we devise may not be meaningful semantically - which doesn't matter much.
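A minimal sketch of such a driver follows; `run_query` and the query identifiers are placeholders for whatever mechanism actually executes a stored query (for example, the RACK CLI or the SemTK REST API), and results go straight to CSV files so no visualization cost is included in the measurement.

```python
# Sketch of the query benchmark driver.  The command line and query IDs are
# hypothetical; only the measurement pattern is the point.
import os
import subprocess
import time

# Hypothetical benchmark query set: simple lookups plus path-following queries.
QUERY_IDS = [
    "bench_simple_requirements",
    "bench_simple_tests",
    "bench_path_requirement_to_test_result",
]

def run_query(query_id: str, out_path: str) -> None:
    # Placeholder invocation; the real command line will differ.
    subprocess.run(["rack-query", query_id, "--out", out_path], check=True)

os.makedirs("results", exist_ok=True)
for qid in QUERY_IDS:
    start = time.perf_counter()
    run_query(qid, f"results/{qid}.csv")
    print(f"{qid}: {time.perf_counter() - start:.2f}s wall clock")
```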

Putting this Benchmark Plan into Action

The natural steps in bringing this benchmark plan into action seem to be:

1. Enhance ASSIST to synthesize worst-case CSV data sets from the current core schema at a range of raw input sizes.
2. Build the ingestion benchmark around pre-generated node groups, measuring wall clock time, CPU, RAM, and persistent storage on the server.
3. Run the ASSIST curation checks over each generated data set, with the same measurements.
4. Construct an initial query benchmark set spanning the schema, review it with TA3 performers, and measure query execution with results written to files.
5. Report results for each release, keyed by raw input data size, so releases can be compared longitudinally.

kityansiu commented 3 years ago

> In deciding how to reason about results of our benchmarking, we need to take into account that repetitive operations such as queries will dominate the user's perception of performance and efficiency. So, we may want to emphasize query performance as most important, with ingest and curation important but secondary to query.

Just wanted to point out that ingestion can also dominate the user's perception of performance, as evidenced by the following from GrammaTech: https://github.com/ge-high-assurance/RACK/issues/285.