Interesting bit from the YCSB paper:
> The workload client must make many random choices when generating load: which operation to perform (Insert, Update, Read or Scan), which record to read or write, how many records to scan, and so on. These decisions are governed by random distributions. YCSB has several built-in distributions:
>
> - Uniform: Choose an item uniformly at random. For example, when choosing a record, all records in the database are equally likely to be chosen.
> - Zipfian: Choose an item according to the Zipfian distribution. For example, when choosing a record, some records will be extremely popular (the head of the distribution) while most records will be unpopular (the tail).
> - Latest: Like the Zipfian distribution, except that the most recently inserted records are in the head of the distribution.
> - Multinomial: Probabilities for each item can be specified. For example, we might assign a probability of 0.95 to the Read operation, a probability of 0.05 to the Update operation, and a probability of 0 to Scan and Insert. The result would be a read-heavy workload.
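The multinomial part maps naturally to a weighted random choice. A minimal sketch (not YCSB's code; the function name and the `rng` parameter are made up here, and `rng` is assumed to be a seeded PRNG returning floats in [0, 1)):

```js
// Weighted (multinomial) choice of the next operation, for illustration only.
function operationChooser (proportions, rng = Math.random) {
  const entries = Object.entries(proportions).filter(([, p]) => p > 0)
  const total = entries.reduce((sum, [, p]) => sum + p, 0)

  return function next () {
    let u = rng() * total
    for (const [op, p] of entries) {
      if ((u -= p) < 0) return op
    }
    return entries[entries.length - 1][0] // floating-point rounding guard
  }
}

// A read-heavy workload as in the example above.
const next = operationChooser({ read: 0.95, update: 0.05, scan: 0, insert: 0 })
```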
Also see the HyperDex benchmarks (which used YCSB)
Progress: wrote a pseudo-random key generator, based on a seed (meaning each benchmark can use the same set of keys) and a distribution (uniform or zipfian). The y-axis here shows the generated keys as integers for visual purposes; they are actually encoded with `lexicographic-integer` (hex or binary).
(The keys in the zipfian distribution lean towards 0. Maybe I'm misusing it; if not, zero-padding or offsetting is going to be important so that the most frequently used keys don't also have the fewest bytes.)
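For reference, a rough sketch of such a generator, assuming a small seeded PRNG (mulberry32, included only to make the example runnable) and a naive inverse-CDF Zipfian sampler. `lexicographic-integer` is the encoding mentioned above; everything else (names, options) is hypothetical and not the actual code:

```js
const lexint = require('lexicographic-integer')

// Hypothetical generator: same seed + distribution => same sequence of keys.
function keyGenerator ({ seed, distribution, n, skew = 1 }) {
  const rng = mulberry32(seed)
  let sample

  if (distribution === 'uniform') {
    sample = () => Math.floor(rng() * n)
  } else {
    // Naive Zipfian sampler: precompute the CDF over n ranks and invert it
    // with a binary search. With skew = 0 every rank is equally likely
    // (uniform); with skew = 1 rank 0 dominates, hence the keys leaning
    // towards 0 and the need for zero-padding or offsetting.
    let norm = 0
    for (let i = 1; i <= n; i++) norm += 1 / Math.pow(i, skew)
    const cdf = []
    let acc = 0
    for (let i = 1; i <= n; i++) {
      acc += 1 / Math.pow(i, skew) / norm
      cdf.push(acc)
    }
    sample = () => {
      const u = rng()
      let lo = 0
      let hi = n - 1
      while (lo < hi) {
        const mid = (lo + hi) >> 1
        if (cdf[mid] < u) lo = mid + 1
        else hi = mid
      }
      return lo
    }
  }

  // Hex-encode so that keys sort lexicographically in numeric order.
  return () => lexint.pack(sample(), 'hex')
}

// Mulberry32, a tiny seeded PRNG, included only to keep the sketch self-contained.
function mulberry32 (a) {
  return function () {
    let t = (a += 0x6D2B79F5)
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Usage: the same seed always yields the same keys.
const nextKey = keyGenerator({ seed: 42, distribution: 'zipfian', n: 1000 })
console.log(nextKey(), nextKey())
```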
> the keys in the zipfian distribution lean towards 0
Happened because I used a `skew` parameter (explained here) of 1. When `skew` is 0 it correctly results in a uniform distribution. I compared the results to a simpler uniform algorithm (`random() * N`) (both using the same seeded PRNG) and the results are very close:
Edit: I went with a different algorithm (that is uniform but not `random() * N`) in the end, disregard this.
There are differences between `write-random`/`-sorted` and `write` that we need to consolidate:

1. `write-random`/`-sorted` collects duration per op
2. `write` collects average duration per 1000 ops, and throughput in MB/s, averaged over time

While (1) is exact and paints a more complete picture, it's also more data to plot - which `gnuplot` can barely handle. I definitely want to keep the throughput metric ~~but I prefer to collect the average throughput of the last 1000 ops. And add a command-line argument to control the size of that window (default is 1000; if 1, aggregation is effectively disabled; if N, you get a cumulative average).~~<sup>1</sup>
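To make the difference concrete, a sketch of the two collection styles (function names and CSV layout are made up here, not the actual level-bench output):

```js
// Style (1): one row per op with its exact duration.
function exactRecorder (stream) {
  return (durationNs) => stream.write(`${durationNs}\n`)
}

// Style (2): one row per `window` ops with the average duration (ms) of that
// window and the cumulative throughput (MB/s) since the start.
function aggregatedRecorder (stream, window = 1000) {
  let ops = 0
  let windowNs = 0
  let totalBytes = 0
  const startNs = process.hrtime.bigint()

  return (durationNs, bytes) => {
    windowNs += durationNs
    totalBytes += bytes

    if (++ops % window === 0) {
      const elapsedSec = Number(process.hrtime.bigint() - startNs) / 1e9
      const avgMs = windowNs / window / 1e6
      const mbPerSec = totalBytes / (1024 * 1024) / elapsedSec
      stream.write(`${avgMs.toFixed(3)},${mbPerSec.toFixed(2)}\n`)
      windowNs = 0
    }
  }
}
```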
We could record values into `native-hdr-histogram` or `student-histogram` and then also display statistics like mean, stddev and relative margin of error as text in the plot.
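A rough sketch of how that could look with `native-hdr-histogram` (constructor and method names as I recall them from its README; double-check before relying on this):

```js
const Histogram = require('native-hdr-histogram')

// Track per-op durations in microseconds, between 1 µs and 60 s.
const histogram = new Histogram(1, 60 * 1e6)
let count = 0

function record (durationUs) {
  histogram.record(Math.round(durationUs))
  count++
}

function summary () {
  const mean = histogram.mean()
  const stddev = histogram.stddev()
  // Relative margin of error: half-width of the confidence interval as a
  // percentage of the mean. 1.96 approximates the critical value for 95%
  // confidence on large samples (benchmark.js uses a t-table instead).
  const rme = (1.96 * (stddev / Math.sqrt(count)) / mean) * 100
  return `mean ${mean.toFixed(1)} µs, stddev ${stddev.toFixed(1)}, ±${rme.toFixed(2)}%`
}
```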
We could also make that "window" dynamic, à la `benchmark.js` (JSPerf), which has a warmup step to find the number of ops needed per observation to make timings statistically significant.<sup>2</sup>
<sup>1</sup> Down the line we should write raw, non-aggregated values to the CSV file and derive new data from that when needed, but at the moment that would complicate the integration with gnuplot, having to write derived data to intermediary CSV files. gnuplot itself can also compute stats but I want to move away from gnuplot, not become more dependent on it.
<sup>2</sup> For future reference, the minimum duration of an observation is `Math.max(resolutionInSeconds() / 2 / 0.01, 0.05)` for a certainty of 99%, where `resolutionInSeconds()` is the average smallest measurable time with `process.hrtime()`.
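A sketch of how footnote 2 could be computed, with `resolutionInSeconds()` implemented by spinning on `process.hrtime()` (the sampling approach here is an assumption, modelled on what benchmark.js does):

```js
// Estimate the smallest interval process.hrtime() can measure, averaged over
// a number of samples.
function resolutionInSeconds (samples = 50) {
  let sum = 0

  for (let i = 0; i < samples; i++) {
    const begin = process.hrtime()
    let elapsed

    do {
      elapsed = process.hrtime(begin)
    } while (elapsed[0] === 0 && elapsed[1] === 0) // spin until the clock ticks

    sum += elapsed[0] + elapsed[1] / 1e9
  }

  return sum / samples
}

// Minimum duration of one observation: the resolution divided by 2 times the
// accepted uncertainty (1%, i.e. 99% certainty), with a floor of 50 ms.
const minSeconds = Math.max(resolutionInSeconds() / 2 / 0.01, 0.05)
console.log('minimum duration per observation: %d s', minSeconds)
```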
And add options for the keys and values, e.g. `--keys`/`--values <type>` where `type` is one of `random`, `seq`, `seqReverse`. Optionally combined with `--keySize`/`--valueSize`. And rename the merged benchmark to `put`.
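A sketch of how `--keys <type>` (with `--keySize`) could map to generator functions; the option and type names are from the proposal above, the implementation is made up:

```js
const crypto = require('crypto')

function keyGen (type, keySize = 16, total = 1e6) {
  let i = 0

  switch (type) {
    case 'random':
      // Random hex string of keySize characters.
      return () => crypto.randomBytes(Math.ceil(keySize / 2)).toString('hex').slice(0, keySize)
    case 'seq':
      // Zero-padded so that lexicographic order equals insertion order.
      return () => String(i++).padStart(keySize, '0')
    case 'seqReverse':
      return () => String(total - ++i).padStart(keySize, '0')
    default:
      throw new Error('Unknown key type: ' + type)
  }
}

// e.g. level-bench run put --keys seq --keySize 12
const nextKey = keyGen('seq', 12)
console.log(nextKey(), nextKey()) // '000000000000' '000000000001'
```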
The next steps might be to:

- Add a length distribution option (e.g. `--length-distribution zipfian`) (https://github.com/Level/levelup/issues/227)
- Add a workload benchmark with configurable operation proportions (e.g. `level-bench run workload --read-proportion 0.4 --update-proportion ..`).