Data Generation - Githubissues

beepy0 commented 5 years ago

[x] Understand data generation for this thesis (what type of data) ((unsigned) integers should do)
[x] Understand how the provided open-source data generators work
[x] Find seeded data generators (using some distribution, like zip/normal) to generate consistent data
[x] Make use of data generators to prepare a Zipfian distribution test data
[x] Make use of data generators to prepare a Uniform distribution test data, as a means of worst-case scenario
[x] Generate data with a discrete Normal Distribution pattern.

beepy0 commented 5 years ago

https://honingds.com/blog/python-random/

beepy0 commented 5 years ago

Test Data

While there are many different types of attributes to join on, we have decided that it is enough to test on a single data type since the goal is to measure update speed and nothing else. For that reason, we chose to supply the sketch algorithms with a vector of unsigned integers, each 32bit (4 Byte) in length.

The choice of data distributions was split into three categories - zipfian, normal-like with discrete values, and an uniform distribution. While the first two are more likely to represent real-world scenarios, a uniform distribution might point out to cache memory limitations since the data being processed is constantly changing and thus less likely to be kept in cache.

Two data sizes of 100k and 1000k for each distribution were pre-computed using Python's Numpy library (REF). Thet were then loaded into RAM before each set of runs to simulate rapid incoming data. Diagrams 1,2,3 quickly show the output of these distributions and sample code is also provided in the project under (Dir REF).

First baseline run impressions

Initial testing shows Fast-AGMS performing much faster under the current parameters and buckets/rows_no of 100. Doubling both buckets' and rows' sizes (TODO rerun again because of bugfixes) quadruples the runtime for AGMS while it only doubles for Fast-AGMS. Conversely, lowering the sizes to 10 each results in Fast-AGMS (TODO rerun again because of bugfixes) scaling almost linearly (~9.3 times faster execution time) while AGMS (TODO rerun again because of bugfixes) performing up to 109 times faster. This is a clear indication that AGMS does a lot of computation updating each random variable in its vector whilst Fast-AGMS's hash based updating mitigates that - something that could point towards a compute bound for the prior and thus benefit more from SIMD optimization alone. TODO This will be investigated further in the optimizations chapter. Another point for detailed analysis comes from using the larger data sample (1000k) instead. In this case the AGMS algorithm (TODO rerun again because of bugfixes) slows further compared to Fast-AGMS (36.8:1 vs 33.9:1 times slower runtime for AGMS) which indicates a bottleneck. TODO We have conducted multiple tests for the baseline and have plotted the resulting performance degradation for each based on data size.

beepy0 / thesis

Data Generation #14

Test Data

First baseline run impressions