I'll add a generator which produces `bitcount` bits and randomly assigns them to rows. It can save and load the generated random numbers. Using 100 million cached random numbers to produce 1 billion bits takes about 1 minute (including loading the cache). Is that the desired level of speed, or do we need to be faster than that?
Generating 1 million bits for 100k rows using the approach above takes 2 seconds.
Most of those 2 seconds seem to be spent loading the cached bits. Using 1 million cached ~~bits~~ random ints, 100k rows, and 1 million bits takes 0.15 seconds.
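For illustration, here's a minimal sketch of the save/load caching idea, assuming a flat file of little-endian uint64s; the helper names and file format are hypothetical, not imagine's actual implementation:

```go
package main

import (
	"encoding/binary"
	"math/rand"
	"os"
)

// saveCache writes n random uint64s to path so later runs can reuse them
// instead of regenerating. (Hypothetical helper, not imagine's real API.)
func saveCache(path string, n int) error {
	buf := make([]byte, 8*n)
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint64(buf[8*i:], rand.Uint64())
	}
	return os.WriteFile(path, buf, 0o644)
}

// loadCache reads the cached uint64s back into memory.
func loadCache(path string) ([]uint64, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	nums := make([]uint64, len(buf)/8)
	for i := range nums {
		nums[i] = binary.LittleEndian.Uint64(buf[8*i:])
	}
	return nums, nil
}

func main() {
	if err := saveCache("random.cache", 1_000_000); err != nil {
		panic(err)
	}
	nums, _ := loadCache("random.cache")
	_ = nums // reuse these across runs instead of calling the RNG per bit
}
```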
With the added Zipf distribution of bits per row, generating a total of 1 billion bits for 100k rows takes 1:15 minutes on my machine:
```
time imagine --no-import imagine/sample_fast.toml
```
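For context, Go's standard library can produce this kind of skewed row distribution directly. A minimal sketch, assuming the `zipfS`/`zipfV` options map onto `rand.NewZipf`'s `s`/`v` arguments (the actual generator may wire them differently):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	r := rand.New(rand.NewSource(1))
	// rand.NewZipf requires s > 1 and v >= 1; imax is the highest row ID.
	// Values here mirror the "sparse" field in the config further down:
	// zipfS = 2.0, zipfV = 3.0, 100k rows.
	z := rand.NewZipf(r, 2.0, 3.0, 100000-1)
	counts := make([]uint64, 100000)
	for i := 0; i < 1_000_000; i++ {
		counts[z.Uint64()]++ // low row IDs receive most of the bits
	}
	fmt.Println("bits in row 0:", counts[0], "| bits in row 9999:", counts[9999])
}
```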
Notes:
Not sure about the ~~OOM~~ crashing issue. This is what I get using `--no-import`:
```
$ time imagine --no-import imagine/samples/jaffee.toml
server memory: 32GB [31962MB]
server CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz [4 physical cores, 8 logical cores available]
cluster nodes: 1
total bits: 1000000
done.

real	0m0.122s
user	0m0.112s
sys	0m0.028s
```
Running it without `--no-import` makes it consume too much memory. I've also tested by printing the bits out to a CSV file and using `pilosa import`; the memory usage grows a lot!
I've added a `uniqueColumns` option to constrain the number of distinct columns the random bits are generated for. So if `columns == 1000` and `uniqueColumns == 10`, the same column would be used 100 times.
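Roughly how I'd expect that constraint to behave; this is a hypothetical sketch for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// column picks a column ID for the next bit. When uniqueColumns is set and
// smaller than columns, picks are folded into that many distinct values, so
// each distinct column is reused columns/uniqueColumns times on average.
func column(r *rand.Rand, columns, uniqueColumns uint64) uint64 {
	if uniqueColumns > 0 && uniqueColumns < columns {
		return r.Uint64() % uniqueColumns
	}
	return r.Uint64() % columns
}

func main() {
	r := rand.New(rand.NewSource(1))
	fmt.Println(column(r, 1000, 10)) // always one of only 10 distinct columns
}
```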
Also, I've changed `fast` -> `fastSparse` and removed the global `fastSparse`. Fields should now have `fastSparse == true` to enable it.
I've removed `loadBits` and changed the generator to use the `density` parameter, setting bits for the smaller rows first. I've removed the randomization of bits per row, so the smaller rows always have more bits. It is considerably slower now though, producing about 30 million bits per minute; I'll update it to make it faster.
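A hedged sketch of that generation order, assuming a simple per-row density falloff (both the falloff shape and the `setBit` sink are assumptions, not imagine's actual code):

```go
package main

import "math/rand"

// generate visits rows in increasing order and sets each (row, column) bit
// with a probability derived from density, so lower row IDs always end up
// with more bits; there is no per-row randomization of bit counts.
func generate(r *rand.Rand, rows, columns uint64, density float64) {
	for row := uint64(0); row < rows; row++ {
		// Assumed falloff: effective density shrinks as the row ID grows.
		p := density / float64(row+1)
		for col := uint64(0); col < columns; col++ {
			if r.Float64() < p {
				setBit(row, col)
			}
		}
	}
}

func setBit(row, col uint64) {} // stand-in; the real tool would buffer/import bits

func main() {
	generate(rand.New(rand.NewSource(1)), 1000, 10000, 0.1)
}
```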
I've added a `ShardWidth` parameter to the index spec, but it is not being used yet.
I wouldn't worry too much about making it faster at this point... I haven't verified the results yet, but with the following config, it's orders of magnitude faster:
version = "1.0"
[indexes.newsers]
columns = 10000000
fields = [
{name = "fastsparse", type = "set", min=0, max=10000, zipfA=1.0, fastSparse = true, density = 0.1 },
{name = "sparse", type = "set", min=0, max=10000, zipfV = 3.0, zipfS = 2.0, density = 0.1 },
]
[[workloads]]
name = "sample"
threadCount = 1
tasks = [
{ index = "newsers", field = "fastsparse", dimensionOrder="row" },
{ index = "newsers", field = "sparse", dimensionOrder="row" },
]
@jaffee Is it OK to merge this?
yes! thanks @yuce!
Great!
- Adds `--no-import` and `--print-out` flags.
- ~~Adds `fast` and `probability` options.~~
- ~~When `fast` is set, the density calculation is skipped and a bit is set when `randf < probability`.~~
- ~~`fast` seems to cut the time to generate bits in half.~~