I'll add a generator which produces `bitcount` bits and randomly assigns them to rows. It can save and load the generated random numbers. Using 100 million cached random numbers to produce 1 billion bits takes about 1 minute (including loading the cache). Is that the desired level of speed, or do we need to be faster than that?
Generating 1 million bits for 100k rows using the approach above takes 2 seconds.
Most of those 2 seconds seem to be spent loading the cached bits. Using 1 million cached ~~bits~~ random ints, 100k rows, and 1 million bits takes 0.15 seconds.
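For illustration, here's a minimal sketch of the save/load caching idea, assuming a flat file of little-endian uint64s; the helper names and file format are hypothetical, not imagine's actual implementation:

```go
package main

import (
	"encoding/binary"
	"math/rand"
	"os"
)

// saveCache writes n random uint64s to path so later runs can reuse them
// instead of regenerating. (Hypothetical helper, not imagine's real API.)
func saveCache(path string, n int) error {
	buf := make([]byte, 8*n)
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint64(buf[8*i:], rand.Uint64())
	}
	return os.WriteFile(path, buf, 0o644)
}

// loadCache reads the cached uint64s back into memory.
func loadCache(path string) ([]uint64, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	nums := make([]uint64, len(buf)/8)
	for i := range nums {
		nums[i] = binary.LittleEndian.Uint64(buf[8*i:])
	}
	return nums, nil
}

func main() {
	if err := saveCache("random.cache", 1_000_000); err != nil {
		panic(err)
	}
	nums, _ := loadCache("random.cache")
	_ = nums // reuse these across runs instead of calling the RNG per bit
}
```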
With the added Zipf distribution of bits per row, generating a total of 1 billion bits for 100k rows takes 1:15 minutes on my machine:
```
time imagine --no-import imagine/sample_fast.toml
```
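For context, Go's standard library can produce this kind of skewed row distribution directly. A minimal sketch, assuming the `zipfS`/`zipfV` options map onto `rand.NewZipf`'s `s`/`v` arguments (the actual generator may wire them differently):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	r := rand.New(rand.NewSource(1))
	// rand.NewZipf requires s > 1 and v >= 1; imax is the highest row ID.
	// Values here mirror the "sparse" field in the config further down:
	// zipfS = 2.0, zipfV = 3.0, 100k rows.
	z := rand.NewZipf(r, 2.0, 3.0, 100000-1)
	counts := make([]uint64, 100000)
	for i := 0; i < 1_000_000; i++ {
		counts[z.Uint64()]++ // low row IDs receive most of the bits
	}
	fmt.Println("bits in row 0:", counts[0], "| bits in row 9999:", counts[9999])
}
```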
Notes:
Not sure about the ~~OOM~~ crashing issue. This is what I get using `--no-import`:
```
$ time imagine --no-import imagine/samples/jaffee.toml
server memory: 32GB [31962MB]
server CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz [4 physical cores, 8 logical cores available]
cluster nodes: 1
total bits: 1000000
done.

real	0m0.122s
user	0m0.112s
sys	0m0.028s
```
Running it without `--no-import` makes it consume too much memory. I've also tested by printing the bits out to a CSV file and using `pilosa import`; the memory usage grows a lot!
I've added a `uniqueColumns` option to constrain the number of distinct columns the random bits are generated for. So if `columns == 1000` and `uniqueColumns == 10`, the same column would be used 100 times.
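Roughly how I'd expect that constraint to behave; this is a hypothetical sketch for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// column picks a column ID for the next bit. When uniqueColumns is set and
// smaller than columns, picks are folded into that many distinct values, so
// each distinct column is reused columns/uniqueColumns times on average.
func column(r *rand.Rand, columns, uniqueColumns uint64) uint64 {
	if uniqueColumns > 0 && uniqueColumns < columns {
		return r.Uint64() % uniqueColumns
	}
	return r.Uint64() % columns
}

func main() {
	r := rand.New(rand.NewSource(1))
	fmt.Println(column(r, 1000, 10)) // always one of only 10 distinct columns
}
```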
Also, I've changed `fast` -> `fastSparse` and removed the global `fastSparse`. Fields should now have `fastSparse == true` to enable it.
I've removed `loadBits` and changed the generator to use the `density` parameter, setting bits for the smaller rows first. I've removed the randomization of bits per row, so the smaller rows always have more bits. It is considerably slower now though, producing about 30 million bits per minute; I'll update it to make it faster.
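A hedged sketch of that generation order, assuming a simple per-row density falloff (both the falloff shape and the `setBit` sink are assumptions, not imagine's actual code):

```go
package main

import "math/rand"

// generate visits rows in increasing order and sets each (row, column) bit
// with a probability derived from density, so lower row IDs always end up
// with more bits; there is no per-row randomization of bit counts.
func generate(r *rand.Rand, rows, columns uint64, density float64) {
	for row := uint64(0); row < rows; row++ {
		// Assumed falloff: effective density shrinks as the row ID grows.
		p := density / float64(row+1)
		for col := uint64(0); col < columns; col++ {
			if r.Float64() < p {
				setBit(row, col)
			}
		}
	}
}

func setBit(row, col uint64) {} // stand-in; the real tool would buffer/import bits

func main() {
	generate(rand.New(rand.NewSource(1)), 1000, 10000, 0.1)
}
```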
I've added a `ShardWidth` parameter to the index spec, but it is not being used yet.
I wouldn't worry too much about making it faster at this point... I haven't verified the results yet, but with the following config, it's orders of magnitude faster:
version = "1.0"
[indexes.newsers]
columns = 10000000
fields = [
{name = "fastsparse", type = "set", min=0, max=10000, zipfA=1.0, fastSparse = true, density = 0.1 },
{name = "sparse", type = "set", min=0, max=10000, zipfV = 3.0, zipfS = 2.0, density = 0.1 },
]
[[workloads]]
name = "sample"
threadCount = 1
tasks = [
{ index = "newsers", field = "fastsparse", dimensionOrder="row" },
{ index = "newsers", field = "sparse", dimensionOrder="row" },
]
@jaffee Is it OK to merge this?
yes! thanks @yuce!
Great!
- Adds `--no-import` and `--print-out` flags.
- ~~Adds `fast` and `probability` options.~~
- ~~When `fast` is set, the density calculation is skipped and a bit is set when `randf < probability`.~~
- ~~`fast` seems to cut the time to generate bits in half.~~