cathugger / mkp224o

vanity address generator for tor onion v3 (ed25519) hidden services
Creative Commons Zero v1.0 Universal
1.26k stars 149 forks source link

Benchmark/Expected time for a N-digits address? #27

Open stevefan1999-personal opened 4 years ago

stevefan1999-personal commented 4 years ago

I want to know the average case of getting the desired vanity address on various CPUs, compilers, OS, etc.

I compiled my mkp224o with clang 7 and march=native, in Debian Buster and with 2x Xeon E5-2670

hello:

./mkp224o hello -B -S 5 -j 16 -d keys
>calc/sec:5095952.833017, succ/sec:0.000000, rest/sec:159.888078, elapsed:0.100070sec
hello23twa7k536qggxfeqm4orwohwlca3ln6f43b4splcym57msaxid.onion

hellow:

./mkp224o hellow -B -S 5 -j 16 -d keys
...
>calc/sec:7007205.907981, succ/sec:0.000000, rest/sec:0.000000, elapsed:15.111171sec                           [27/9839]
hellowthsxpoy4pq4vensv2drz6qv562d3sfqg5pk4nwqnik4hii53ad.onion

hellowo:

./mkp224o hellowo -B -S 5 -j 16 -d keys
...
>calc/sec:7318805.372164, succ/sec:0.000000, rest/sec:0.000000, elapsed:155.117082sec
hellowoy3ahurx54upkhdut2ww6kwsmzrrtve6imdul4f5yr4kpsb2id.onion

Should I need more samples so that we can calculate the variance? This is pretty random at best.

cathugger commented 4 years ago

Nature of the way it works is, in fact, random. Single char is 5 bits (base32) so probability of hitting right thing shrinks 2^5=32 times by each added character. You'd need to run it longer to calculate averange. I'd recommend first benchmarking differences with lower char count as higher hit rate will demand less total benchmark time to smooth out noise. Best way to benchmark for purpose of speed improvement (like picking different compilers/processing backends) is looking at calc/sec stat, longer char counts shouldn't affect it. Properly averanged over longer period of time hit benchmark shouldn't diverge significantly from theorethical prediction of 32 times less hit rate each added character. Due to probablistic nature of it, there can always be surprises and disappointments, even if you know averange time it can find it sooner or take longer.

Steve Fan notifications@github.com on Fri, 20 Dec 2019 19:09:30 -0800 in cathugger/mkp224o/issues/27@github.com wrote:

I want to know the average case of getting the desired vanity address on various CPUs, compilers, OS, etc.

I compiled my mkp224o with clang 7 and march=native, in Debian Buster and with 2x Xeon E5-2670

hello:

./mkp224o hello -B -S 5 -j 16 -d keys
>calc/sec:5095952.833017, succ/sec:0.000000, rest/sec:159.888078, elapsed:0.100070sec
hello23twa7k536qggxfeqm4orwohwlca3ln6f43b4splcym57msaxid.onion

hellow:

./mkp224o hellow -B -S 5 -j 16 -d keys
...
>calc/sec:7007205.907981, succ/sec:0.000000, rest/sec:0.000000, elapsed:15.111171sec                           [27/9839]
hellowthsxpoy4pq4vensv2drz6qv562d3sfqg5pk4nwqnik4hii53ad.onion

hellowo:

./mkp224o hellowo -B -S 5 -j 16 -d keys
...
>calc/sec:7318805.372164, succ/sec:0.000000, rest/sec:0.000000, elapsed:155.117082sec
hellowoy3ahurx54upkhdut2ww6kwsmzrrtve6imdul4f5yr4kpsb2id.onion

Should I need more samples so that we can calculate the variance? This is pretty random at best.

YoRyan commented 4 years ago

The math (a geometric discrete probability distribution) tells us that, on average, it will take 32^(# of chars) iterations to find a desired address. So take that number and divide by your calcs/sec to get the expected run time until your first match.

For example, my i7-3770 does ~15,400,000 calcs/sec. To find a 7-character prefix, the anticipated run time is

32^7/15400000 = 2231.15 secs = 37 mins

And indeed, it took me 40 minutes to find my first address.

(Note that this compares very favorably with shallot, which quotes "1 day" for a 1.5GHz CPU and a 7-character prefix.)

stevefan1999-personal commented 4 years ago

The math (a geometric discrete probability distribution) tells us that, on average, it will take 32^(# of chars) iterations to find a desired address. So take that number and divide by your calcs/sec to get the expected run time until your first match.

For example, my i7-3770 does ~15,400,000 calcs/sec. To find a 7-character prefix, the anticipated run time is

32^7/15400000 = 2231.15 secs = 37 mins

And indeed, it took me 40 minutes to find my first address.

(Note that this compares very favorably with shallot, which quotes "1 day" for a 1.5GHz CPU and a 7-character prefix.)

Did this means it takes for any x in length of digits, it should theoretically take 2^(5x) tries to have at least one match? This is huge 🤔

YoRyan commented 4 years ago

Correct! But it's just an average. You could get especially lucky or unlucky and find something sooner or later.

cathugger commented 4 years ago

(Note that this compares very favorably with shallot, which quotes "1 day" for a 1.5GHz CPU and a 7-character prefix.)

"1.5GHz CPU" isn't very descriptive, though, and that table is quite outdated (history), so I'd say it's just difference of CPU powers, and docs being outdated. One of the reasons I didn't put any stats in README.txt in the first place.

Did this means it takes for any x in length of digits, it should theoretically take 2^(5x) tries to have at least one match?

No. It means, that on average you should get one match every 2^(5x)=32^x tries. That's a little bit different. You can get very lucky and hit much sooner. Or very unlucky and hit much later. I've seen both happening. Statistically, yeah, you'll get one match every 32^x tries (if using only one filter of x chars), but for that to show you'll need to average over much more than one hit, so I'm not sure if such expectation formula is very trustworthy when N=1. This stuff is sorta confusing, but I just don't wanna overpromise such things in readme, especially when most users want only one or few addresses.

scribblemaniac commented 4 years ago

I agree with the math that's been presented here so far. If you want to be a bit more conservative, you can calculate the number of trials t it would take to get at least 1 match with n characters and a probability p with the formula t = log(1-p)/log(1-1/32^n). This comes from the distribution X~Binomial(t, 1/32^n) and P(X>=1)=p. Here's a table of what the number of trials looks like for some values common values. The number of digits are on the left and the probability of finding at least one success along the top:

50% 80% 90% 95% 99%
2 710 1648 2357 3067 4714
3 22713 52738 75450 98163 150900
4 726818 1687618 2414435 3141252 4828869
5 23258160 54003775 77261934 100520094 154523868
6 744261118 1728120799 2472381917 3216643035 4944763834
7 23816355775 55299865590 79116221365 102932577139 158232442729

The way to read this is like so: "It is expected that you will find at least one match for a 2 character search after 710 trials 50% of the time." So while yes on average it will take about ~24 billion trials to find a match for a 7 digit prefix, it will take ~158 billion trials to be 99% sure that you will find a match.

cathugger commented 4 years ago

This is very cool, I knew something like this was doable but didn't remember exact maths for this. I almost wanna write small utility to calculate these and print into terminal now. That'd probably be pretty useful, to get such table with expected times, from input of calc/sec stat.

scribblemaniac notifications@github.com on Sun, 22 Dec 2019 10:33:24 -0800 in cathugger/mkp224o/issues/27/568291087@github.com wrote:

I agree with the math that's been presented here so far. If you want to be a bit more conservative, you can calculate the number of trials t it would take to get at least 1 match with n characters and a probability p with the formula t = log(1-p)/log(1-1/32^n). This comes from the distribution X~Binomial(t, 1/32^n) and P(X>=1)=p. Here's a table of what the number of trials looks like for some values common values. The number of digits are on the left and the probability of finding at least one success along the top:

50% 80% 90% 95% 99%
2 710 1648 2357 3067 4714
3 22713 52738 75450 98163 150900
4 726818 1687618 2414435 3141252 4828869
5 23258160 54003775 77261934 100520094 154523868
6 744261118 1728120799 2472381917 3216643035 4944763834
7 23816355775 55299865590 79116221365 102932577139 158232442729

The way to read this is like so: "It is expected that you will find at least one match for a 2 character search after 710 trials 50% of the time." So while yes on average it will take about ~24 billion trials to find a match for a 7 digit prefix, it will take ~158 billion trials to be 99% sure that you will find a match.

YoRyan commented 4 years ago

"1.5GHz CPU" isn't very descriptive, though, and that table is quite outdated (history), so I'd say it's just difference of CPU powers, and docs being outdated.

It also reflects the fact that mkp224o is a lot faster, especially with the batching code enabled. ;)

This stuff is sorta confusing, but I just don't wanna overpromise such things in readme, especially when most users want only one or few addresses.

Right, there is always some uncertainty involved, but as we've seen, there is a way to quantify the exact probabilities. It is possible to be more precise than "some time within the next 30 seconds or next 30 years."

DadiBit commented 4 years ago

You can calculate the number of trials t it would take to get at least 1 match with n characters and a probability p with the formula t = log(1-p)/log(1-1/32^n).

In C++ (I think in C too) it's a bit faster log(1-p)/log(1-1/32)*pow(32,n-1); rather then log(1-p)/log(1-1/pow(32,n));. This is of course a one time calculation before running the program and requires way less the a ms, so it shouldn't matter too much.