Pyrit does not scale well for multiple GPUs

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

Two video cards installed, everything set up and running. Here is a benchmark:

#1: 'CAL++ Device #1 'ATI CYPRESS'': 82426.3 PMKs/s (RTT 2.4)
#2: 'CAL++ Device #2 'ATI JUNIPER'': 41805.7 PMKs/s (RTT 2.6)
#3: 'CPU-Core (SSE2)': 655.1 PMKs/s (RTT 3.0)
#4: 'CPU-Core (SSE2)': 691.0 PMKs/s (RTT 2.9)
#5: 'Network-Clients': 0.0 PMKs/s (RTT 0.0)

When I run a real test on 10 million passwords:

localhost:~# time pyrit -e test -r wpa.cap -i list.txt attack_passthrough
Parsing file 'wpa.cap' (1/1)...
Parsed 5 packets (5 802.11-packets), got 1 AP(s)

Picked AccessPoint 00:0d:93:eb:b0:8c automatically...
Tried 10000000 PMKs so far; 87027 PMKs per second.

Password was not found.

real    2m9.549s
user    5m33.769s
sys     0m30.366s

The PC needs 129 seconds to process 10 million passwords, which means about 77,500 PMKs/s.

I also ran another test: create an ESSID, fill the database with passwords and run 
batch. Here is the result:

localhost:~# time pyrit batch

Connecting to storage at 'file://'...  connected.
Working on ESSID 'TEST'
Processed all workunits for ESSID 'TEST'; 104442 PMKs per second.d.

Batchprocessing done.

real    1m51.233s
user    4m29.477s
sys     0m34.406s

The PC needs 111 seconds to process 10 million passwords, which means about 90,090 PMKs/s.
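
As a quick check of the arithmetic, the effective rate is simply the password count divided by the wall-clock time. A minimal sketch using the `real` times reported by `time` in the two runs (the function name `effective_rate` is made up for illustration and is not part of pyrit):

```python
def effective_rate(n_passwords, wall_seconds):
    """Return the average PMKs per second over the whole run."""
    return n_passwords / wall_seconds


# 'real' times taken from the two runs: 2m9.549s and 1m51.233s
attack_rate = effective_rate(10_000_000, 2 * 60 + 9.549)
batch_rate = effective_rate(10_000_000, 1 * 60 + 51.233)

print("attack_passthrough: %.0f PMKs/s" % attack_rate)
print("batch:              %.0f PMKs/s" % batch_rate)
```

Both figures include start-up and parsing time, so they come out lower than the per-second rate pyrit prints while it is running.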

What is the expected output? What do you see instead?
The expected result is a rate of at least 120000 PMKs/s in all cases.

What version of the product are you using? On what operating system?
pyrit 276
Linux 2.6.32-5-amd64

Please provide any additional information below.
calpp 0.87 (latest available)
ATI drivers 10.7
ATI stream 2.1

Note that with my previous installation, with 

-pyrit r250
-linux 2.6.26-2-amd64
-ATI driver 10.2 

the rate was always >= 120000 PMKs/s.

Original issue reported on code.google.com by pyrit.lo...@gmail.com on 1 Aug 2010 at 2:17

Blocking: #261

GoogleCodeExporter commented 8 years ago

I did 2 tests.

A. calpp 0.87 with the old version pyrit-calp-v2b-2 (taken from issue #148): the 
rate was correct, 10.000.000 passwords in 97 seconds.

B. I deleted /usr/include/cal and downgraded to calpp 0.86.3 + pyrit 276, but the 
problem persists: 10.000.000 passwords in 122 seconds.

I ask other people with ATI cards to install calpp 0.87 + pyrit 276, run some 
real tests (not only benchmark or benchmark_long) and report whether they have the 
same issue.

Original comment by pyrit.lo...@gmail.com on 2 Aug 2010 at 7:34

GoogleCodeExporter commented 8 years ago

Could you test v2b-4 with false, false? If it works correctly, then we are 
hitting a bottleneck in the code that feeds the core. 

The main difference between v2b-4 (false, false) and the svn version is the size of the data 
sent to the GPU (roughly 3x bigger). There might be a problem in pyrit that 
causes a slowdown when working with such big data blocks.

You can disable the CPU cores. If that "solves" the problem, then it is most probably 
the issue with big data blocks. 

You can limit the max block size by changing the value in cpyrit_calpp.cpp (line 495).
v2b-4 used a value of ~80000 (it should be divisible by 4096).
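
To pick a value that satisfies the "divisible by 4096" constraint, one can round the wanted size down to the nearest multiple. This is only an illustrative sketch: `align_block_size` and `BLOCK_ALIGN` are hypothetical names, not identifiers from cpyrit_calpp.cpp.

```python
BLOCK_ALIGN = 4096  # alignment mentioned in the comment above


def align_block_size(requested, align=BLOCK_ALIGN):
    """Round 'requested' down to a multiple of 'align', never below one block."""
    return max(align, (requested // align) * align)


# ~80000 rounded down to a multiple of 4096
print(align_block_size(80000))
```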

Original comment by hazema...@gmail.com on 2 Aug 2010 at 10:25

GoogleCodeExporter commented 8 years ago

Also for v2 div_size=1

Original comment by hazema...@gmail.com on 2 Aug 2010 at 10:30

GoogleCodeExporter commented 8 years ago

I will run the tests and report later today.

Original comment by pyrit.lo...@gmail.com on 3 Aug 2010 at 12:29

GoogleCodeExporter commented 8 years ago

hazeman11, here are the results.

remove calpp 0.86.3 (delete /usr/local/include/cal directory)
remove pyrit (delete everything inside /usr/local/lib/python2.6/dist-packages/)

install calpp 0.87
install pyrit 276

pyrit benchmark 
#1 82000
#2 41000
#3 700
#4 700
#5 0

Then I ran the real test 3 times; the rates were 69-73k PMKs/s.

remove pyrit 276
install pyrit v2-4 (sorry, I don't know where to find v2b-4)
pyrit benchmark 
#1 69000
#2 34000

Then I ran the real test 3 times; the rates were 99-100k PMKs/s (more stable results).

remove pyrit v2-4
install pyrit 276
configure limit_ncpus = 1

Then I ran the real test 6 times; the rates were 120-120-78-75-69-80k PMKs/s (the first 2 runs were 
good, but then the rate decreased).

configure limit_ncpus = 2

Then I ran the real test 4 times; the rates were 90-79-78-81k PMKs/s.

Ok, I am confused now.
Any suggestion?

Original comment by pyrit.lo...@gmail.com on 3 Aug 2010 at 5:38

GoogleCodeExporter commented 8 years ago

I've uploaded a debug modification to svn.
To use it you need to uncomment line 32. You also need to modify line 76 in 
setup.py:
libraries=['crypto','aticalrt','aticalcl','boost_date_time-mt'],
(On your system boost_date_time-mt could have a slightly different name; check 
/usr/lib/ for libboost_date_timeXXXXXX. Boost date-time must be installed.)

After the modification the calpp core will print data about the time lost to 
data preprocessing, or in other words the time when the GPU is idle and does nothing.

You can try to play with lines 556...559 in _cpyrit_calpp.cpp.
Change:

div_size=1;
avg_size=xxxx;
max_size=xxxx; 

xxxx should be a multiple of 4096. It's possible that pyrit has performance 
problems creating the huge data sets required for the 5870 to run for 3 secs.

On my CPU (2.5GHz) pyrit can't feed a 5850 even during the benchmark; a huge 
amount of time is lost.

Unfortunately, preprocessing performance is a huge problem in pyrit. Pyrit is written 
in python, which has issues with multitasking. Any preprocessing done in python 
(some parts are done in C in the computing cores) is computed sequentially (so 
even if you have 8 cores, only 1 is used for preprocessing). I think this might 
be the problem we are hitting here.

Also, some of the unstable runs you have seen could be caused by driver problems. I have 
seen this happen on ATI GPUs.

Original comment by hazema...@gmail.com on 8 Sep 2010 at 1:29

GoogleCodeExporter commented 8 years ago

Hi hazeman11.
My PC has a hard disk issue. Once I have solved it I will follow your suggestion 
and then come back to report.

Original comment by pyrit.lo...@gmail.com on 10 Sep 2010 at 11:37

GoogleCodeExporter commented 8 years ago

hazeman11, note that I am now working with r279.

I did as you said and got A LOT of the following messages:

"No fast enough data preparation for GPU: lost time XXX ms"
"No fast enough data preparation for GPU: Estimated lost time YYY ms"
where XXX ranges from 5 to 22 and YYY ranges from 3 to 507. (NOTE: sometimes YYY is 
a negative number, such as -1, -3, -207.)

Then I commented line 76 again and played with lines 556...559 in _cpyrit_calpp.cpp.
Here are the results:

div_size=1; avg_size=4096; max_size=4096; ---> PSK=103k
div_size=1; avg_size=4096; max_size=8192; ---> PSK=114k
div_size=1; avg_size=8192; max_size=8192; ---> PSK=114k
div_size=1; avg_size=8192; max_size=16384; ---> PSK=118k
div_size=1; avg_size=16384; max_size=16384; ---> PSK=117k
div_size=1; avg_size=16384; max_size=32768; ---> PSK=90k

All in all, I got back to at least 118k, but the 122k I had before is still missing.

Maybe there are other parameters I can modify?

Original comment by pyrit.lo...@gmail.com on 18 Sep 2010 at 11:42

GoogleCodeExporter commented 8 years ago

I'll first explain what those outputs mean.
No fast enough data preparation for GPU: lost time XXX ms
  - This is the time between finalising the last GPU computations and starting new ones. The GPU could have finished computing much earlier; what is measured is the moment the CPU gets back to take care of the GPU. So any time reported here is time when the GPU is idle.

No fast enough data preparation for GPU: Estimated lost time YYY ms
  - This one is trickier. In theory the CPU should prepare the next data for the GPU while the GPU is working. After preparing the data it should wait for the GPU to finish the current computations. A typical computation cycle should last 3 seconds. 
    This message is displayed when the CPU waits for the GPU for less than 0.2 s. In most cases it means that the GPU had already finished its computations and the CPU wasn't fast enough to prepare data for the next cycle. The debugging code tries to estimate the lost time (based on GPU speed and current time), but it isn't always accurate, so sometimes there are negative values.
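
To illustrate why an estimate "based on GPU speed and current time" can go negative, here is a rough sketch. It is an assumption about how such an estimate could be built, not the actual debug code in the calpp core; `estimated_lost_ms` and its parameters are hypothetical names.

```python
def estimated_lost_ms(submit_time, n_items, gpu_rate, now):
    """Estimate GPU idle time in milliseconds.

    submit_time: when the work was handed to the GPU (seconds)
    n_items:     number of passwords in the work block
    gpu_rate:    measured GPU speed in PMKs/s (an average, so only approximate)
    now:         when the CPU came back with the next block (seconds)
    """
    expected_finish = submit_time + n_items / gpu_rate
    # Positive: the GPU probably sat idle. Negative: the rate estimate was too
    # optimistic and the GPU was most likely still busy when the CPU returned.
    return (now - expected_finish) * 1000.0


# A 3-second block (240000 items at 80000 PMKs/s), CPU returns 100 ms late
print(estimated_lost_ms(0.0, 240000, 80000.0, 3.1))
# Same block, CPU returns 100 ms before the estimated finish time
print(estimated_lost_ms(0.0, 240000, 80000.0, 2.9))
```

Because `gpu_rate` is only an average, small errors in it shift the expected finish time, which is one plausible source of the negative values reported above.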

I don't think that modifying any values makes sense now. It's obvious that 
the pyrit engine written in python simply can't prepare data fast enough 
for an ATI GPU.

Original comment by hazema...@gmail.com on 20 Sep 2010 at 11:44

GoogleCodeExporter commented 8 years ago

My hardware used to reach 125K, now only 118k. Something, somewhere, is wrong. 
Unfortunately I no longer have the hard disk with the configuration (OS, ATI driver, 
ATI stream, pyrit version, etc.) that gave me 125k, so I can't go back to those 
"golden days". (NOTE: 125k was real, not an inaccurate value from the benchmark.)
Because I know my hardware can do 125K, I asked you whether there are other 
parameters to tune.
Anyway, I totally agree with you that the weakness is in the programming language: 
python was able to manage "slow" GPUs such as the NVIDIA 8800 or ATI 4850, but now with 
bigger GPUs python has reached its limit. Unfortunately, my lack of programming skills does 
not allow me to help port pyrit from python to C, so I can only wait until 
someone more skilled than me does it :(.

Original comment by pyrit.lo...@gmail.com on 22 Sep 2010 at 5:33

GoogleCodeExporter commented 8 years ago

Some more info.

I reran with div_size=1; avg_size=8192; max_size=16384; and this time I got 
PMK=119402, better than before. Maybe because the PC had just been turned on and was 
fresh :) Anyway, the 4-core CPU runs at 3.2GHz.

CPU is 3200MHz, 200x16.

Then I overclocked the CPU to 3500MHz, 200x17.5 (+9.375%) and reran the same test. The 
resulting PMK rate is the same: 119400.

This is strange to me. I supposed that a faster CPU would compensate for the 
inefficiency of python, but it does not.

Then I also overclocked (both) GPUs from 850MHz to 875MHz
(aticonfig --od-enable
aticonfig --adapter=0 --od-setclocks=875,1200
aticonfig --adapter=1 --od-setclocks=875,1200)

and ran the test again: PMK=112044 (worse!)

Then I also overclocked the RAM (on both cards) from 1200MHz to 1250MHz
(aticonfig --adapter=0 --od-setclocks=875,1250
aticonfig --adapter=1 --od-setclocks=875,1250)

and ran the test again: PMK=123457 

Then I overclocked the RAM (on both cards) again, from 1250MHz to 1300MHz
(aticonfig --adapter=0 --od-setclocks=875,1300
aticonfig --adapter=1 --od-setclocks=875,1300)

and ran the test again: PMK=122699

So it seems that the bottleneck is the clock of the RAM on the video card, and the best speed 
is 1250MHz.

I hope these tests help.

Original comment by pyrit.lo...@gmail.com on 24 Sep 2010 at 5:11

GoogleCodeExporter commented 8 years ago

I guess CPU speed does matter!
After overclocking my CPU from 2.5 to 3.3GHz I noticed a huge improvement, 
especially when using sqlite:// provider. Now I can get almost 110.000 PMK/s 
instead of 90.000.
Okay, it is still far away from the theoretical maximum (140.000 PMK/s) but 
finally I can see the bottleneck.

Original comment by kopierschnitte@googlemail.com on 6 Oct 2010 at 7:08

GoogleCodeExporter commented 8 years ago

My conclusion is that a CPU overclock helps in YOUR case. In my case, moving from 
3.2GHz to 3.5GHz does not give a speed-up (I suspect that from 3.2GHz upwards 
CPU speed no longer matters because of the limits of python, and only C could 
give a further speed-up).
Anyway, keep in mind that I don't use sql, I use direct files. So maybe the 
problem is elsewhere. I suggest you repeat my test (pyrit -e test -r 
wpa.cap -i list.txt attack_passthrough) and see whether with direct files you 
get a further speed-up.

A test that I have not done yet is to tune the number of CPUs involved: I have to try 
1, 2, 3 or 4, because I suspect that too many CPUs cause a bottleneck in 
thread management. After that, I will post the results.

Original comment by pyrit.lo...@gmail.com on 7 Oct 2010 at 12:13

GoogleCodeExporter commented 8 years ago

Okay, interesting point. I will try the attack_passthrough function as soon as 
possible.

When you say, you don't use sql and the overclocking didn't get you any 
performance boost: Don't you have those extremely long delays (1h and more) 
when starting the attack or using the eval command?

I guess that CPU is limiting the GPU-performance as long as you get 100% load 
on each core while "feeding" the GPU ... but I might be wrong ;-)

Original comment by kopierschnitte@googlemail.com on 7 Oct 2010 at 1:20

GoogleCodeExporter commented 8 years ago

In issue 191, Lucas stated that the direct file storage isn't quite effective 
for large sets of passwords. So I guess it would be the best to focus on the 
sql engines for massive amounts of workunits.

Did you already try your above tests using a different storage provider?

Original comment by kopierschnitte@googlemail.com on 7 Oct 2010 at 2:12

GoogleCodeExporter commented 8 years ago

kopierschnitte,
because of my needs, I don't use any kind of *sql database to store passwords and 
PMKs, so I cannot answer your question, sorry. Moreover, I have never used the 'eval' 
command at all.

My "modus operandi" is to take long lists of passwords (about 250 million each) 
and create a cowpatty file with password+PMK.

For a "test on the fly" I use the attack_passthrough method shown above.

Original comment by pyrit.lo...@gmail.com on 7 Oct 2010 at 3:04

GoogleCodeExporter commented 8 years ago

Okay, I understand. For 250M passwords, the file provider shouldn't be a 
problem. And because you don't write the computed PMKs back to the db, you 
don't stress the process too much.

According to your tests above, it seems to be a timing issue between the GPU 
memory clock and the speed at which pyrit feeds the GPU. Are you sure you got 
better results with your previous Linux installation?

Original comment by kopierschnitte@googlemail.com on 7 Oct 2010 at 7:32

GoogleCodeExporter commented 8 years ago

One thing that I really HATE is when people ask me "but are you 
sure?", as if they suspect I am a complete idiot unable to do my job or to read a 
number on a monitor.
To convince yourself, point your web browser here and read the whole story: 
http://code.google.com/p/pyrit/issues/detail?id=148

Original comment by pyrit.lo...@gmail.com on 7 Oct 2010 at 8:00

GoogleCodeExporter commented 8 years ago

I ran the tests.

limit_ncpus = 0 AND workunit =  75000: time = 335+K sec (this is the default setting)
limit_ncpus = 1 AND workunit =  75000: time = 335+K sec
limit_ncpus = 2 AND workunit = 150000: time = 335+K sec
limit_ncpus = 2 AND workunit =  75000: time = 335+K sec
limit_ncpus = 3 AND workunit = 150000: time = 335+K sec
limit_ncpus = 3 AND workunit =  75000: time = 334+K sec
limit_ncpus = 3 AND workunit = 300000: time = 333+K sec
where 0 < K < 1

I can say that 'limit_ncpus' and 'workunit' are 2 parameters that do not affect the 
result in my case.

Original comment by pyrit.lo...@gmail.com on 7 Oct 2010 at 8:09

GoogleCodeExporter commented 8 years ago

I don't think you are an idiot. Sorry if my post offended you that way. I just 
wanted to know if you've got any idea what the cause could be as it seems we 
are both fighting the same problem.

Original comment by kopierschnitte@googlemail.com on 7 Oct 2010 at 9:27

GoogleCodeExporter commented 8 years ago

Well, for me it is impossible to identify where the problem is because there are 
too many variables to take into consideration.
A. Pyrit changes weekly 
B. There is a new Catalyst driver every month (and regressions seem to be the rule, not 
the exception)
C. The SDK changes version quite often
D. calpp moved from 0.86.3 to 0.87
E. I upgraded from Debian stable 5.0 to Debian testing 6.0 because python >= 2.6 
is needed and python-scapy 2.0 is only in Debian >= 6.
F. There are several parameters you can play with inside pyrit (limit_ncpus, 
avg_size, max_size, etc.)

It is quite hard to identify where the regression is: bad interaction between 
driver and hardware? A bug in pyrit? A problem in the SDK? A regression in calpp? Who 
knows? For me the rule is now: "when you find the best configuration, 
IMMEDIATELY make an image of the whole hard disk and store it in a safe place". Had I 
followed this golden rule months ago, I could have saved my 125K PMK/s setup.

Anyway, I am confident that rewriting pyrit in C would reduce or eliminate these 
problems; have a look at issue 185 for the discussion about this proposal. 
Unfortunately, it seems to be a "low priority" issue :( when it should be 
(in my opinion) "THE" issue.

Original comment by pyrit.lo...@gmail.com on 7 Oct 2010 at 10:23

GoogleCodeExporter commented 8 years ago

Sorry, I did not understand your workflow correctly. Of course, the workunit 
size shouldn't matter because you don't do anything with the database / storage 
provider.

So I agree that this is "error by design". In your case, the reading and 
preparing of the passwords, the queue management and the GPU-feeding is done in 
a single thread.

How is the CPU utilization (per core) when running attack_passthrough with 
limit_ncpus=1?

Original comment by kopierschnitte@googlemail.com on 8 Oct 2010 at 12:54

GoogleCodeExporter commented 8 years ago

I ran the tests in "brainless" mode.
There are variables? OK, I run pyrit with different values for each variable and 
measure the time needed to complete the same task with each value. 
In other words: I don't know whether changing the workunit value will give me a 
speed-up, but I tried it anyway to get a wider range of results.

I don't know what the load on each CPU is when I run attack_passthrough 
with limit_ncpus=1: I did not check it. In the past, I discovered that if I disturb 
the PC with other tasks (top, ps, htop, etc.) while pyrit is running, pyrit 
slows down. See issue 148 comment 110. Because of that, I am used to doing nothing 
else while pyrit is running.

Original comment by pyrit.lo...@gmail.com on 8 Oct 2010 at 2:08

GoogleCodeExporter commented 8 years ago

Okay, forget the workunit value. It doesn't matter in your case. I was asking 
about the CPU utilization because I'm quite sure that this is the limiting 
factor but for now I don't understand why pyrit only reaches the CPU limit when 
using the attack- or batch- but not when using benchmark-command.

Did you adjust your DISPLAY environment setting?
What's your current output of "echo $DISPLAY" on your system?
Did you attach a monitor on each GPU?

Original comment by kopierschnitte@googlemail.com on 14 Oct 2010 at 10:40

GoogleCodeExporter commented 8 years ago

> I don't understand why pyrit only reaches the CPU limit when using the attack- or
> batch- but not when using benchmark-command.

I suspect it is because the benchmark is not REAL work but only a test, so it does not 
really "push the CPU to the limit": python can manage a test, but can't 
"push the CPU to the limit". 

About the display: I set "export DISPLAY=:0" in /root/.bashrc. All my activities 
related to pyrit are done with the root account, so the DISPLAY variable is correct.
The output of "echo $DISPLAY" is ":0"

Monitor: no, only one monitor (the only one I have) is connected, to the primary 
output of the primary video card (HD5870).

Original comment by pyrit.lo...@gmail.com on 14 Oct 2010 at 8:23

GoogleCodeExporter commented 8 years ago

Hmm, but I always thought, pyrit just "replays" sample computations when doing 
benchmarks. Maybe pyrit is calculating random passwords or something like this. 

I still suspect the file i/o subsystem to be our bottleneck on this issue :-(

Regarding the DISPLAY and/or monitor thing: Someone in the ATI forum wrote that 
it could matter...

Original comment by kopierschnitte@googlemail.com on 15 Oct 2010 at 7:13

GoogleCodeExporter commented 8 years ago

About the suspected I/O bottleneck: I use XFS, what filesystem do you use? Try a 
different filesystem, maybe you will get better I/O bandwidth. Or use 2 
different disks, one to read passwords from and one to store PMKs. If you have 
enough RAM, you can use /dev/shm to eliminate the delay of a mechanical hard disk.

About DISPLAY: who said it? What exactly did they say? Where? In what way could it 
matter? Please give more and, if possible, reliable info.

Original comment by pyrit.lo...@gmail.com on 15 Oct 2010 at 7:53

GoogleCodeExporter commented 8 years ago

please take discussions to the mailing list where everyone can find it

Original comment by lukas.l...@gmail.com on 15 Oct 2010 at 8:00

GoogleCodeExporter commented 8 years ago

Issue 198 has been merged into this issue.

Original comment by lukas.l...@gmail.com on 17 Oct 2010 at 12:31

GoogleCodeExporter commented 8 years ago

I'll try to stick on the issue's topic now, sorry.
The information about the DISPLAY: variable can be found here in issue 123 
(comment 44).
The other sources for my last comment are
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=140633 (for the 
2nd monitor topic) and 
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=139606 (for the 
DISPLAY: thing).

Regarding the I/O performance, I also thought the HDDs might be a bottleneck, 
but iostat -m 1 is telling me something about 200 IOPS and >2MB/s read/write 
speed at the moment pyrit runs. I suspect that recent SATA drives can easily 
handle more IOPS and I've crosschecked this by doing some simple copy tasks 
(achieving more than 400 IOPS). But I'll try to switch to XFS or btrfs as soon 
as possible.

Maybe there's some way to control how many passwords pyrit holds in memory (= 
size of the ring buffer / queue)...

Original comment by kopierschnitte@googlemail.com on 17 Oct 2010 at 7:26

GoogleCodeExporter commented 8 years ago

Please stop talking about HDD performance or other stuff. You can check whether it 
is python's/pyrit's fault by running 2-3 instances of pyrit at the same time. 
On our test machine we get 170K PMK/s with a single pyrit instance and 240-280K PMK/s 
with 3 instances. It's a very simple test...

Original comment by mmajchro...@gmail.com on 17 Oct 2010 at 7:30

GoogleCodeExporter commented 8 years ago

mmajchrowicz, OK, I see the point. 
Pyrit/python/catalyst should scale with the hardware, but it does not, and 
your test confirms it.
On my side, if I test the two cards one at a time, the HD5870 has double the 
power of the HD5770, but when I use them together, the total is not 150% but less.
By the way, an HD5970 should do at least 120K and 4 HD5970s should do at least 
480K, but they do not. To me, it is pyrit/python/catalyst that does not scale up 
to all that powerful hardware (or maybe the fault is in the driver). Anyway, it is not 
a problem that we can solve, only report.

Original comment by pyrit.lo...@gmail.com on 17 Oct 2010 at 8:27

GoogleCodeExporter commented 8 years ago

We need some python/pyrit magic or a full C/C++ rewrite... Many modules are 
already written in C/C++, so maybe we can move some management code into an external 
C/C++ module, since this seems to be the issue?

Original comment by mmajchro...@gmail.com on 17 Oct 2010 at 8:43

GoogleCodeExporter commented 8 years ago

Okay, let's forget the IO performance but it's still a fact that each GPU needs 
a dedicated CPU core. In your case, this would mean a total of 8 cores. I guess 
that's one reason you don't get 480k PMK/s.

Pyrit's "queue management" is surely another point and explains why several 
instances are giving you slightly higher performance. And exactly that's the 
point we can work on. Hopefully this doesn't involve a complete rewrite. Maybe 
an optimized threading architecture or larger buffers can also help.

How did you measure your results? As far as I've understood this issue, we are 
talking about a huge difference in "benchmark_long" and the main attack 
commands ("batch", "attack_db", etc.).

Original comment by kopierschnitte@googlemail.com on 17 Oct 2010 at 8:59

GoogleCodeExporter commented 8 years ago

ATM I am not interested in getting 240k PMK/s on a system with 4 cores and 2 HD5970s 
running one pyrit instance instead of 2-3 instances. Besides, it should be at 
least 280k PMK/s, since I am able to get up to 75k PMK/s from a single core of an HD 5970 
and 135-140k from two cores.

Original comment by mmajchro...@gmail.com on 17 Oct 2010 at 9:05

GoogleCodeExporter commented 8 years ago

I am talking about results that I get from the benchmark/benchmark_long commands. I 
know I will get less when I run other commands, and I know that other "parts" of 
pyrit also have an impact on its performance, but I don't see the point in trying to 
optimize other parts if I am not able to use at least 80-90% of the computing power 
on a simple benchmark.

Original comment by mmajchro...@gmail.com on 17 Oct 2010 at 9:09

GoogleCodeExporter commented 8 years ago

I opened this issue because the problem appears when I run a REAL test on 
data, not when I run the benchmark. I don't see the advantage of investigating the 
benchmark/benchmark_long commands: they are just a bogomips(*) benchmark, not a 
real test on real data that also involves other parts of pyrit.

(*)http://en.wikipedia.org/wiki/BogoMips

Original comment by pyrit.lo...@gmail.com on 18 Oct 2010 at 9:05

GoogleCodeExporter commented 8 years ago

You are completely wrong. If the CPU has problems running the benchmark alone, it's 
obvious it will have even bigger problems calculating PMKs (the main part of the 
benchmark) while also doing other stuff (reading passwords, writing results and 
so on). On our hardware we also get performance issues when we run a REAL test on 
data, but since we are not able to get even 50% of the computational power when 
running a simple benchmark, don't you think this should be fixed first? Come on 
guys, what is the point of reimplementing other stuff when it's impossible to 
reach full potential on the "test" command? Besides, CPU performance probably also 
influences your results. It is just more visible when you "force" pyrit 
to do other stuff besides calculating PMKs.

Original comment by mmajchro...@gmail.com on 18 Oct 2010 at 9:20

GoogleCodeExporter commented 8 years ago

mmajchrowicz, 
ask yourself WHY there are bigger problems when a REAL test is run: maybe 
it is because the problem is not in the benchmark test.

Besides, as far as I know, there is an issue in the Catalyst driver that does not 
allow the HD5970 to use BOTH GPUs, so only one is used on each HD5970.

From http://pyrit.wordpress.com/2010/08/16/ati-still-sucks/ I quote the following:
"The ATI Radeon™ HD 5970 GPU is currently supported in single-GPU mode only. 
It is recommended users only access the first device on an ATI Radeon™ HD 5970 
GPU for GPU compute."

So, until ATI fixes the issue, a single-GPU HD5870 will have more power 
than a dual-GPU HD5970.

An HD5870 at 850MHz does 82000 PMK/s. The 5970 runs at 725MHz, so 82000/850*725=70K (one 
core).

So, at the moment, each HD5970 should do no more than 70K, so 4*70=280K PMK/s, 
which is exactly what you get from your hardware. For the 5970, the main issue is 
inside Catalyst, not inside pyrit.
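
The estimate above can be written out as a short calculation. The sketch assumes, as the comment does, that the PMK rate scales linearly with the GPU core clock; that is a simplification, and `scale_by_clock` is a hypothetical helper name.

```python
def scale_by_clock(rate_pmks, from_mhz, to_mhz):
    """Scale a measured rate linearly from one core clock to another."""
    return rate_pmks / from_mhz * to_mhz


# HD5870: 82000 PMK/s at 850 MHz; one HD5970 core runs at 725 MHz
per_card = scale_by_clock(82000, 850, 725)
four_cards = 4 * per_card  # one usable core per HD5970, four cards

print("one HD5970 core: %.0f PMK/s" % per_card)
print("four HD5970s:    %.0f PMK/s" % four_cards)
```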

I got clock from here: 
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units

Original comment by pyrit.lo...@gmail.com on 18 Oct 2010 at 10:25

GoogleCodeExporter commented 8 years ago

You are again "almost" right and as a result completely wrong. First of all, CPU 
cycles aren't magical. If pyrit/python doesn't "have" enough of them for just 
benchmarking, it is obvious that it will "have" even fewer if you ask it to do 
additional stuff.
Secondly, ATI lies :) I don't know why, but they do. It's probably because it 
didn't work with a previous version of the driver; it works flawlessly now, but 
they just didn't change the note on their web page. As I have mentioned before, 
(when I have ONLY ONE HD 5970 in my machine) I get 75k PMK/s for a single core, 
but when I use BOTH CORES I get around 135k-140k PMK/s, so the second core really 
works. Also, lukas mentions our configuration 
(http://pyrit.wordpress.com/2010/10/07/pyrit-on-4x-radeon-hd-5970/) as an example 
of using multiple GPUs, so that's really not the issue.

Original comment by mmajchro...@gmail.com on 18 Oct 2010 at 10:35

GoogleCodeExporter commented 8 years ago

mmajchrowicz, 
Question: if you run 8 pyrit sessions (one for each available GPU), how much 
aggregate PMK/s do you get? Or is the CPU the limit in this case, because 
pyrit cannot manage 2 GPUs with only 1 CPU core? (I suppose you have a real quad-core 
CPU, not Hyper-Threading.)

Original comment by pyrit.lo...@gmail.com on 18 Oct 2010 at 10:48

GoogleCodeExporter commented 8 years ago

Yes, we have a real quad-core, and forget that I mentioned that we have 8 GPUs. 
Let's assume we only have 4 GPUs (two HD 5970s), so according to you 
everything should work fine and we should get maximum power. Instead we get 
something like this:
1 pyrit instance - 170k PMK/s
2 pyrit instances - 190-200k PMK/s
3 pyrit instances - 220-230k PMK/s
Besides, pyrit is not optimized for running multiple instances. I only mention it 
because if you run multiple instances, every python/pyrit process "gets" another 
core and as a result more CPU power is used for managing the PMK computation. 
If you say that you were able to get 120k PMK/s with pyrit, I wouldn't be 
surprised if you were able to get 130-140k PMK/s once pyrit's management code 
is fixed/rewritten. The "regression" that you have noticed is probably because 
more features or more complex code are used "on the CPU side". The point is to 
"force" pyrit (if that is possible at all with python) to make better use of 
multi-core CPUs. I believe this is the root of our problems and my tests 
prove it.

Original comment by mmajchro...@gmail.com on 18 Oct 2010 at 10:57

GoogleCodeExporter commented 8 years ago

I will run tests with 2 pyrit instances later today, then I will report the 
results here.

Original comment by pyrit.lo...@gmail.com on 18 Oct 2010 at 11:16

GoogleCodeExporter commented 8 years ago

mmajchrowicz, I totally agree with you: If the cpu(-core) isn't capable of 
doing a benchmark run at maximum possible speed, it's impossible to get near to 
this when running the main attack functions.

The dual-GPU issue really seems to have been fixed a few driver releases ago. You can 
easily check this by looking at aticonfig's output. On my system, I get around 
90% GPU utilization on both 5970 cores. But only during benchmarks ... every 
other function results in highly unstable values between 20 and 70%. So it's 
obvious that the GPU is waiting for input.

I suppose, when you run multiple instances, the instances are simply taking up 
the remaining (tiny amount of) processor idle time. But python itself should be 
able to handle multiple threads/cores...

Again, it would be great if someone with python knowledge could take a look 
into the code to see if there are any hardcoded queue-lengths or something else 
that prevents the 5870/5970 from being constantly(!) busy.

Original comment by kopierschnitte@googlemail.com on 18 Oct 2010 at 2:34

GoogleCodeExporter commented 8 years ago

Let me explain a few things.

First of all - difference between benchmark and normal work.

Code execution for benchmark looks like this
1. Use random to generate data and put to the queue ( this is done in loop )
2. Take data from queue and call core function to process ( Core class )
3. Process data ( this is done in plugin core - opencl, cal++, cpu )

For normal attack it looks this way
1. Take data from storage ( files, database ) and put to the queue
2. Take data from queue and call core function to process ( Core class )
3. Process data ( this is done in C plugin  - opencl, cal++, cpu )

So we see that point 2 & 3 are exactly the same for both cases. Only point 1 is 
different - and usually taking data from storage will be more cpu intensive. 
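
The two code paths listed above can be sketched as one queue-based pipeline with interchangeable producers. This is a simplified illustration, not pyrit's actual code: `random_source`, `storage_source`, `run_pipeline` and `fake_core` are hypothetical names, and `fake_core` merely stands in for the C plugin.

```python
import queue
import random
import string
import threading


def random_source(n_batches, batch_size):
    """Benchmark-style step 1: generate random passwords."""
    for _ in range(n_batches):
        yield ["".join(random.choice(string.ascii_letters) for _ in range(8))
               for _ in range(batch_size)]


def storage_source(passwords, batch_size):
    """Attack-style step 1: read passwords from storage in batches."""
    batch = []
    for pw in passwords:
        batch.append(pw)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def fake_core(batch):
    """Stand-in for step 3 (the compute plugin); just counts the passwords."""
    return len(batch)


def run_pipeline(source, core):
    """Steps 2 and 3: take batches from the queue and hand them to the core."""
    work = queue.Queue(maxsize=4)

    def producer():
        for batch in source:
            work.put(batch)
        work.put(None)  # sentinel: no more data

    feeder = threading.Thread(target=producer)
    feeder.start()
    processed = 0
    while True:
        batch = work.get()
        if batch is None:
            break
        processed += core(batch)
    feeder.join()
    return processed


print(run_pipeline(random_source(3, 4), fake_core))
print(run_pipeline(storage_source(["a", "b", "c", "d", "e"], 2), fake_core))
```

Only the producer changes between the two calls, which matches the observation that steps 2 and 3 are identical and any extra CPU cost comes from step 1.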

Now the question of multiple instances of pyrit and why they improve performance. 
This is caused by a big problem of python: the Global Interpreter Lock (GIL). Any 
python code must take the GIL very often (every few hundred ops). For multithreaded 
python code this implies that only ONE THREAD is executing at any moment. So 
python code is effectively using only ONE CORE. In pyrit this problem was 
temporarily "solved" by putting the computations into C cores. But this "solution" 
no longer works for high-performance GPUs.
So running multiple instances of pyrit allows more than one CPU core to be used for 
data preparation.

The CPU bottleneck in pyrit could be partially solved by switching from the 
thread library to the process library. This isn't hard to do and doesn't take 
much work. But I think that only a C/C++ rewrite would be a proper solution to 
pyrit's current problems.

I won't start making such big changes, as I'm only responsible for the CAL++ part of 
pyrit. 
Lukas must decide what to do with pyrit.

Original comment by hazema...@gmail.com on 18 Oct 2010 at 4:19

GoogleCodeExporter commented 8 years ago

I ran the test as promised.
If I run 2 instances of pyrit, I get LESS PMK/s than with a single instance.
Single instance: about 120K
Two instances: 43K + 35K.

Original comment by pyrit.lo...@gmail.com on 18 Oct 2010 at 7:09

GoogleCodeExporter commented 8 years ago

Unfortunately it doesn't prove anything... In order to get better results with 
multiple pyrit instances you must have big difference between what you are 
getting and what you are "supposed to get". You must take into consideration 
that in such scenarios pyrits are fighting for GPU :)

Original comment by mmajchro...@gmail.com on 18 Oct 2010 at 7:23

GoogleCodeExporter commented 8 years ago

Thanks for this detailed explanation. So it's really up to Lucas to decide 
which way pyrit will take.

In my case, multiple instances are also decreasing the overall performance. But 
I guess if the speed of one cpu core is sufficient for feeding exactly one gpu 
core, no improvement is expected. In this case, there is no kind of race 
condition. When file (or password) i/o is taking place, this might be different.

Just an idea: Could the multiprocessing interface be helpful?
-> http://docs.python.org/library/multiprocessing.html

Original comment by kopierschnitte@googlemail.com on 18 Oct 2010 at 8:30

GoogleCodeExporter commented 8 years ago

kopierschnitte: Your test is flawed as both instances of Pyrit battle for the 
same GPUs. Overhead is killing performance as they get in each others ways.

Python will always be an integral part of Pyrit (see the name).

The latency(!) to supply work to the GPU is now a bigger problem than once 
thought, as GPUs are much faster than expected. Remember: 10.000 PMKs on a 
single, high-class GPU was a big number just a few months ago. Now we are 
talking about 280.000 PMKs on 8 GPUs...

The GIL is not a problem and (forgive me for saying) its consequences are not well 
understood. Other threading libraries are not considered; Pyrit will 
always rely on CPython's threading. The multiprocessing module is not 
considered as it is not available in Python 2.5; it is also not very stable and 
has caused numerous problems in other projects.

What needs to be done is to implement a kind of triple-buffering between CPU 
and GPU. In this approach, the thread that steers the GPU runs for almost all 
of its lifetime without ever needing to acquire the GIL. So, there is already a 
solution in my mind for those top 10% of Pyrit users with high-end hardware :-)
Again, my main problems are time (full-time job and diploma thesis) and 
access to hardware that can actually provide the workload (I have one six-year-old 
computer with a 4850 and a MacBook Pro with a 9800m).
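
For readers unfamiliar with the idea, the buffer rotation behind triple-buffering can be sketched as follows. This is not Pyrit's implementation and only an assumption about what such a scheme could look like: in a real version the GPU-steering thread would live in C so that it runs without the GIL, whereas this pure-Python sketch only shows how three buffers cycle between "being filled", "ready" and "being processed". The names `triple_buffered` and `compute` are hypothetical.

```python
import queue
import threading


def triple_buffered(batches, compute, n_buffers=3):
    """Feed 'batches' to 'compute' through a fixed pool of reusable buffers."""
    free = queue.Queue()   # buffers the CPU side may fill
    ready = queue.Queue()  # filled buffers waiting for the GPU side
    for _ in range(n_buffers):
        free.put([])
    results = []

    def gpu_worker():
        while True:
            buf = ready.get()
            if buf is None:  # sentinel: producer is done
                break
            results.append(compute(buf))
            buf.clear()
            free.put(buf)    # hand the buffer back for refilling

    worker = threading.Thread(target=gpu_worker)
    worker.start()
    for batch in batches:
        buf = free.get()     # blocks only when all buffers are in flight
        buf.extend(batch)
        ready.put(buf)
    ready.put(None)
    worker.join()
    return results


# 'sum' stands in for the GPU computation
print(triple_buffered([[1, 2], [3], [4, 5, 6]], sum))
```

With three buffers the producer can stay up to two blocks ahead of the consumer, so a short stall on the CPU side does not immediately leave the GPU without work.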

Original comment by lukas.l...@gmail.com on 18 Oct 2010 at 9:16

GoogleCodeExporter commented 8 years ago

Original comment by lukas.l...@gmail.com on 18 Oct 2010 at 9:19

Changed title: Pyrit does not scale well for multiple GPUs
Changed state: Accepted

dwaaan / pyrit

Pyrit does not scale well for multiple GPUs #173