Open GoogleCodeExporter opened 8 years ago
> What needs to be done is to implement a kind of triple-buffering between CPU
> and GPU. In this approach, the thread that steers the GPU runs for almost all
> of its lifetime without ever needing to acquire the GIL.
The problem isn't solvable by buffering. At the moment the CAL++ core supports
n-buffering (without needing to acquire the GIL for buffered data), and triple
buffering was tested during v3 development (as well as quadruple and more :) ).
The problem is that the main pyrit classes can't "produce" data fast enough. This
isn't an issue of latency. Also, multiple cards aren't required to hit this CPU
bottleneck: a 2.5GHz Pentium dual-core CPU can't feed a single 5850 card during
the benchmark.
The bottleneck is located in the "feed queue - queue management - taking data
from queue" code. All of this processing is done in Python and in different
threads. So if you want to stay with the current "design decisions", I really doubt
it can be solved.
> So, there is a solution for those top 10% users of Pyrit with high-end
> hardware already in my mind :-)
In a year or two it won't be 10% but 90%. So in the future it might be a deal breaker
for pyrit. And I really doubt that we have a solution.
Original comment by hazema...@gmail.com
on 19 Oct 2010 at 12:12
Can someone explain to me what GIL means? Thanks.
Original comment by pyrit.lo...@gmail.com
on 19 Oct 2010 at 8:21
Python is an interpreted language. That means the code written by the
programmer is translated into an intermediate state and then interpreted under
the oversight of an almighty interpreter-overlord. The overlord is omniscient,
decides when an object's lifetime has ended, allocates and frees memory etc.
Things become more complicated when threads are involved. Threads may cause all
sorts of problems as they can modify objects concurrently (e.g. thread 1 adding
an object to a list of objects which thread 2 is currently deleting).
The programmers of CPython (the main implementation of Python, written in C)
had to make sure that the interpreter-overlord always stays consistent. This is
especially true because threads in CPython are real OS-threads which are hardly
managed any further by the interpreter. The way to solve this was to
introduce the Global Interpreter Lock (GIL): it ensures that only one thread at
a time can run interpreted code or access the CPython-API. That way multiple
threads can't get in each other's way and corrupt each other's state.
Everyone has to wait in line.
The downside is that code written in Python can't execute CPU-bound code in
multiple threads in parallel. Code written in C and called from Python however can - as
long as it does not touch the CPython-API.
The GIL has been the source of a lot of controversy and statements like "you
can't do threads in python". However the GIL is a very, very strong solution to
a very complicated problem. On the other hand it is not free from drawbacks and
unwanted side-effects. The way the lock (basically managed by the OS/glibc),
threads (OS/glibc), signals (OS/glibc) and CPython itself work together can
cause problems like priority-inversion.
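A minimal sketch of the difference between interpreted code and C code that lets go of the GIL, assuming a stock (GIL-enabled) CPython build whose hashlib is backed by OpenSSL. The helper names and workload sizes are made up for illustration, and whether `hashlib.pbkdf2_hmac` really releases the GIL depends on the build.

```python
import hashlib
import threading
import time

def pure_python_work(n):
    # Interpreted bytecode: the thread needs the GIL for every step.
    total = 0
    for i in range(n):
        total += i * i
    return total

def c_level_work(rounds):
    # Runs inside C code; on OpenSSL-backed builds this is expected to
    # release the GIL while hashing (assumption, depends on the build).
    return hashlib.pbkdf2_hmac("sha1", b"password", b"ssid", rounds, 32)

def run_in_threads(func, arg, n_threads):
    """Run func(arg) in n_threads threads; return (results, wall-clock seconds)."""
    results = [None] * n_threads

    def worker(slot):
        results[slot] = func(arg)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

if __name__ == "__main__":
    for name, func, arg in (("pure Python", pure_python_work, 1_000_000),
                            ("C (hashlib)", c_level_work, 100_000)):
        _, one = run_in_threads(func, arg, 1)
        _, four = run_in_threads(func, arg, 4)
        print(f"{name}: 1 thread {one:.2f}s, 4 threads {four:.2f}s")
```

On a multi-core machine one would expect the pure-Python case to take roughly four times as long with four threads, while the hashlib case should stay much closer to the single-thread time. The exact numbers depend on the machine and the Python build.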
Original comment by lukas.l...@gmail.com
on 19 Oct 2010 at 9:08
Sorry, I read the previous post (#45) carefully and now I know what GIL means.
Original comment by pyrit.lo...@gmail.com
on 19 Oct 2010 at 9:12
I don't have deep knowledge of thread programming, so most probably I will say
something wrong, but...
What about forcing (don't ask me why) pyrit to run as single tasks, "one CPU, one
GPU"?
I mean, I have a 4-core CPU and 2 single-GPU video cards: OK, I would run pyrit
with some parameters such as:
pyrit --cpu=0 --gpu=0 -e test (and so on) &
pyrit --cpu=1 --gpu=1 -e test (and so on) &
pyrit --cpu=2 --gpu=none -e test (and so on) &
pyrit --cpu=3 --gpu=none -e test (and so on) &
This way I would force 4 instances of pyrit to run, and each of them should run at
100% without disturbing (or being disturbed by) the other tasks.
That would be a workaround. Of course the downside is that I have to
split my dataset so that different tasks don't work on the same data. Also, in the case
of 4 HD5970s there would be 8 GPUs but only 4 CPU cores, so this trick would use only 50%
of the available hardware power... but.. hey, it is just an idea :)
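The dataset split mentioned above could be sketched roughly like this. Note that `--cpu` and `--gpu` are flags proposed in this comment, not existing pyrit options, and `split_round_robin` is a hypothetical helper, not part of pyrit.

```python
def split_round_robin(passwords, n_parts):
    """Deal passwords into n_parts disjoint lists, like dealing cards."""
    if n_parts < 1:
        raise ValueError("n_parts must be >= 1")
    parts = [[] for _ in range(n_parts)]
    for index, password in enumerate(passwords):
        parts[index % n_parts].append(password)
    return parts

if __name__ == "__main__":
    words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
    # One shard per pyrit instance; no password appears in two shards.
    for i, part in enumerate(split_round_robin(words, 4)):
        print(f"instance {i}: {part}")
```

Each shard would then be written to its own word list and handed to one instance, so that no two instances compute the same PMK.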
Original comment by pyrit.lo...@gmail.com
on 19 Oct 2010 at 9:50
@comment 55 - It is some kind of solution, but it won't help in cases where the CPU
can't feed 1 GPU (like for me now). And it's possible that for a new GPU
generation (like 2x faster) no CPU will be fast enough to feed it. So some
fundamental changes are required.
Original comment by hazema...@gmail.com
on 19 Oct 2010 at 3:44
I agree completely. Moreover, it is very common for people to update their GPUs
more often than their CPUs. As a result someone may have an "old" CPU (multi-core) and
a "top class" GPU. So it is very plausible that pyrit would need two cores for
one GPU in such a case. hazeman11, can't you "add" some kind of benchmark cal++
plugin and put it in a separate binary? So that we could measure (in PMK/s) the
difference between pyrit's results and a "possible c/c++ implementation"?
Original comment by mmajchro...@gmail.com
on 19 Oct 2010 at 4:41
Issue 185 has been merged into this issue.
Original comment by lukas.l...@gmail.com
on 19 Oct 2010 at 4:59
#57, good suggestion. It may bring new ideas to Pyrit's main code-tree.
Original comment by lukas.l...@gmail.com
on 19 Oct 2010 at 5:24
> hazeman11 can't you "add" some kind of benchmark cal++ plugin and put it in a
> separate binary?
I've been thinking about it for 2-3 weeks now :), but haven't had much time to do it.
I'm also thinking about making a "null computing core" (with infinite speed :) ) for
pyrit. This would be good for estimating the max performance of pyrit's preprocessing
part.
But I'll try to do something in a week or two :).
Original comment by hazema...@gmail.com
on 19 Oct 2010 at 5:44
#60, the "null computing core" is already there: cpyrit_null
It does exactly what you want it to do (nothing) and can easily be extended to
simulate real work (e.g. yielding in its solve()-function to allow multiple
instances of it to run at a defined "speed" per instance, putting stress on
the GIL).
You'll have to take out some safety-locks in CPyrit in order to initialize it
as it corrupts your database right away :-)
Original comment by lukas.l...@gmail.com
on 19 Oct 2010 at 6:22
@comment 56:
At the moment, the fastest single-GPU card for pyrit/calpp is the HD5870. As I
reported in the past, the HD5870 (1600 shader processors @ 850MHz) has double the power of
the HD5770 (800 shader processors @ 850MHz), so the x2 PMK/s gain is linear with the x2
SPs: in other words, pyrit is still able to do its work for now (when it runs as a
single task on a single GPU).
As far as I know, ATI has no plans to create a 3200SP@850MHz GPU, at least not
before they move to 28nm technology (which means at least 12-18 months).
I don't mean "pyrit is perfect, no change required", but I want to say: "there
is time to trick/patch pyrit as a temporary solution and to re-think the whole
structure of pyrit from the root for future x2 power GPUs".
Of course, it is up to lukas to decide which way to follow.
Original comment by pyrit.lo...@gmail.com
on 20 Oct 2010 at 8:25
ATI 69xx cards are supposed to be available in <2 months. At the moment the 6870
with ~1100 shaders is faster than the 5850 with 1440 shaders. So I'm not sure if we
have 12-18 months before a 2x speed-up.
Original comment by hazema...@gmail.com
on 20 Oct 2010 at 12:05
I agree with hazeman11, but let's assume that pyrit.lover is right. It would mean
that for MONOCORE GPUs pyrit's architecture doesn't have to be changed in order
to use their computational power. As a result we would have a program that
supports multiple GPUs and supports network clients but works well only for single-GPU
configurations... I thought that pyrit was developed to use the full computational
power of the system (CPUs, GPUs) to calculate PMKs... Moreover, I believe this is
just the beginning. More and more people who use pyrit will notice a low (or, as
in our example, even no) increase in brute-force speed after buying additional
GPUs. There will be a lot of complaining, lack of understanding of pyrit/python's
nature and so on. In my opinion (of course, if hazeman11's benchmark confirms it)
this is the last moment to change the current architecture. Lukas, isn't it possible to
leave pyrit as it is but move only the "PMK management" to some C module?
Original comment by mmajchro...@gmail.com
on 20 Oct 2010 at 3:19
An HD4850 is not the best hardware for seeing the structural limits of
pyrit. I think the time has come for lukas to open a paypal account, so we can give
money to allow him to buy a couple of high-end ATI cards. He deserves them:
a donation of 10 Euros will not kill any of us....
Original comment by pyrit.lo...@gmail.com
on 20 Oct 2010 at 3:29
I am willing to perform tests on your hardware. I know lukas was interested but
we didn't quite discuss any details yet. Anyway I am willing to help :)
Original comment by mmajchro...@gmail.com
on 20 Oct 2010 at 3:32
Sorry I mean "our hardware" not your ;) Typo :)
Original comment by mmajchro...@gmail.com
on 20 Oct 2010 at 3:33
@63: I have learned that the value of "if", "maybe", "supposed" and so on is less
than zero.
Have you tested pyrit on these 6870s yourself, or did you just read some web
site? I have read some web sites' reports, but I don't trust them at all (I still remember
all the fucking hype they made for FERMI...)
Original comment by pyrit.lo...@gmail.com
on 20 Oct 2010 at 4:11
There should not be any dogmas involved in free software and Pyrit is no
different. For any open-source project of a certain size the time comes when it
grows out of the reach of its original developer to oversee all aspects of
design and implementation. For any contributor, just as for myself, this may
however involve developing "into the blue" and work which may or may not end up
as a solution in Pyrit's source-tree. This is no different from any other
non-trivial open-source project.
The unquestionable core about what I created and called "Pyrit" are Python,
"free as in freedom", a (aspired) quality of the code and constraints in the
conflict between de-facto being a "hacking tool" and a technological project
(see the second clause from the bottom on the main page). Within these
boundaries, I'm perfectly willing to discuss and accept changes and new
developments.
The bottom line is: Pyrit needs your suggestions; it is an open project. But we
also need horsepower on the road with people being able to outline designs and
writing actual code. I neither can do all the "thinking" nor all the coding on
my own. This is especially true because I'm perfectly able to be proven just
wrong about things!
@pyrit.lover: Accepting donations is a difficult topic for free (as in freedom)
open-source projects. During my time with Pyrit, I've already turned down
several offers of donations or paid, specific work on it. Money is a game
changer that - once involved - gives a completely different taste to everything
here. Right now I'm perfectly able to accept, turn down or just ignore all
contributions made to the project. Accepting money (or anything of value) would
change that while I'd like to keep it as it is (especially the "ignore" part
:-)).
Original comment by lukas.l...@gmail.com
on 20 Oct 2010 at 7:22
Well, more or less, pyrit is reaching its final limit, because of shortcomings in Python
or because it is an interpreted language and so on. As reported in these posts, we see that
hardware is growing and pyrit will not be able to keep up with it. Moving to pure C
does not seem to be the path to follow... so, what else? What is the plan? I ask
this because I am worried this software will not be able to grow... I am not
complaining; I feel like an uncle who worries because his nephew does not study enough
at school, but wishes the nephew will get the Nobel Prize in Medicine in the future.
Original comment by pyrit.lo...@gmail.com
on 27 Oct 2010 at 4:50
Issue 208 has been merged into this issue.
Original comment by hazema...@gmail.com
on 20 Nov 2010 at 1:50
Wouldn't it be a temporary solution for people who have twice as many
physical CPU cores as GPU cores to split the computing/preparation task? In
this case, 4 CPU cores would be used to handle 2 high-end GPU cores...
Original comment by kopierschnitte@googlemail.com
on 20 Nov 2010 at 8:50
Can we rule out every kind of file-I/O bottleneck in this issue? I know this
has been asked before, but I have done another set of tests while watching
iostat and I've noticed over 5000 tps (with peaks above 7000). I don't
think regular SATA HDDs can handle this heavy load...
In addition, there's a high iowait% while running the batch command.
Original comment by kopierschnitte@googlemail.com
on 28 Nov 2010 at 7:09
@73 Just a suggestion: did you try putting the source data in /dev/shm? It is a
RAM disk, and nothing could be faster than that.
Original comment by pyrit.lo...@gmail.com
on 29 Nov 2010 at 1:40
About speeding up pyrit: I see there is
http://morepypy.blogspot.com/2010/11/pypy-14-ouroboros-in-practice.html
Maybe we can test it with pyrit to see if it is possible to get better
performance than with python2.6.
Original comment by pyrit.lo...@gmail.com
on 29 Nov 2010 at 1:42
@74: Ok, done that ... No improvements spotted :-(
Original comment by kopierschnitte@googlemail.com
on 1 Dec 2010 at 5:22
I don't know how many times I have to repeat myself. The problem is with
pyrit's core (lack of multi-core support)... That's the main reason for all
the performance problems. Of course you may minimize the impact by "speeding up"
other parts of pyrit, but that's not the solution. In our test environment we
have checked different configurations (hardware and software) and the "scaling
problem" is the main issue.
Original comment by mmajchro...@gmail.com
on 2 Dec 2010 at 12:15
I've always read and understood your comments and yes, I'm also aware of the
performance issues caused by Python. I was only reporting about a high IO-load
when using pyrit with a "real" database instead of synthetic benchmarks. Under
different circumstances I would really suspect that a 20% iowait state would
slow things down dramatically but this wasn't the cause in this case.
Original comment by kopierschnitte@googlemail.com
on 2 Dec 2010 at 4:02
Has somebody done any recent tests? Any improvements with the new pyrit 0.3.0, kernel
2.6.33-35 and ati drivers 10.11?
Original comment by elec...@gmail.com
on 9 Dec 2010 at 3:06
Tested it with kernel 2.6.35-26 with ati 10.11 and pyrit 0.4.0 svn ... no
improvements found (as expected). I am still getting 140k PMKs during benchmark
and 70k under "real" conditions. But, if you follow the entire thread, neither
the kernel nor the ati drivers will fix this problem.
Original comment by kopierschnitte@googlemail.com
on 9 Dec 2010 at 5:45
Out of interest, what is the performance drop when using cowpatty passthrough?
Original comment by james0p0...@googlemail.com
on 13 Dec 2010 at 5:07
As I'm using passthrough (but without cowpatty), the performance drop is
exactly 50%. For me, it doesn't matter if I use passthrough or attack_db.
Original comment by kopierschnitte@googlemail.com
on 13 Dec 2010 at 5:25
@kopierschnitte
Try setting limit_ncpus to the same number as the GPUs you have.
I mean, if you have a 4-core CPU and an HD5970 (which has 2 GPUs),
then set limit_ncpus = 2.
This should keep pyrit from running 4 threads for the 4 CPU cores you have.
Please try it and report whether you get better results.
Original comment by pyrit.lo...@gmail.com
on 10 Jan 2011 at 11:21
Yes, I did this a few months ago and the "real" results (not benchmark)
are a little bit better when limiting the number of cpu cores to 2 (or even
1!). Currently, I get ~70k PMK/s without limiting and ~85k PMK/s with limiting
the number of cores to 1. When using limit_ncpus = 1 I get a few thousand PMKs
less.
But at the moment, I feel like the speed varies from day to day :-(
Original comment by kopierschnitte@googlemail.com
on 10 Jan 2011 at 6:57
Hi. How could I improve my benchmark? On evga680i + intel q6600 + hd5970 I get:
Computed 64974.07 PMKs/s total.
#1: 'CAL++ Device #1 'ATI CYPRESS'': 33718.4 PMKs/s (RTT 3.4)
#2: 'CAL++ Device #2 'ATI CYPRESS'': 36587.5 PMKs/s (RTT 2.5)
#3: 'CPU-Core (SSE2)': 359.0 PMKs/s (RTT 3.4)
#4: 'CPU-Core (SSE2)': 393.4 PMKs/s (RTT 3.3)
Can someone write some tips, like: run without a graphical interface, run on x64 Linux, overclock the CPU or GPU, or... limit CPUs? Btw, where can I set limit_ncpus? I don't want to break a record :) I just feel I can get more from my PC.
To set up pyrit I used this tutorial:
http://www.backtrack-linux.org/forums/backtrack-howtos/33227-ati-driver-|-stream-sdk-2-2-opencl-1-1-|-cal-|-cpyrit_calpp-|.html
Thanks for your time.
Original comment by minde.pi...@gmail.com
on 25 Jan 2011 at 3:03
Some time ago I noticed that a bigger buffer size means slower
execution. Most probably Python has problems with a huge number of memory
allocations. I think that it could be possible to speed up pyrit by looking
carefully at the queue management code. Maybe different data structures would
reduce the burden on the CPU. But the Python code is Lukas's baby, so I won't do anything
there.
For now, note that the calpp core uses a smaller buffer size by default. So
don't be surprised to see RTT values around 1.0. Also, in the new version there are
small improvements which should give some speedup.
Original comment by hazema...@gmail.com
on 4 Feb 2011 at 3:53
My idea is to completely change the way we do scheduling to the following, a
generalization of how the CALPP-plugin operates:
- The main scheduler class is a thread on its own. The class takes passwords into
a queue and actively sends portions of that to the devices by calling an
enqueue-function on the device-class. Notice that the current design is a
pull-layout where the devices call the scheduler; the new design is a
push-layout where the devices get actively called.
- The device-class has an internal queue that is protected by its own lock.
The enqueue-function prepares a set of passwords by doing the first round of
HMAC on the CPU and transferring the result to the device-memory. These prepared
workunits are kept in the device-queue.
- Every device has its own thread that lets go of the GIL, locks on the
queue, gets workunits and can immediately execute the kernel. The device-thread
does not have to re-acquire the GIL as long as its queue is filled.
- The scheduler polls the device from time to time (e.g. every 100ms) to get
back finished workunits, reconstructs the original workunit-layout (e.g. one
block of a thousand passwords may have been spread over several devices) and
returns the result on a call to its dequeue-function. The polling does in fact
introduce some lag; we can handle this however as the device keeps executing
during that time.
This approach solves the two problems that I think the current design has:
First, the preparation of a workunit and transfer from and to the device can be
truly coalesced with execution time on the GPU. Second, acquiring the GIL
introduces an uncertain amount of lag until it becomes available. The new
layout allows the device-thread to let go of the GIL and basically only
re-acquire it when it runs out of work and needs to return to the python
interpreter when shutting down. The time it takes to cycle execution of the
kernel is basically zero.
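A rough sketch of the push-layout described above, in plain Python. It only imitates the structure: real device threads would live in C and drop the GIL, which pure Python cannot show. `Device`, `Scheduler`, `ESSID` and `ROUNDS` are hypothetical names for illustration, not Pyrit's actual classes, and `solve()` here waits synchronously for one block at a time.

```python
import hashlib
import queue
import threading

ESSID = b"test"     # placeholder network name
ROUNDS = 16         # tiny stand-in for the real iteration count, to keep this fast

class Device:
    """Stand-in for a GPU: own thread, own input queue, own output queue."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()    # queue.Queue carries its own lock
        self.outbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def enqueue(self, block_id, part_no, passwords):
        # Called by the scheduler (push). Real code would do the first HMAC
        # round on the CPU here and copy the result to device memory.
        prepared = [pw.encode("utf-8") for pw in passwords]
        self.inbox.put((block_id, part_no, prepared))

    def _run(self):
        # Real code would let go of the GIL here for as long as the queue
        # is filled; this loop only mirrors the control flow.
        while True:
            item = self.inbox.get()
            if item is None:          # shutdown marker
                return
            block_id, part_no, prepared = item
            keys = [hashlib.pbkdf2_hmac("sha1", pw, ESSID, ROUNDS, 32)
                    for pw in prepared]
            self.outbox.put((block_id, part_no, keys))

    def shutdown(self):
        self.inbox.put(None)
        self.thread.join()

class Scheduler:
    """Pushes slices of a password block to the devices and reassembles results."""

    def __init__(self, devices, poll_interval=0.1):
        self.devices = devices
        self.poll_interval = poll_interval
        self.next_block_id = 0

    def solve(self, passwords):
        block_id = self.next_block_id
        self.next_block_id += 1
        n = len(self.devices)
        step = -(-len(passwords) // n) or 1        # ceiling division, at least 1
        for part_no, dev in enumerate(self.devices):
            dev.enqueue(block_id, part_no,
                        passwords[part_no * step:(part_no + 1) * step])
        parts = {}
        while len(parts) < n:                      # poll until every slice is back
            for dev in self.devices:
                try:
                    bid, part_no, keys = dev.outbox.get(timeout=self.poll_interval)
                except queue.Empty:
                    continue
                if bid == block_id:
                    parts[part_no] = keys
        # Restore the original order of the block.
        return [key for part_no in range(n) for key in parts[part_no]]
```

A caller would create a few `Device` objects, hand them to a `Scheduler`, call `solve()` with a list of passwords and finally call `shutdown()` on each device. A real scheduler would keep several blocks in flight instead of waiting for each one.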
Original comment by lukas.l...@gmail.com
on 4 Feb 2011 at 8:08
I was checking oclHashcat on our machine. If I understand correctly, to compute
a single PMK we have to perform 4096 SHA1 computations? oclHashcat was able to
compute from 8,000M SHA1/s to 17,000M SHA1/s. Even if computing a single PMK
requires 8*4096 SHA1 calculations, I should still be able to get around 244K PMK/s.
Original comment by mmajchro...@gmail.com
on 9 Feb 2011 at 8:17
It is actually 16,384 rounds of SHA1 per key. You are very welcome to supply a
better SHA1-kernel and solve issue 66.
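The round count can be checked from the structure of the key derivation, assuming the usual WPA setup (PBKDF2-HMAC-SHA1, 4096 iterations, 32-byte PMK) and an HMAC implementation that caches the inner and outer pad states. The SHA1 rates are the oclHashcat figures quoted earlier in the thread; the result is a rough upper bound, not a measurement.

```python
import hashlib
import math

ITERATIONS = 4096        # PBKDF2 iterations used for a WPA PMK
PMK_BYTES = 32           # PMK length
SHA1_DIGEST_BYTES = 20   # size of one PBKDF2 output block
SHA1_PER_HMAC = 2        # inner + outer hash once the pad states are cached

blocks = math.ceil(PMK_BYTES / SHA1_DIGEST_BYTES)     # 2 output blocks
sha1_per_pmk = blocks * ITERATIONS * SHA1_PER_HMAC    # 16384

def pmk_rate(sha1_per_second):
    """Upper bound on PMK/s for a given raw SHA1 rate."""
    return sha1_per_second / sha1_per_pmk

def pmk(passphrase, essid):
    """Reference PMK computed with the standard library."""
    return hashlib.pbkdf2_hmac("sha1", passphrase, essid, ITERATIONS, PMK_BYTES)

if __name__ == "__main__":
    print(sha1_per_pmk)    # 16384
    for rate in (8_000e6, 17_000e6):
        print(f"{rate:.0f} SHA1/s -> about {pmk_rate(rate):,.0f} PMK/s")
```

Under these assumptions 8,000M SHA1/s corresponds to somewhat under 500K PMK/s, which is well above what pyrit reports in this thread.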
Original comment by lukas.l...@gmail.com
on 9 Feb 2011 at 8:27
[deleted comment]
What is the point of optimizing the kernel source code if on my machine pyrit is
unable to fully utilize the current one? I have no actual way of testing it :)
Original comment by mmajchro...@gmail.com
on 12 Feb 2011 at 9:32
@comment 88
I'm not saying that the changes you want to make aren't good. It's probably a better
design for the task pyrit is now facing. But I don't think it will solve the
problem. All your changes are designed mostly to solve scalability issues on
multi-core systems.
The really big issue is that the queue management code is simply too slow (or the main
scheduler class in the new design). And the problem exists even on a 1-core
system.
I've done some tests with a pseudo-null core (a stripped calpp core).
Fully null core (just taking data from python structures):
On a 2.5GHz Pentium pyrit achieved 440K PMK/s.
Partial null core (taking data + initial data preparation done on the CPU):
On a 2.5GHz Pentium pyrit achieved 150K PMK/s.
Changing the number of null cores doesn't change anything - pyrit always achieves
the same speed. The number of CPU cores used is also always the same (only 1 core
is used - the other CPU cores are idle) - this shows that the GIL is an issue for
pyrit.
So in the ideal case (no data preprocessing on the CPU) pyrit can do 440K PMK/s -
as a password can have a max length of 64B, that gives <28MB/s. This is a really bad value.
I think that the queue handling code and data structures need a big redesign.
So back to the changes you proposed. If you just move the current queue handling
code into its own independent (without GIL issues) thread, you can expect a
~3x speedup. I don't think that's enough to make pyrit future-proof.
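The two figures in this comment follow from simple arithmetic, sketched here with the benchmark numbers given above. The function names are made up for illustration. In decimal megabytes the feed rate comes out just over 28, in MiB just under 27.

```python
MAX_PASSWORD_BYTES = 64   # upper bound on password length used above

def feed_rate_mb_per_s(pmk_per_second, bytes_per_password=MAX_PASSWORD_BYTES):
    """Password data consumed per second, in MB (10**6 bytes)."""
    return pmk_per_second * bytes_per_password / 1e6

def feed_rate_mib_per_s(pmk_per_second, bytes_per_password=MAX_PASSWORD_BYTES):
    """Same rate in MiB (2**20 bytes)."""
    return pmk_per_second * bytes_per_password / 2**20

full_null = 440_000      # PMK/s, fully null core
partial_null = 150_000   # PMK/s, null core plus CPU-side data preparation

if __name__ == "__main__":
    print(f"{feed_rate_mb_per_s(full_null):.2f} MB/s")         # 28.16 MB/s
    print(f"{feed_rate_mib_per_s(full_null):.2f} MiB/s")       # 26.86 MiB/s
    print(f"speedup bound: {full_null / partial_null:.2f}x")   # 2.93x
```

The ~3x estimate is the ratio between the two null-core results: taking preprocessing out of the GIL-bound path can at best lift 150K PMK/s to the 440K PMK/s ceiling.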
Original comment by hazema...@gmail.com
on 12 Feb 2011 at 10:17
We can probably do much better than that using better data structures. There is
too much cut/paste and casting going on; also, we don't need every password to
live as its own object but can use a more optimized container object. All of
this, however, is a secondary task.
Also remember that the time on the gpu and time on the cpu are truly
independent with this design. If the gpu has three seconds of time before it
needs the cpu again, we can actually handle 440*3 (to stay with your example).
For going beyond that, I think Pyrit's database / processing design needs a
completely different layout.
Original comment by lukas.l...@gmail.com
on 12 Feb 2011 at 10:58
> we can probably do much better than that using better data structures. There
> is too much cut/paste and casting going on; also, we don't need every password
> to live as its own object but can use a more optimized container object. All
> of this, however, is a secondary task.
For me it's not a secondary task. It's the primary one. For me the new architecture must be
able to sustain the new GPUs that will be available quite soon. It should also
allow us to squeeze all the performance out of current GPUs. The changes you
propose won't cut it.
> Also remember that the time on the gpu and time on the cpu are truly
> independent with this design. If the gpu has three seconds of time before it
> needs the cpu again, we can actually handle 440*3 (to stay with your example).
First of all, they aren't. GPU drivers require quite a lot of CPU cycles (at
least ATI drivers do). So the final performance is much lower than that. Besides,
the ~3x I'm talking about isn't 440*3 but 150*3. If you do some analysis
you will see that the new design can achieve only as much as the current design with a
fully null core. And 440K PMK/s as the limit for pyrit isn't much. Also, if you
include other losses (due to drivers, etc.) it's quite obvious that the new
design will achieve much less than 440K. And like I said, imho it's really not
enough to make pyrit future-proof.
Original comment by hazema...@gmail.com
on 13 Feb 2011 at 12:34
I'm testing some changes to the preprocessing loop in the computing cores. At the
moment I see ~180% CPU usage with 2 computing cores. This "fix" also allows
some (not optimal) scaling with an increasing number of CPU cores.
The current preprocessing loop in all computing cores looks like this:
    start of core solve function / block all python threads
    while data available do
        take data from python
        openssl computations ( really time consuming )
    done
    unblock threads
    start gpu/cpu computations
    end of solve
The version I'm testing now is:
    start of core solve function / block all python threads
    while data available do
        take N data from python to C array
        unblock python threads
        do N openssl computations from C array
        block python threads
    done
    unblock threads
    start gpu/cpu computations
    end of solve
The problem for now is the selection of N. I'm achieving good results with N>10000
(acquiring the GIL is obviously time consuming), but I'm not sure whether such a big value
will suit all CPUs.
This change also solves the "problem" of big buffers. In the current solution,
preprocessing of huge buffers for fast GPUs blocks the queue management/data
gathering thread for too long, causing some strange interaction which
translates into reduced performance.
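The batching idea can be imitated in plain Python, with an ordinary lock standing in for the GIL and a single SHA1 standing in for the openssl work. This is only a sketch of the control flow under those assumptions; the real loop lives in the C part of the computing cores, and `preprocess` and `interpreter_lock` are hypothetical names.

```python
import hashlib
import threading

interpreter_lock = threading.Lock()   # stand-in for the GIL, illustration only

def preprocess(source, batch_size, essid=b"test"):
    """Copy batch_size items at a time under the lock, hash them outside it.

    Returns the digests and the number of times the lock was taken.
    """
    results = []
    acquisitions = 0
    it = iter(source)
    while True:
        with interpreter_lock:                 # "block python threads"
            acquisitions += 1
            batch = []
            for pw in it:                      # "take N data from python"
                batch.append(pw)
                if len(batch) >= batch_size:
                    break
        if not batch:                          # no data left
            break
        # "unblock python threads": the expensive part runs without the lock.
        results.extend(hashlib.sha1(essid + pw).digest() for pw in batch)
    return results, acquisitions

if __name__ == "__main__":
    words = [b"pw%d" % i for i in range(100_000)]
    for n in (1, 100, 10_000):
        _, taken = preprocess(words, n)
        print(f"N={n}: lock taken {taken} times")
```

A larger N means fewer lock round-trips but a larger temporary array per computing core, which is the trade-off behind the choice of N discussed above.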
Original comment by hazema...@gmail.com
on 13 Feb 2011 at 11:35
Have you finished your pure C/C++ cal benchmark? We were discussing it some
time ago. That way we would be able to just check on a few machines how much
faster pyrit needs to be...
Original comment by mmajchro...@gmail.com
on 13 Feb 2011 at 12:21
Hi.
I have prepared two versions of the SHA1 kernel. Both are based on pyrit's. I
tested them using my benchmark (calculating 633328 hashes 5 times). Kernel
bak_sha1_normal.cl was 25% faster on my machine than the original one, whereas
bak_sha1_int4.cl was 33% faster. The kernels are modified for benchmarking
purposes, so they will not work out of the box with pyrit. Anyway, I wanted to show
you my ideas. Maybe someone will make them even faster :)
Original comment by mmajchro...@gmail.com
on 21 Apr 2011 at 9:01
Attachments:
Guys, we MUST fix this issue to make pyrit a powerful tool in the future. Forget
about legacy devices and the like. People who use pyrit will most likely get
themselves a nice 6990 and go from there.
I also suggest leaving donations for Lukas so he can work with this expensive
hardware too.
PERFORMANCE MUST SCALE ON HIGH END HARDWARE !
DONATIONS MUST BE ACCEPTED !
Thank you very much !
Original comment by jukanma...@gmail.com
on 10 May 2011 at 12:46
Guys, is there any news?
This problem is becoming serious and pyrit has not been updated in the last 6 months: maybe
a massive rewrite of the code is ongoing? Please inform us.
Original comment by pyrit.lo...@gmail.com
on 4 Nov 2011 at 2:58
Original issue reported on code.google.com by
pyrit.lo...@gmail.com
on 1 Aug 2010 at 2:17