dansarie / sboxgates

Program for finding low gate count implementations of S-boxes.
GNU General Public License v3.0

MPI run fails assertion when running with -l #2

Closed aczid closed 5 years ago

aczid commented 6 years ago

Hi Marcus, thank you very much for this nice tool and the associated research. I was able to successfully run the regular, non-LUT, non-MPI mode of the tool to completion on my laptop. But when I gave the MPI version a try to search for LUT-based solutions as in your example, I ran into a failed assertion almost immediately.

Here is an output log: https://pastebin.com/raw/qzXHzYe5

I hope you'll find the time to take a look at what's going wrong here. P.S. Congrats on your graduation!

dansarie commented 6 years ago

Hi! This is indeed a bug and I'll take a look at it as soon as possible. A quick workaround might be to try a different number of MPI processes via the -n flag to mpirun.

Be aware that the MPI based LUT search has very high complexity and will probably take weeks or more to finish on a laptop. (I used 1024 cores on a high performance cluster for my thesis research and it still took hours.) The non-MPI LUT search has lower complexity since it only tries to add one LUT at a time. I realize this is not apparent from the documentation.

aczid commented 6 years ago

Thanks for the quick reply! Good to hear it's at least a known bug.

That's a fair warning; I noticed LUT generation took about a week on my laptop to finish bit 0. I have received access to a 128-core POWER8 machine, and have removed your x86intrin.h dependency (albeit in a rather hackish way) to get it to build there. At first I was worried my own changes were causing the failed assertion, but the non-LUT version seems to work okay on both platforms.

About your suggestion: on this 'big' machine it can sometimes take about an hour for the failed assertion to occur. I don't really feel confident in just randomly varying the -n value. Do you maybe have some more intuition about this?

dansarie commented 6 years ago

The crash is related to message passing between MPI instances, so it should (hopefully) not be affected by your changes.

When I suggested the workaround, I assumed that you were using a computer with a small number of cores. On a system with more cores, testing a different number of instances will probably just shift when the bug appears, more or less at random.

dansarie commented 6 years ago

Was (finally) going to take a serious look at this today, but the pastebin has unfortunately expired. Could you please post a new output log?

aczid commented 6 years ago

Hey, thanks for taking the time. Sorry for it disappearing on you like that. Here's a fresh pastie that shouldn't expire: https://pastebin.com/raw/3h5AvFKa Then again, I assume you've probably reproduced the behavior by now.

aczid commented 6 years ago

These comments could be made into separate issues, but I don't want to clutter your issue list. I was hoping to fix this issue myself, but the few times I've built up the courage, I eventually got discouraged by the sheer complexity of what I was reading, despite your good effort at commenting what's going on. I'm afraid I haven't managed to fully read through the paper yet; instead I've tried to follow along in your code. In the meantime I've had some more ideas and was wondering what your thoughts are.

As I was looking over the code again, it occurred to me that maintainability could really be helped by simplifying the high-level design and layout of the code. Inlining, unrolling and ifdef hacks might give a slight extra boost as micro-optimizations, but to keep the software usable in the long term, it would in my opinion be better to modularize the code more, so that its working parts can be understood more clearly while reading it.

I think a more modular version of the software could also be applied to functions other than 8-to-8 S-boxes, in other cryptosystems that use similar nonlinear functions. For example, the 5-to-1 nonlinear filter functions used by the Crypto-1 stream cipher, as found in MIFARE Classic RFID cards, could be double-checked or further optimized. If that's possible, it seems they would be much easier to find.

What is your estimation of feasibility to port (a more modular version of) the algorithm you've implemented to a GPU computing platform like OpenCL?

dansarie commented 6 years ago

Thanks for your comments! First of all: feel free to add these as separate issues – that's what the issues list is there for.

I'll see if I can find some time to work on this during the summer.

Simplification / modularization

Yes, the code layout is abysmal. This is probably the biggest issue with the code at the moment. Some of the design issues, such as the ifdefs, were necessary to get the code to work both with and without MPI. I may have to remove either the MPI or the non-MPI functionality to improve both readability and testability. Other issues will resolve themselves once I improve the generalization.

Generalization

I've actually got this planned. First, allowing arbitrary sets of Boolean functions to be used instead of just NOT, AND, OR, XOR, and ANDNOT would allow tailoring for various processor architectures, FPGAs, GPUs, etc. Second, it would be easy to allow arbitrary numbers of input and output bits, provided the number of input bits is at most eight. (More than eight input bits would require some additional coding and would slow the algorithm down.)

Portability

This will probably be hard. I've actually put some thought into this before and come to the conclusion that the basic algorithm is fairly hard to parallelize.

aczid commented 6 years ago

Good to hear you're thinking along the same lines. Once you get the first two issues out of the way, portability can always be reconsidered. I would be in favor of targeting the pthreads interface, which can probably be made compatible with MPI with little effort. I look forward to seeing what you make of it.

dansarie commented 5 years ago

Posting the output log here so it doesn't disappear again.

$ bash build-mpi.sh && mpirun ./sboxgates-mpi -l -o 0
Generating graphs for output 0...
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 3
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 3
[   0] Search 5.
[   0] Search 7.
[   0] No LUTs found. Num gates: 0
[   0] Search 5.
[   0] Search 7.
sboxgates-mpi: sboxgates.c:579: get_search_result: Assertion `*send_req == MPI_REQUEST_NULL' failed.
[laptop:16881] *** Process received signal ***
[laptop:16881] Signal: Aborted (6)
[laptop:16881] Signal code:  (-6)
[laptop:16881] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x13150)[0x7fa919693150]
[laptop:16881] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa9192d70bb]
[laptop:16881] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16d)[0x7fa9192d8f5d]
[laptop:16881] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2ef17)[0x7fa9192cef17]
[laptop:16881] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2efc2)[0x7fa9192cefc2]
[laptop:16881] [ 5] ./sboxgates-mpi(+0x351e)[0x5611c042f51e]
[laptop:16881] [ 6] ./sboxgates-mpi(+0x680c)[0x5611c043280c]
[laptop:16881] [ 7] ./sboxgates-mpi(+0x9330)[0x5611c0435330]
[laptop:16881] [ 8] ./sboxgates-mpi(+0x9cef)[0x5611c0435cef]
[laptop:16881] [ 9] ./sboxgates-mpi(+0x9c56)[0x5611c0435c56]
[laptop:16881] [10] ./sboxgates-mpi(+0x9c56)[0x5611c0435c56]
[laptop:16881] [11] ./sboxgates-mpi(+0x9c56)[0x5611c0435c56]
[laptop:16881] [12] ./sboxgates-mpi(+0xaca6)[0x5611c0436ca6]
[laptop:16881] [13] ./sboxgates-mpi(+0x275b)[0x5611c042e75b]
[laptop:16881] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fa9192c11c1]
[laptop:16881] [15] ./sboxgates-mpi(+0x30ea)[0x5611c042f0ea]
[laptop:16881] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node laptop exited on signal 6 (Aborted).
--------------------------------------------------------------------------
aczid commented 5 years ago

Hey, this is great news! I see you've also added many other improvements. It seems to work flawlessly for my use case now; it's currently running on 128 POWER8 cores. Are you interested in me submitting a PR to remove the x86intrin.h dependency?

dansarie commented 5 years ago

Good to hear! Sure, submit a PR. I'll merge it if it doesn't change code structure and performance too much.