brian-team / brian2

Brian is a free, open source simulator for spiking neural networks.
http://briansimulator.org

Performance issue in C++ standalone (probably platform/compiler-specific) #803

Closed mstimberg closed 7 years ago

mstimberg commented 7 years ago

While running a standard HH model (see code at the end), I noticed that the example slowed down dramatically when I switched to standalone mode. I could reproduce this issue on two Linux machines, but not on a Linux cluster where I had SSH access. The issue basically goes away when re-introducing -ffinite-math-only (which we removed as a standard option because it can hide NaN values).

I tried to dig into it a bit and find out what the differences to the code running on the cluster are, and it pointed to AVX instructions -- the code on the cluster seemed to use an SSE exp function, while the code on my machine used an AVX variant. Using -mno-avx, or removing -march=native, does indeed make things a bit faster, but still not as fast as runtime. This is all very weird, given that runtime mode seems to be using AVX instructions without any speed penalty...

Looking at the generated code, I really don't see any significant difference between the runtime and the standalone code, except that standalone code replaces constants by their values (which should make things faster, if anything...). This is all really confusing. I could understand if AVX instructions made things slower instead of faster (apparently switching between AVX and SSE can slow things down, and proper AVX support for mathematical functions seems to depend on special libraries, i.e. we'd probably have to use the Intel compiler and link against the MKL or something like that), but why on earth does this not affect runtime weave/cython...?!

I guess this is Linux/gcc/whatever-specific, or can you confirm this on Windows?

Here are quick benchmark values for various compiler options (1 s of the model below; basically all time is spent in the state updater):

| | runtime (weave) | runtime (cython) | standalone |
| --- | --- | --- | --- |
| default | 9s | 10s | 48s (!) |
| -mno-avx | 9s | 10s | 30s (!) |
| -ffinite-math-only | 8s | 10s | 7s |
```python
defaultclock.dt = 0.01*ms

El = 10.613*mV
ENa = 115*mV
EK = -12*mV
gl = 0.3*msiemens/cm**2
gNa0 = 120*msiemens/cm**2
gK = 36*msiemens/cm**2
C = 1*uF/cm**2

eqs = '''
dv/dt = (gl * (El-v) + gNa * m**3 * h * (ENa-v) + gK * n**4 * (EK-v) + I) / C : volt
I : amp/meter**2
gNa : siemens/meter**2
dm/dt = alpham * (1-m) - betam * m : 1
dn/dt = alphan * (1-n) - betan * n : 1
dh/dt = alphah * (1-h) - betah * h : 1
alpham = (0.1/mV) * (-v+25*mV) / (exp((-v+25*mV) / (10*mV)) - 1)/ms : Hz
# alpham = alpha_fun(v) : Hz
betam = 4 * exp(-v/(18*mV))/ms : Hz
alphah = 0.07 * exp(-v/(20*mV))/ms : Hz
betah = 1/(exp((-v+30*mV) / (10*mV)) + 1)/ms : Hz
alphan = (0.01/mV) * (-v+10*mV) / (exp((-v+10*mV) / (10*mV)) - 1)/ms : Hz
betan = 0.125*exp(-v/(80*mV))/ms : Hz
'''

axon = NeuronGroup(500, eqs, method='exponential_euler', threshold='v>50*mV', refractory='v>50*mV')
```
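For reference, a minimal sketch of how such timings can be measured (note that in standalone mode the wall-clock time around run() also includes code generation and compilation, so the per-simulation time is best taken from the report output):

```python
import time
from brian2 import *

set_device('cpp_standalone')  # comment out to benchmark a runtime target

# ... (model definition from above) ...

start = time.time()
run(1*second, report='text')
print('total wall-clock time: %.1f s' % (time.time() - start))
```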
thesamovar commented 7 years ago

On my Windows laptop, using MSVC to compile, I get weave 16.5s, cython 15s, standalone 12s. All default compiler switches: ['/Ox', '/w', '/arch:SSE2', '/MP'].

thesamovar commented 7 years ago

Same results but faster on my desktop computer. It's only using SSE2 though. I think that's because on Windows I only have the 'MSVC for Python' compiler, which is an old version that doesn't support AVX. A newer version would potentially have the same issue with AVX.

thesamovar commented 7 years ago

A couple of ideas:

Did you check the exact flags sent to the compiler for both runtime and standalone? It may be that weave adds in some flags that aren't there on standalone?

Do the values in this model decay to zero during the run? If so, it could be the old denormal/subnormal numbers issue. We have a bit of code somewhere for gcc that forces them to round to zero, which can speed things up a lot -- perhaps this isn't making it to the standalone code?

mstimberg commented 7 years ago

OK, I'm giving up... I did not manage to pin down the true cause of this strange behaviour. Thanks for your suggestions, though.

> Did you check the exact flags sent to the compiler for both runtime and standalone? It may be that weave adds in some flags that aren't there on standalone?

weave does indeed add a few more flags (e.g. -fwrapv), but adding them to standalone did not make any difference.

> Do the values in this model decay to zero during the run? If so, it could be the old denormal/subnormal numbers issue? We have a bit of code somewhere for gcc that forces them to round to zero which can speed things up a lot - perhaps this isn't making it to the standalone code?

Also a good point, but we don't enable the code for that by default (it's a preference), neither for weave nor for standalone. Enabling it did not change anything.

So, I think it is some gcc quirk where for some combination of equations/parameters optimisations go wrong. I can reproduce it with two different gcc versions, though (4.8.5 and 5.4.0), so it's not a bug introduced with the recent release. However, it occurs very rarely -- I do not see the same issue with any other example I tried, including the very similar COBAHH example.

To end on a positive note, I did find a way to work around it on my machine: using clang instead of gcc. Conveniently, we can switch this with a preference:

```python
prefs.devices.cpp_standalone.extra_make_args_unix += ['CC=clang++']
```
thesamovar commented 7 years ago

Weird. OK, well maybe just put that under the known issues / workarounds page?

mstimberg commented 7 years ago

> Weird. OK, well maybe just put that under the known issues / workarounds page?

Yep, let's do that.

ghost commented 7 years ago

Hi guys! I was not able to reproduce the issue with the latest (2.1+git) brian2 and gcc 4.8.4/5.4.1/6.2.0 on my Ubuntu 14.04 machine. Both python2 and python3 sessions produce the expected results (numpy > cython > weave (python2) > standalone). Standalone builds produced by gcc (the versions mentioned above) are close to the clang-3.9 build in terms of performance. Marcel, could you upload the full project (including the generated output/ folder) somewhere?

mstimberg commented 7 years ago

Hi @xj8z, I'd be more than happy if you could shed some light on this issue! I uploaded my generated code here: http://s000.tinyupload.com/index.php?file_id=53613789273553615794 It also includes the compiled files, in case you want to have a look at them with ldd or something like that. This was with g++ 5.4.0 on Ubuntu 16.04. The Python code is exactly the one from the initial comment, with

```python
from brian2 import *
set_device('cpp_standalone')
# ... (code from above)
run(1*second, report='text')
```

It might also be relevant on what machine you compile it (e.g. whether your CPU supports AVX/AVX2). In my case, I am on a quad-core Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz. This is what /proc/cpuinfo reports for the flags:

```
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
```
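(For reference, a quick way to check for AVX support programmatically -- a minimal sketch that parses /proc/cpuinfo, so Linux-only:)

```python
# Collect the CPU feature flags reported by the kernel (Linux only)
flags = set()
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('flags'):
            flags.update(line.split(':', 1)[1].split())

print('AVX supported: ', 'avx' in flags)
print('AVX2 supported:', 'avx2' in flags)
```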
ghost commented 7 years ago

This issue is not compiler related. Identical 'main' binaries compiled for AVX (AVX2 is not required to reproduce the issue) demonstrate a 7x performance difference while running with different ld/libc/libm software stacks. To be more specific: if you move a dynamically compiled 'main' binary from Ubuntu 16.04 (glibc 2.23) to Ubuntu 14.04 (glibc 2.19), you can observe the 7x acceleration mentioned above.

My current understanding is that the glibc packaged with Ubuntu 16.04 fails to detect the CPU flags correctly (it does this when the application starts) and falls back to non-optimized versions of math functions (e.g. libm's exp), while the glibc from Ubuntu 14.04 detects the CPU flags correctly and runs optimized versions (e.g. __ieee754_exp_avx) instead. That's why the same binary is 7x faster on Ubuntu 14.04 -- it just uses the optimized libm routines.

To avoid this dynamic misbehaviour of the glibc packaged with 16.04, you may try to statically compile the 'main' binary (manually add '-static' to LFLAGS in the makefile). We may use this dirty fix while I'm trying to fix the real problem. Stay tuned.

mstimberg commented 7 years ago

Great, many thanks for looking into this! Statically linking does indeed solve the problem on my machine; that's a much better workaround than switching the compiler (not everyone has clang installed). You can actually do this from within the Python script:

```python
prefs.codegen.cpp.extra_link_args += ['-static']
```

Here are two links about the same problem (the Ubuntu bug report did not get any response, though):
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1598618
http://stackoverflow.com/questions/38172066/the-program-runs-3-times-slower-when-compiled-with-g-5-3-1-than-the-same-progr

ghost commented 7 years ago

That's a bug in Glibc. It was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25 [2]. Both exp() and pow() are impacted. Ubuntu 16.04, 16.10 and the not yet released 17.04 use a codebase with the bug. The standalone binary will suffer from performance degradation while running on these versions of Ubuntu. I'm not sure that the proper fix will be backported from Glibc 2.25 (the impact/risk ratio needs to be estimated first), which means that we need to come up with a suitable workaround.
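Since only glibc 2.23/2.24 are affected, a quick way to check the local version is Python's standard library (a minimal sketch; platform.libc_ver() reports the glibc version the Python binary is linked against):

```python
import platform

# ('glibc', '2.23') on Ubuntu 16.04 -> affected; 2.25 or later contains the fix
libname, version = platform.libc_ver()
print(libname, version)
```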

What happened

This bug hits the AVX-SSE transition penalty [3]. The 256-bit YMM registers used by AVX-256 instructions extend the 128-bit registers used by SSE (XMM0 is the low half of YMM0 and so on). Every time the CPU executes an SSE instruction after an AVX-256 instruction, it has to store the upper half of the YMM register to an internal buffer and then restore it when execution returns to AVX instructions. This operation is time consuming (40-80 cycles). The store/restore is required because old-fashioned SSE knows nothing about the upper halves of its registers and may damage them.

To avoid this issue, Intel introduced AVX-128 instructions, which operate on the same 128-bit XMM registers as SSE but take the upper halves of the YMM registers into account. Hence, no store/restore is required. Practically speaking, AVX-128 instructions are a new, smart form of SSE instructions which can be used together with full-size AVX-256 instructions without any penalty. Intel recommends using AVX-128 instructions instead of SSE instructions wherever possible.

To sum things up: it's okay to mix SSE with AVX-128, and AVX-128 with AVX-256. Mixing AVX-128 with AVX-256 is allowed because both types of instructions are aware of the 256-bit YMM registers. Mixing SSE with AVX-128 is okay because the CPU can guarantee that the upper halves of the YMM registers don't contain any meaningful data (how could one put anything there without using AVX-256 instructions?) and avoid doing the store/restore operation (why care about random trash in the upper halves of the YMM registers?). It's not okay to mix SSE with AVX-256, due to the transition penalty. But Glibc does exactly that.

You may ask why we care about vector instructions (SSE/AVX-128/AVX-256) if what we do in the example program is scalar computation. We care because scalar floating-point instructions are implemented as a subset of the SSE and AVX-128 instructions. They operate on a small fraction of a 128-bit register but are still considered SSE/AVX-128 instructions. And they suffer from the SSE/AVX transition penalty as well.

Now let's see what happens inside libm library when we call exp() from the main executable.

(a) When we call exp() from the binary compiled with -static -ffinite-math-only

This is the simplest scenario. The executable is statically linked, which means that no on-the-fly actions are required from the loader. Using finite-math-only operations allows libm to use __ieee754_exp_avx directly, without handling inf corner cases.

main (floating point code compiled as AVX-128)
  -> __ieee754_exp_avx (floating point code compiled as AVX-128) [4]

All the code is compiled as AVX-128. No penalty.

(b) When we call exp() from the binary compiled with -static -fno-finite-math-only

We asked libm to handle inf corner cases, so it calls __ieee754_exp_avx() from an additional inf-processing wrapper.

main (floating point code compiled as AVX-128)
  -> exp (floating point code compiled as SSE for portability) [5]
    -> __ieee754_exp_avx (floating point code compiled as AVX-128) [4]

Note that the exp() inf-processing wrapper uses SSE instructions. Glibc contains multiple variations of the __ieee754_exp function (__ieee754_exp_sse, __ieee754_exp_avx) but only a single exp(). To be able to use Glibc on non-AVX machines, exp() is compiled as SSE code.

Still no penalty, because we mix SSE with AVX-128. No AVX-256 means no problems.

(c) When we call exp() from the binary compiled without -static but with -fno-finite-math-only

The main binary is compiled as a dynamic binary, and on-the-fly actions are required from the loader to resolve exp() from libm during the first call to it (known as lazy linking). The inf-handling wrapper is also required.

main (floating point code compiled as AVX-128)
  -> _dl_runtime_resolve (uses AVX-256 instructions to push/pop AVX registers) [6]
    -> exp (floating point code compiled as SSE for portability) [5]
      -> __ieee754_exp_avx (floating point code compiled as AVX-128) [4]

The dynamic linker uses AVX-256 instructions to push the AVX registers to the stack at the very beginning of its operation and pop them back at the end. This was done to allow the symbol resolver to use (overwrite) some of these registers while looking for a requested symbol.

Now we mix SSE, AVX-128 and AVX-256 instructions. And we observe the SSE/AVX transition penalty.

Please note that the route mentioned in (c) will be hit by the main executable only once. Once the exp() symbol has been resolved by the dynamic loader, its address is cached, and further accesses to exp() won't require a call to _dl_runtime_resolve(). Hence, all other accesses to exp() will follow route (b).

But the key point here is that it doesn't really matter: all future accesses to exp() will suffer from the SSE/AVX transition penalty, even if they follow route (b). When _dl_runtime_resolve() uses AVX-256 instructions once during symbol lookup, it marks the upper (non-XMM) halves of the YMM registers as dirty. Dirty means that the upper half of a YMM register may contain some meaningful data, so the CPU needs to store/restore these bits when transitioning to SSE. But nothing drops this dirty flag during the whole program execution.

With this flag set, even a transition from SSE to AVX-128 (and back) requires storing/restoring the upper part of the register. The CPU faithfully stores the upper halves of the YMM registers while switching from AVX-128 to SSE (main -> exp() and back) and restores them while switching from SSE to AVX-128 (exp() -> __ieee754_exp_avx() and back). This transition happens multiple times for every exp() call. And it really hurts performance.

Fixing Ubuntu and other distros

As I said before, this issue has been fixed in Glibc 2.25 [2]. The fix is quite straightforward: the implementation of _dl_runtime_resolve() tries to avoid using AVX-256 instructions if possible. Unfortunately, many distros (including Ubuntu 16.04/16.10/17.04) use older versions of Glibc (2.23 and 2.24) which don't contain this fix. Ubuntu 16.04 is the key player here because it is a widely used LTS release. There is no way for an LTS release to update its Glibc from 2.23 to 2.25, because it may break binary compatibility and will probably introduce new bugs. The only thing we can do is to backport this specific fix from Glibc 2.25 to Glibc 2.23 and request an SRU (stable release update). The maintainers will decide if the impact/risk ratio of the fix is high enough to update a stable release which is in production use.

I'll try to discuss the possibility of an SRU with the distro maintainers but can't guarantee that their decision will be positive. I suppose that the issue with exp() has been around for so long because most developers compile their code with -ffast-math, which automatically sets -ffinite-math-only. And as we discussed before, -fno-finite-math-only is absolutely required to reproduce the issue with the SSE/AVX transition in exp(). It means that in the case of exp(), the 'impact' part of the impact/risk ratio is moderately small, because just a few of us will feel the difference.

The situation with pow() is much worse, though. According to my findings, it suffers from performance degradation even if the application has been compiled with -ffinite-math-only. This happens because the whole pow() procedure is implemented with SSE inside libm -- the SSE instructions are there even without an inf-wrapper. So, basically, we have pow(), which demonstrates performance degradation all the time, and exp(), which demonstrates it only when compiled with -fno-finite-math-only. Is fixing these issues worth touching a mission-critical component of a stable release? I don't know. I'd better commit a suitable workaround and then wait for an SRU decision.

Workarounds

There are many different ways to workaround the issue:

(1) building the main executable statically with the -static compilation flag

By doing a static compilation we avoid on-the-fly (lazy) linking and prevent the buggy _dl_runtime_resolve() from running. No AVX-256 instructions get executed. No penalty.

(2) running the main executable with LD_BIND_NOW=1

By setting this environment variable we force the dynamic linker to resolve all symbols during application startup. Again, no lazy linking and no call to the buggy _dl_runtime_resolve().

(3) compiling the main binary with -ffinite-math-only instead of -fno-finite-math-only

By using finite math only, we prevent the inf-wrapper from running. This solves the issue with exp(). But pow() still suffers from performance degradation, because its body consists of SSE instructions, not just its inf-handling wrapper.

(4) manually dropping the 'dirty' flag after calls to exp() and pow()

The special intrinsic __builtin_ia32_vzeroupper() can be used after a call to exp() or pow() to let the CPU know that we don't care about the upper halves of the YMM registers. It basically clears the dirty flag set by _dl_runtime_resolve(). This intrinsic is translated into the VZEROUPPER instruction, which is pretty lightweight to execute. Theoretically, you need to call it just once, after the first call to exp() or pow(), when _dl_runtime_resolve() has been involved. In practice, you may call it after every call to these functions; it shouldn't hurt performance and is much simpler to implement this way. The problem with this approach is that you need to know whether it's safe to call VZEROUPPER or not. If your code (the main binary) uses AVX-256 instructions, you may destroy meaningful data by manually zeroing the upper halves of all YMM registers.

(5) coming up with your own inf-processing wrapper and calling the finite version of exp() from it

The inf-processing wrapper implemented in libm is quite simple [5]. By making your own alternative to it, you can make sure that it gets compiled the same way as the main binary (as AVX code or as SSE code). All elements of the call chain (main -> inf-wrapper -> __ieee754_exp_avx) will then use AVX instructions. Hence, no penalty for exp(). Calls to pow() will still suffer.

I'd probably stick to (1) or (2). Or maybe (3), if you're sure that pow() is not used by popular models.
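For reference, a minimal sketch of applying (1) and (2) from within a Brian script (both are shown for illustration; one of them is enough, and the environment variable works because the standalone binary is launched as a subprocess which inherits it):

```python
import os
from brian2 import *

# Workaround (2): resolve all dynamic symbols at startup instead of lazily,
# so the buggy _dl_runtime_resolve() is not hit on the first call to exp()
os.environ['LD_BIND_NOW'] = '1'

set_device('cpp_standalone')

# Workaround (1): link the standalone binary statically (GNU toolchain only),
# avoiding run-time symbol resolution altogether
prefs.codegen.cpp.extra_link_args += ['-static']
```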

Q&A

(1) Only the HH model suffers from performance degradation. Why does this happen?

The HH model heavily relies on exp() from libm. That's why the performance degradation of exp() seriously affects the overall performance of the model. For other models the overall performance impact may be much smaller, sometimes even negligible.

(2) The main binary compiled with Clang doesn't suffer from performance degradation. Why does this happen?

Clang ignores the -fno-finite-math-only flag if it is passed together with -ffast-math. Hence, no inf-handling wrapper gets called. No penalty. That's probably a bug in Clang.

(3) Why does the -mno-avx build demonstrate better performance than the -mavx build, though still not as good as runtime?

The number of SSE/AVX transitions differs. In the case of the -mavx build you have two in each direction (avx in main -> sse in inf-wrapper -> avx in the finite exp handler). In the case of -mno-avx you have just one in each direction (sse in main -> sse in inf-wrapper -> avx in the finite exp handler). Since each exp() call has to return back to the main binary, you need to multiply the number of transitions by 2. So, basically, we have either 4 transitions (-mavx) or 2 transitions (-mno-avx). With runtime we have 0 transitions.

(4) The runtime build generated by weave/cython doesn't suffer from performance degradation. Why does this happen?

Weave (and probably Cython) compiles the target binary differently than standalone. Instead of compiling an executable, it compiles a shared object (library) and then links it into the running Python instance (using an import statement, which is translated into dlopen). I suppose that this leads to a different symbol resolution path which doesn't trigger the buggy _dl_runtime_resolve(). For instance, if the exp() symbol has already been resolved in the parent Python instance, the dynamic linker won't look for it a second time; it will simply copy the previously found address without a call to _dl_runtime_resolve().
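As a loose illustration of this resolution path (not Brian code): a shared library can be pulled into the running Python process the same way extension modules are, and its symbols are then resolved by the in-process dynamic loader rather than through a standalone executable's lazy PLT stubs:

```python
import ctypes

# dlopen libm into the running interpreter, as weave/cython extension
# modules implicitly do for the libraries they depend on
libm = ctypes.CDLL('libm.so.6')
libm.exp.restype = ctypes.c_double
libm.exp.argtypes = [ctypes.c_double]
print(libm.exp(1.0))  # 2.718281828459045
```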

Links

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
[4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/e_exp.c;h=6757a14ce1c132d3a4363badcfd59ca4a5435c27;hb=HEAD#l55
[5] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/w_exp.c;h=e61e03b3356803394d9f1fd9fb8fe88821686974;hb=HEAD#l24
[6] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l157

mstimberg commented 7 years ago

Wow, I did not expect such an in-depth analysis! Many thanks, this clears it up nicely (did you also add all this info somewhere else, e.g. in a Debian/Ubuntu bug report?). So let's consider the possible workarounds: I'd avoid (4) and (5) because they add quite a bit of glibc-specific code, can potentially break things (and frankly, I don't feel very confident doing this kind of low-level hacking), and all in all are quite a bit of work for a problem that will disappear "by itself" in the future (though, as you explained, it might still be around for quite a while in Ubuntu 16.04). I'd also exclude (3): we removed -ffinite-math-only for a reason (see the discussion in #750) -- numerically unstable simulations can otherwise give incorrect results instead of NaNs and a warning by Brian about potentially unstable integration.

In general, (1) would be the easiest solution, since we could just add -static to our extra_link_args preference. Unfortunately, compilation with weave then does not work anymore... We could of course have different compiler arguments for standalone and weave, but it would be great if we could avoid that.

So, maybe (2) is the best option. We could have a new general preference to set environment variables during execution, which we would set by default to LD_BIND_NOW=1; this would work fine with both standalone and weave. This preference might also be useful in the future when users want to further tweak things. If I understood ld's documentation correctly, compiling our executable with -Wl,-znow should have the same effect as LD_BIND_NOW, but when I tested it, the simulation was still slow and LD_DEBUG showed that it was still binding lazily (even though readelf showed that the BIND_NOW flag was set correctly in the executable) -- no idea why.
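For what it's worth, the binding behaviour can be checked by running the standalone binary with the glibc loader's diagnostics enabled (a sketch; LD_DEBUG output goes to stderr, and './main' in the 'output' directory is the standalone default):

```python
import os
import subprocess

# Ask the dynamic loader to trace symbol bindings while running the binary
env = dict(os.environ, LD_DEBUG='bindings')
result = subprocess.run(['./main'], cwd='output', env=env,
                        stderr=subprocess.PIPE, universal_newlines=True)
print(result.stderr[:2000])
```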

As a side note, replacing glibc's libm with openlibm leads to a performance increase of ~30% for the example we discussed here (but without LD_BIND_NOW, it is still slow)! This doesn't even need a recompilation; you can use LD_PRELOAD (so this even works for the numpy target without recompiling numpy).
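For the standalone binary this can be scripted as well (a sketch: the openlibm path is an assumption and varies between systems, and for the runtime/numpy targets LD_PRELOAD has to be set before the Python process itself starts):

```python
import os
import subprocess

# Substitute glibc's libm with openlibm for the standalone binary only;
# the library path below is hypothetical -- adjust it for your system
env = dict(os.environ, LD_PRELOAD='/usr/lib/x86_64-linux-gnu/libopenlibm.so')
subprocess.run(['./main'], cwd='output', env=env, check=True)
```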

BTW: when we acknowledge your work on this in the release notes, should we use your GitHub handle or your real name? In the latter case, you'd have to give it to us :smile:

Either way, many thanks again -- my workaround to use clang was not a good workaround, it seems...

ghost commented 7 years ago

I just opened a bug against the glibc package in Ubuntu: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280

The situation seems to be much worse than we initially thought. A huge number of math functions provided by libm are affected by the AVX/SSE transition penalty in one way or another. Routines which have an AVX-optimized implementation (exp, log, sin/cos/tan) experience a slowdown when called from SSE-only code (generated by gcc -march=x86-64, which is the default for Ubuntu packages). Routines which don't have an AVX-optimized implementation and rely on the general-purpose SSE implementation (pow, exp2/exp10, log2/log10, sincos, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh) experience a slowdown when called from AVX-optimized code (generated by gcc -march=native on AVX-capable machines). I believe that this issue is worth fixing in 16.04. Thanks a lot for discovering this bug and making this information public.

I didn't mention -Wl,-z,now in the workarounds section because it didn't work for me either. It seems to me that -Wl,-z,now and LD_BIND_NOW=1 try to achieve the same goal but do it differently. I can see (using gdb) that -Wl,-z,now indeed does the job: the first call to exp() goes directly to the __exp() wrapper without a hop to _dl_runtime_resolve(). But I suspect that -Wl,-z,now does the pre-run symbol resolution using our good friend _dl_runtime_resolve(). While it does the job (symbols are resolved before the application starts), it doesn't fix the bug, because no matter when _dl_runtime_resolve() gets called, it provokes the AVX/SSE transition penalty. LD_BIND_NOW probably uses some other routine to do the symbol resolution.

I updated my GitHub profile. Feel free to use my real name instead of a nickname in the release notes. Thanks!

mstimberg commented 7 years ago

OK, great. I marked myself as affected by the Ubuntu bug, so its status is now "confirmed". Hopefully it gets attention from the maintainers soon.

I'll try to update Brian's documentation and add the LD_BIND_NOW workaround soon. Thanks again for spending so much time investigating this issue.

ghost commented 7 years ago

Fedora 24/25 is also affected by this bug. I just filed a bug there as well: https://bugzilla.redhat.com/show_bug.cgi?id=1421121

RHEL7 contains quite an old Glibc version and doesn't suffer from the performance degradation. The new RHEL8 (ETA 2018) will probably use Glibc 2.25 (or newer), which already contains the fix.

denisalevi commented 7 years ago

Since I have close to no knowledge about low-level instruction stuff, just to make sure: AVX extensions are only relevant for CPU instructions and have nothing to do with GPU instructions, is that right? So CUDA code won't use them or won't be affected by this bug?

mstimberg commented 7 years ago

> AVX extensions are only relevant for CPU instructions and have nothing to do with GPU instructions, is that right? So CUDA code won't use them or won't be affected by this bug?

Yes, this is strictly about CPU code (and additionally needs the combination of specific hardware and a specific version of glibc on a Linux system).

denisalevi commented 7 years ago

Ok, thanks.