Closed by GoogleCodeExporter 8 years ago
Jim, the CUDA compiler produces slightly different code when compiling for 32 or 64 bit targets. My first guess is that what you are seeing is just small floating point differences. The 64-bit target supports 64-bit pointers, which may result in larger register space requirements and slightly different optimization strategies. I'll try to confirm this later tonight or tomorrow.
Original comment by cast...@gmail.com
on 26 May 2008 at 9:01
Oh, I'm sorry Ignacio, I didn't make it clear that I don't have CUDA enabled for either of the executables (or so I think, I might have made an error). This should be for the straight-up CPU implementation. And in some cases the differences are quite noticeable.
Original comment by jim.tila...@gmail.com
on 26 May 2008 at 9:03
Oh, I see. I would still assume it's just floating point differences. The compiler has twice as many SSE registers when targeting x64, so it could lay down the expressions in a slightly different way. However, it could be something else. I remember Simon mentioned a similar issue on some targets. Let me check with him.
Original comment by cast...@gmail.com
on 26 May 2008 at 9:20
The issue I discovered was that RCPSS is implemented differently on Intel and AMD hardware, which could result in different encodings on these two platforms. A workaround is to mask off some of the low bits of the estimate, which I'm considering for the next squish release.
I can't see how a larger register file would alter the results (thankfully SSE registers have no hidden bits like the FPU ones), so I assume there must be some instruction differences between the two builds. Has anyone compared the asm for the inner loops between platforms?
Original comment by sidm...@gmail.com
on 29 May 2008 at 12:25
I don't know about MSVC, but gcc produces very different code when compiling for x86 and x64 targets. Last time I checked, NVTT was a bit faster in 64-bit mode. There might be other differences, but the most important one is the doubled register count.
I'll have a closer look at the code over the weekend.
I have not checked the output assembly, although as I think I saw in the squish library, most of the calculations are done through SSE, right? It was my understanding that the SSE stuff is pretty much identical across the two modes (32/64 bit). It's not as if I'm actually switching processors; it's the same machine I'm running the different executables on... so I don't think it's some instructions that are different. The larger register file should only matter if we are compiling the library with the unsafe math transformations enabled (I'm not sure if we are), but under just normal ANSI rules there should be no differences in the compiler's emitted code, regardless of the larger register file... or so I think :)
I'm very curious if you find anything, Ignacio, as this really gives me a slightly uneasy feeling as we're transitioning from 32-bit to exclusively 64-bit...
Original comment by jim.tila...@gmail.com
on 31 May 2008 at 6:35
Yes, the code is compiled with the "precise" floating point model.
I had a closer look at the asm code for the 32 and 64 bit targets, and while instructions are scheduled very differently, I analyzed a few expressions and they seem to be coded the same way.
I guess I'll have to debug it side by side in order to find out where the computations diverge. I'll let you know if I find anything.
Original comment by cast...@gmail.com
on 19 Jun 2008 at 1:19
OK, I've located the problem. The function ComputePrincipleComponent produces slightly different results on the 64 and 32 bit targets.
This function uses standard floating point arithmetic (no SSE intrinsics). The 32-bit compiler produces code that uses x87 instructions, even when SSE2 is enabled. The 64-bit compiler, on the other hand, always uses SSE instructions. This obviously produces different results.
There are several possible workarounds. The ideal solution would be to vectorize the functions ComputePrincipleComponent and ComputeWeightedCovariance. This is not too hard, and would even produce slightly faster code.
A simpler workaround is to reduce the x87 floating point precision. That seems to work, at least with this particular code, where there are no transcendental functions.
I'm not gonna do anything to fix this on my side, but applications can set the floating point flags themselves:
_controlfp(_PC_24, _MCW_PC);
Let me know if that works for you.
Original comment by cast...@gmail.com
on 19 Jun 2008 at 9:30
I've added a wiki page explaining the issue and the workaround:
http://code.google.com/p/nvidia-texture-tools/wiki/CompressionDifferences
Original comment by cast...@gmail.com
on 19 Jun 2008 at 9:50
Thanks Ignacio for tracking this one down. Much like you, I thought that the regular 32-bit version also only had SSE instructions in the implementation, but it seems that there were a few regular scalar floating point instructions.
No worries on my part, we're all 64 bit here :)
Original comment by jim.tila...@gmail.com
on 20 Jun 2008 at 5:46
Original issue reported on code.google.com by
jim.tila...@gmail.com
on 24 May 2008 at 12:54