Closed by GoogleCodeExporter 8 years ago
Jim, the CUDA compiler produces slightly different code when compiling for 32 or 64 bit targets. My first guess is that what you are seeing is just small floating point differences. The 64-bit target supports 64-bit pointers, which may result in larger register space requirements and slightly different optimization strategies. I'll try to confirm this later tonight or tomorrow.
Original comment by cast...@gmail.com
on 26 May 2008 at 9:01
Oh, I'm sorry Ignacio, I didn't make it clear that I don't have CUDA enabled for either of the executables (or so I think, I might have made an error). This should be for the straight-up CPU implementation. And in some cases the differences are quite noticeable.
Original comment by jim.tila...@gmail.com
on 26 May 2008 at 9:03
Oh, I see. I would still assume it's just floating point differences. The compiler has twice as many SSE registers when targeting x64, so it could lay down the expressions in a slightly different way. However, it could be something else. I remember Simon mentioned a similar issue on some targets. Let me check with him.
Original comment by cast...@gmail.com
on 26 May 2008 at 9:20
The issue I discovered was that RCPSS is implemented differently on Intel and AMD hardware, which could result in different encodings on these two platforms. A workaround is to mask off some of the low bits of the estimate, which I'm considering for the next squish release.
I can't see how a larger register file would alter the results (thankfully SSE registers have no hidden bits like the FPU ones), so I assume there must be some instruction differences between the two builds. Has anyone compared the asm for the inner loops between platforms?
Original comment by sidm...@gmail.com
on 29 May 2008 at 12:25
I don't know about MSVC, but gcc produces very different code when compiling for x86 and x64 targets. Last time I checked, NVTT was a bit faster in 64-bit mode. There might be other differences, but the most important one is the doubled register count.
I'll have a closer look at the code over the weekend.
I have not checked the output assembly, although as I think I saw in the squish library, most of the calculations are done through SSE, right? It was my understanding that the SSE stuff is pretty much identical across the two modes (32/64 bit). It's not as if I'm actually switching processors; it's the same machine I'm running the different executables on... so I don't think it's some instructions that are different. The larger register file should only matter if we are compiling the library with the unsafe math transformations enabled (I'm not sure if we are), but under just normal ANSI rules there should be no differences in the compiler's emitted code, regardless of the larger register file... or so I think :)
I'm very curious if you find anything, Ignacio, as this really gives me a slightly uneasy feeling as we're transitioning from 32-bit to exclusively 64-bit...
Original comment by jim.tila...@gmail.com
on 31 May 2008 at 6:35
Yes, the code is compiled with the "precise" floating point model.
I had a closer look at the asm code for the 32 and 64 bit targets, and while instructions are scheduled very differently, I analyzed a few expressions and they seem to be coded the same way.
I guess I'll have to debug it side by side in order to find out where the computations diverge. I'll let you know if I find anything.
Original comment by cast...@gmail.com
on 19 Jun 2008 at 1:19
OK, I've located the problem. The function ComputePrincipleComponent produces slightly different results on the 64 and 32 bit targets.
This function uses standard floating point arithmetic (no SSE intrinsics). The 32-bit compiler produces code that uses x87 instructions, even when SSE2 is enabled. The 64-bit compiler, on the other hand, always uses SSE instructions. This obviously produces different results.
There are several possible workarounds. The ideal solution would be to vectorize the functions ComputePrincipleComponent and ComputeWeightedCovariance. This is not too hard, and would even produce slightly faster code.
A simpler workaround is to reduce the x87 floating point precision. That seems to work, at least with this particular code, where there are no transcendental functions.
I'm not gonna do anything to fix this on my side, but applications can set the floating point flags themselves:
_controlfp(_PC_24, _MCW_PC);
Let me know if that works for you.
Original comment by cast...@gmail.com
on 19 Jun 2008 at 9:30
I've added a wiki page explaining the issue and the workaround:
http://code.google.com/p/nvidia-texture-tools/wiki/CompressionDifferences
Original comment by cast...@gmail.com
on 19 Jun 2008 at 9:50
Thanks Ignacio for tracking this one down. Much like you, I thought that the regular 32-bit version also only had SSE instructions in the implementation, but it seems that there were a few regular scalar floating point instructions.
No worries on my part, we're all 64 bit here :)
Original comment by jim.tila...@gmail.com
on 20 Jun 2008 at 5:46
Original issue reported on code.google.com by
jim.tila...@gmail.com
on 24 May 2008 at 12:54