LZ4_uncompress throws exception with Windows CE

GoogleCodeExporter commented 8 years ago

I have compiled the latest code using Visual Studio 2005 and the generic CE 6.0 
SDK, as well as Pocket PC 2003 and a specific handheld device, all ARM4 based. 
_FORCE_SW_BITCOUNT is defined (though it generates the warning "unary minus 
operator applied to unsigned type, result still unsigned").

The LZ4_uncompress function throws an exception at LZ4_WILDCOPY (line 657 of 
lz4.c.)

Original issue reported on code.google.com by joewoodb...@gmail.com on 17 Feb 2012 at 1:07

GoogleCodeExporter commented 8 years ago

Hi Joe

Here is a potential scenario :

ARM4 is a CPU which likely requires data to be strictly aligned.
With such a CPU, any access through a (long*) at an address which is not a 
multiple of 4 will either throw an exception or perform sluggishly.

This situation is handled automatically in the LZ4 source code, using 
__attribute__ ((packed)).

Unfortunately, this attribute is only available with GCC, and you are using 
Visual Studio.

However, the problem should then occur in both the compression and the 
decompression functions. Have you tested compression on your ARM CPU?

A quick (manual) way to solve the issue in the short term would be to replace 
#define LZ4_COPYSTEP(s,d)   A32(d) = A32(s); d+=4; s+=4;
by
#define LZ4_COPYSTEP(s,d)   *d++ = *s++; *d++ = *s++; *d++ = *s++; *d++ = *s++; 
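
The byte-wise replacement above can be sketched outside the macro as follows. This is a minimal illustration, not code from lz4.c: `copy_step_bytes` is a hypothetical name, and the function simply copies four bytes one at a time so that no 32-bit load or store is ever issued at an unaligned address.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Hypothetical helper mirroring the byte-wise LZ4_COPYSTEP:
 * copies exactly 4 bytes using only byte accesses, then leaves
 * both pointers advanced by 4. Works at any address. */
static void copy_step_bytes(const BYTE **s, BYTE **d)
{
    assert(s && d && *s && *d);
    for (size_t i = 0; i < 4; i++) {
        *(*d)++ = *(*s)++;
    }
}
```

The cost is four byte loads and four byte stores instead of one word move, which is why this is a short-term workaround rather than a tuned routine.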

Would you mind testing it ?

Regards

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 12:59

GoogleCodeExporter commented 8 years ago

Your solution worked for handling the LZ4_WILDCOPY, but then broke again on 
LZ4_READ_LITTLEENDIAN_16.

The compression code also fails, but the program is only decompressing data 
sent by an application on a server, so I wasn't worried about that.

Thank you.

Original comment by joewoodb...@gmail.com on 17 Feb 2012 at 3:50

GoogleCodeExporter commented 8 years ago

OK, thanks for reporting; this helps pinpoint the actual cause to correct.

As a quick solution, you may (manually) change the following line
#define LZ4_READ_LITTLEENDIAN_16(d,s,p) { d = (s) - A16(p); }
into
#define LZ4_READ_LITTLEENDIAN_16(d,s,p) { int delta = p[0]+(p[1]<<8); d = (s)-delta; }
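
As a sketch of what the replacement macro does, the same logic can be written as two small functions. The names `read_le16` and `match_from_offset` are hypothetical and do not appear in lz4.c; the point is only that the 16-bit offset is assembled from two byte loads, so the address of the offset never needs to be 2-byte aligned.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Reads a 16-bit little-endian value from p using byte loads only,
 * so it works at any address and on either host byte order. */
static unsigned read_le16(const BYTE *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Hypothetical helper: given the current output position `op` and a
 * pointer `ip` to the stored 16-bit offset, returns the match start. */
static const BYTE *match_from_offset(const BYTE *op, const BYTE *ip)
{
    assert(op && ip);
    return op - read_le16(ip);
}
```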

I will also try to include these corrections into the main source code.

Regards

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 4:04

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 4:04

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

I came up with another solution. MSVC has the packed pragma, so I did the 
following:

#if defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS))
#pragma pack(push)
#pragma pack(1)
#endif

typedef struct _U64_S
{
    U64 v;
} _PACKED U64_S;

typedef struct _U32_S
{
    U32 v;
} _PACKED U32_S;

typedef struct _U16_S
{
    U16 v;
} _PACKED U16_S;

#if defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS))
#pragma pack(pop)
#endif

To keep the aligned stuff in one place, I modified the code as follows:

#if (defined(__GNUC__) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS)))
#define _PACKED __attribute__ ((packed))
#define _START_PACKING
#define _END_PACKING
#elif (defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS)))
#define _PACKED
#define _START_PACKING __pragma( pack(push, 1) )
#define _END_PACKING   __pragma( pack(pop) )
#else
#define _PACKED
#define _START_PACKING
#define _END_PACKING
#endif

I then replaced the if/pragma/endif blocks above with _START_PACKING and 
_END_PACKING

(To be honest, I'm still not sure why this works. It would seem that you could 
still get misaligned accesses.)
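
A likely explanation for why packing helps: a struct packed to 1 has an alignment requirement of 1, so the compiler can no longer assume its member sits on a natural boundary and, on strict-alignment targets, is expected to emit byte-wise loads and stores instead of a single word access. The sketch below illustrates the idea under some assumptions: `read_u32_packed` and `write_u32_packed` are hypothetical names, `U32` is assumed to be 32 bits wide, and `#pragma pack` is used because both MSVC and GCC accept it. The pointer cast mirrors the approach used in this thread; a `memcpy` would be the strictly conforming alternative.

```c
#include <assert.h>

typedef unsigned int U32;   /* assumed to be 32 bits wide */

#pragma pack(push, 1)
typedef struct { U32 v; } U32_S;   /* alignment requirement becomes 1 */
#pragma pack(pop)

/* Hypothetical helper: reads a U32 from any address through the packed
 * struct, leaving the choice of load instructions to the compiler. */
static U32 read_u32_packed(const void *p)
{
    assert(p);
    return ((const U32_S *)p)->v;
}

/* Hypothetical helper: the matching store through the packed struct. */
static void write_u32_packed(void *p, U32 value)
{
    assert(p);
    ((U32_S *)p)->v = value;
}
```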

Original comment by joewoodb...@gmail.com on 17 Feb 2012 at 4:28

GoogleCodeExporter commented 8 years ago

Thanks for the hint.
I'll look into your proposal.

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 5:40

GoogleCodeExporter commented 8 years ago

I just ran both ideas on one of our CE devices. Your code was faster.

Original comment by joewoodb...@gmail.com on 17 Feb 2012 at 6:25

GoogleCodeExporter commented 8 years ago

There is an interesting entry on MSDN which might explain why :
http://msdn.microsoft.com/en-us/library/aa290049(v=vs.71).aspx
(notably fig.2)

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 9:01

GoogleCodeExporter commented 8 years ago

Also: was the speed difference you mentioned significant?

I like your proposed solution because it makes code maintenance much easier. On 
the other hand, if the difference is large, it might be worth going the tricky 
"manual optimization" route.

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 9:33

GoogleCodeExporter commented 8 years ago

For information, you will find in the attached file a proposed update, rc56, 
which features a correction for MSVC on strictly aligned CPUs, inspired by 
your suggestion.

Please feel free to comment.

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 9:50

Attachments:

LZ4rc56.zip

GoogleCodeExporter commented 8 years ago

On a 600 MHz OMAP4430 with the Visual Studio 2008 ARMV4 compiler set for speed 
optimization, I decoded a compressed bitmap from 22 KB to 750 KB (these bitmaps 
have a lot of white space.) I ran 11 iterations (11 so I could correlate it 
with other tests) and ran the test several times. The "manual optimization" 
took 0.097-0.111 seconds, the pragma route took 0.135-0.154 seconds.

Original comment by joewoodb...@gmail.com on 17 Feb 2012 at 10:46

GoogleCodeExporter commented 8 years ago

OK, so it's significant.

Do you know if there is some kind of predefined macro which tells whether the 
target CPU supports unaligned memory access or not?
Something like this will be needed to branch the code towards the "manual" 
memory routines.
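
One way such a branch could look, as a sketch only: `LZ4_UNALIGNED_OK` and `copy4` are hypothetical names, not part of LZ4. The predefined macros tested are ones documented by compilers (`_M_IX86` and `_M_X64` for MSVC, `__i386__` and `__x86_64__` for GCC, and `__ARM_FEATURE_UNALIGNED` from the ARM C Language Extensions). Older toolchains may define none of them, in which case the byte-wise path is the safe default.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;

/* Hypothetical switch: 1 when the target is known to tolerate
 * unaligned word access, 0 (the safe default) otherwise. */
#if defined(_M_IX86) || defined(_M_X64) || defined(__i386__) || \
    defined(__x86_64__) || defined(__ARM_FEATURE_UNALIGNED)
#  define LZ4_UNALIGNED_OK 1
#else
#  define LZ4_UNALIGNED_OK 0
#endif

/* Copies 4 bytes between non-overlapping buffers at any alignment. */
static void copy4(BYTE *d, const BYTE *s)
{
    assert(d && s);
#if LZ4_UNALIGNED_OK
    memcpy(d, s, 4);   /* compilers typically lower this to one word move */
#else
    d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];   /* "manual" path */
#endif
}
```

Keeping the selection in a single macro means the rest of the code does not need to know which path was chosen.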

In the meantime, please find another rc56 version, which is still based on the 
same "packed" idea and should be a bit more generic (code is shorter, and 
unaligned memory access can be "forced" as before).

Original comment by yann.col...@gmail.com on 17 Feb 2012 at 10:58

Attachments:

LZ4rc56.zip

GoogleCodeExporter commented 8 years ago

Hi Joe

I currently plan to release r56 with the generic correction which applies to 
both GCC and Visual, in order to get a clean starting point. There are also a 
few other minor improvements that I want to release early.

Beyond that point, I understand that hand-made code for strict-align CPUs is 
likely to improve performance on such systems. My only concern is that I need 
a proper test environment to develop, verify and benchmark code modifications, 
which are going to be significant.

The closest thing to your use case I could find around is an old HTC phone 
(Touch 3G) with Windows CE. My main issue however is that I don't know how to 
set up a development environment like yours to build a C program for it. Is 
there any tutorial to create one, or even a ready-to-use environment (such as a 
VM) to download?

Regards

Original comment by yann.col...@gmail.com on 20 Feb 2012 at 10:28

GoogleCodeExporter commented 8 years ago

To develop for CE, you need Visual Studio 2005 or 2008 professional with smart 
device support installed. For generic testing, you can generally use the Pocket 
PC 2003 device which is CE 4.2 and pretty much the minimum platform these days. 
Visual Studio comes with an emulator which is useful when first creating a CE 
application. The ARM emulator seems fairly consistent with actual devices.

You can debug to the device with ActiveSync (in Windows 7, they call it the 
Windows Mobile Device Center.) Either way, sometimes it just doesn't work right 
and I just have to copy files to a device and run them.

Original comment by joewoodb...@gmail.com on 20 Feb 2012 at 4:41

GoogleCodeExporter commented 8 years ago

Thanks for the detailed advice.
It seems it is going to take some time to meet all these requirements; 
nonetheless I will try.

Regards

Original comment by yann.col...@gmail.com on 21 Feb 2012 at 4:10

GoogleCodeExporter commented 8 years ago

The compilation problem with Visual Studio for ARM is solved in r56, thanks to 
the #pragma suggestion. I am keeping on my task list the need to optimise some 
routines for improved speed on strict-aligned CPUs.

Original comment by yann.col...@gmail.com on 21 Feb 2012 at 4:12

Changed state: Fixed
