Closed GoogleCodeExporter closed 8 years ago
Hi Joe
Here is a potential scenario:
ARM4 is a CPU which likely requires data to be strictly aligned.
With such a CPU, any access through a (long*) at an address which is not a
multiple of 4 will either throw an exception or perform poorly.
This situation is handled automatically in the LZ4 source code, using
__attribute__ ((packed)).
Unfortunately, this attribute is only available with GCC, and you are using
Visual Studio.
However, the problem should occur in both the compression and decompression
functions. Have you tested compression on your ARM CPU?
A quick (manual) way to solve the issue in the short term would be to replace
#define LZ4_COPYSTEP(s,d) A32(d) = A32(s); d+=4; s+=4;
by
#define LZ4_COPYSTEP(s,d) *d++ = *s++; *d++ = *s++; *d++ = *s++; *d++ = *s++;
Would you mind testing it ?
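For illustration only, the byte-wise replacement above could be written as a function. This is a minimal sketch, not LZ4 code: the name `copystep_bytewise` is hypothetical, and `BYTE` is assumed to be an unsigned char. It copies four bytes one at a time and advances both pointers, so no 32-bit access is ever issued at an unaligned address.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;   /* assumed 8-bit byte type */

/* Hypothetical helper (not part of LZ4): copies 4 bytes one at a time and
   advances both pointers, like the byte-wise LZ4_COPYSTEP replacement.
   Single-byte accesses have no alignment requirement. */
static void copystep_bytewise(const BYTE **s, BYTE **d)
{
    size_t i;
    assert(s != NULL && d != NULL);
    for (i = 0; i < 4; i++) {
        *(*d)++ = *(*s)++;
    }
}
```

The trade-off is four byte moves instead of one 32-bit move, which is why it is only a fallback for CPUs that cannot do the unaligned 32-bit access.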
Regards
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 12:59
Your solution worked for handling the LZ4_WILDCOPY, but then broke again on
LZ4_READ_LITTLEENDIAN_16.
The compression code also fails, but the program is only decompressing data
sent by an application on a server, so I wasn't worried about that.
Thank you.
Original comment by joewoodb...@gmail.com
on 17 Feb 2012 at 3:50
OK, thanks for reporting; it helps pinpoint exactly what needs to be corrected.
As a quick solution, you may (manually) change the following line
#define LZ4_READ_LITTLEENDIAN_16(d,s,p) { d = (s) - A16(p); }
into
#define LZ4_READ_LITTLEENDIAN_16(d,s,p) { int delta = p[0]+(p[1]<<8); d = (s)-delta; }
I will also try to include these corrections into the main source code.
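As a rough sketch of what the replacement macro does, here is the same logic as two small functions. The names `read_le16` and `match_from_offset` are hypothetical and `BYTE` is assumed to be an unsigned char; none of this is taken from the LZ4 source. The 16-bit offset is assembled from two separate byte loads, so the address of `p` can have any alignment.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;   /* assumed 8-bit byte type */

/* Hypothetical helper: builds the 16-bit little-endian value from two
   individual byte loads, so p may have any alignment. */
static unsigned read_le16(const BYTE *p)
{
    assert(p != NULL);
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Mirrors the macro: the resulting pointer is the current position minus
   the offset stored at p. */
static const BYTE *match_from_offset(const BYTE *cur, const BYTE *p)
{
    return cur - read_le16(p);
}
```

Because the bytes are combined explicitly, this version also gives the little-endian result regardless of the CPU's own byte order.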
Regards
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 4:04
I came up with another solution. MSVC has the packed pragma, so I did the
following:
#if defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS))
#pragma pack(push)
#pragma pack(1)
#endif
typedef struct _U64_S
{
U64 v;
} _PACKED U64_S;
typedef struct _U32_S
{
U32 v;
} _PACKED U32_S;
typedef struct _U16_S
{
U16 v;
} _PACKED U16_S;
#if defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS))
#pragma pack(pop)
#endif
To keep the aligned stuff in one place, I modified the code as follows:
#if (defined(__GNUC__) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS)))
#define _PACKED __attribute__ ((packed))
#define _START_PACKING
#define _END_PACKING
#elif (defined(_MSC_VER) && defined(_ARM_) && (!defined(LZ4_FORCE_UNALIGNED_ACCESS)))
#define _PACKED
#define _START_PACKING __pragma( pack(push, 1) )
#define _END_PACKING __pragma( pack(pop) )
#else
#define _PACKED
#define _START_PACKING
#define _END_PACKING
#endif
I then replaced the if/pragma/endif blocks above with _START_PACKING and
_END_PACKING.
(To be honest, I'm still not sure why this works. It would seem that you could
still get misaligned accesses.)
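A likely explanation, offered as a sketch rather than a statement about what MSVC actually generates here: packing a struct to 1 byte lowers its alignment requirement to 1, so the compiler can no longer assume that a pointer to it is 4-byte aligned, and on a strict-alignment target it is expected to emit byte loads or some other unaligned-safe sequence. The names `U32_packed`, `read_u32_packed` and `read_u32_memcpy` below are hypothetical, and `U32` is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int U32;   /* assumed to be 32 bits wide */

/* Declare a 1-byte-aligned wrapper around a 32-bit value.
   MSVC uses #pragma pack; GCC and Clang use the packed attribute. */
#if defined(_MSC_VER)
#  pragma pack(push, 1)
typedef struct { U32 v; } U32_packed;
#  pragma pack(pop)
#else
typedef struct { U32 v; } __attribute__((packed)) U32_packed;
#endif

/* Hypothetical helper: reads through the packed type, which relies on a
   compiler extension. The compiler must treat p as possibly unaligned. */
static U32 read_u32_packed(const void *p)
{
    assert(p != NULL);
    return ((const U32_packed *)p)->v;
}

/* Portable reference using memcpy, for comparison. */
static U32 read_u32_memcpy(const void *p)
{
    U32 v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both helpers should return the same value for any address; the difference is only in how the compiler is told that the address may be unaligned.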
Original comment by joewoodb...@gmail.com
on 17 Feb 2012 at 4:28
Thanks for the hint.
I'll look into your proposal.
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 5:40
I just ran both ideas on one of our CE devices. Your code was faster.
Original comment by joewoodb...@gmail.com
on 17 Feb 2012 at 6:25
There is an interesting entry on MSDN which might explain why :
http://msdn.microsoft.com/en-us/library/aa290049(v=vs.71).aspx
(notably fig.2)
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 9:01
Also : was the speed difference you mentioned significant ?
I like your proposed solution because it makes code maintenance much easier. On
the other hand, if the difference is large, it might be worth going the tricky
"manual optimization" route.
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 9:33
For information, you will find in the attached file a proposed update, rc56,
which features a correction for MSVC on strictly aligned CPUs, inspired by your
suggestion.
Please feel free to comment.
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 9:50
On an OMAP4430 600 MHz with the Visual Studio 2008 ARMV4 compiler set for speed
optimization, I decoded a compressed bitmap from 22 KB to 750 KB (these bitmaps
have a lot of white space.) I ran 11 iterations (11 so I could correlate it
with other tests) and ran the test several times. The "manual optimization"
took 0.097-0.111 seconds, the pragma route took 0.135-0.154 seconds.
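To put those timings side by side, here is a small sketch that computes the ratio between the two routes. The helper name `time_ratio` is hypothetical; the only inputs are the four timings quoted above.

```c
#include <assert.h>

/* Ratio of the slower time to the faster time (both in seconds). */
static double time_ratio(double slower, double faster)
{
    assert(faster > 0.0);
    return slower / faster;
}
```

Comparing best case with best case, time_ratio(0.135, 0.097), and worst case with worst case, time_ratio(0.154, 0.111), both come out at roughly 1.39, so the pragma route took about 39% longer than the "manual optimization" in this test.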
Original comment by joewoodb...@gmail.com
on 17 Feb 2012 at 10:46
OK, so it's significant.
Do you know if there is some kind of predefined macro which tells whether the
target CPU supports unaligned memory access or not?
Something like this will be needed to branch the code towards the "manual"
memory routines.
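One possible shape for such a switch, as a sketch under stated assumptions: `CPU_ALLOWS_UNALIGNED` and `copy4` are hypothetical names. The compiler macros tested are real (`__ARM_FEATURE_UNALIGNED` comes from the ARM C Language Extensions and is set by GCC and Clang when unaligned access is enabled; the others identify x86 targets), but the list is not exhaustive and I do not know of an equivalent for the Visual Studio ARM compiler, so that target falls into the conservative branch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL macro name. The list of tested compiler macros is an
   assumption and is not exhaustive. */
#if defined(__ARM_FEATURE_UNALIGNED) || \
    defined(__i386__) || defined(__x86_64__) || \
    defined(_M_IX86) || defined(_M_X64)
#  define CPU_ALLOWS_UNALIGNED 1
#else
#  define CPU_ALLOWS_UNALIGNED 0   /* unknown target: assume strict alignment */
#endif

/* Hypothetical helper: copies 4 bytes between arbitrarily aligned addresses. */
static void copy4(unsigned char *d, const unsigned char *s)
{
    assert(d != NULL && s != NULL);
#if CPU_ALLOWS_UNALIGNED
    memcpy(d, s, 4);   /* optimizing compilers often turn this into one 32-bit move */
#else
    d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];   /* byte-wise "manual" route */
#endif
}
```

Defaulting to the byte-wise branch when the target is unknown keeps the code correct everywhere, at the cost of speed on CPUs that were not recognised.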
In the meantime, please find another rc56 version, which is still based on the
same "packed" idea and should be a bit more generic (code is shorter, and
unaligned memory access can be "forced" as before).
Original comment by yann.col...@gmail.com
on 17 Feb 2012 at 10:58
Hi Joe
I currently plan to release r56 with the generic correction which applies to
both GCC and Visual, in order to get a clean starting point. There are also a
few other minor improvements that I want to release early.
Beyond that point, I understand that hand-made code for strict-align CPUs is
likely to improve performance on such systems. My only concern is that I need
a proper test environment to develop, verify and benchmark the code
modifications, which are going to be significant.
The closest thing to your use case I could find around is an old HTC phone
(Touch 3G) with Windows CE. My main issue, however, is that I don't know how to
set up a development environment like yours to build a C program for it. Is
there any tutorial to create one, or even a ready-to-use environment (such as a
VM) to download?
Regards
Original comment by yann.col...@gmail.com
on 20 Feb 2012 at 10:28
To develop for CE, you need Visual Studio 2005 or 2008 professional with smart
device support installed. For generic testing, you can generally use the Pocket
PC 2003 device which is CE 4.2 and pretty much the minimum platform these days.
Visual Studio comes with an emulator which is useful when first creating a CE
application. The ARM emulator seems fairly consistent with actual devices.
You can debug to the device with ActiveSync (in Windows 7, they call it the
Windows Mobile Device Center.) Either way, sometimes it just doesn't work right
and I just have to copy files to a device and run them.
Original comment by joewoodb...@gmail.com
on 20 Feb 2012 at 4:41
Thanks for the detailed advice.
It seems it's going to take some time to meet all these requirements;
nonetheless, I will try.
Regards
Original comment by yann.col...@gmail.com
on 21 Feb 2012 at 4:10
The compilation problem with Visual Studio for ARM is solved in r56, thanks to
the #pragma suggestion. I keep in my task list the need to optimise some
routines for improved speed on strict-aligned CPUs.
Original comment by yann.col...@gmail.com
on 21 Feb 2012 at 4:12
Original issue reported on code.google.com by
joewoodb...@gmail.com
on 17 Feb 2012 at 1:07