Closed GoogleCodeExporter closed 8 years ago
Thanks, that's great find. I'll look into it.
You seem to find new optimisations with a relative ease...
Original comment by yann.col...@gmail.com
on 6 Dec 2012 at 12:56
Enhancement request.
Original comment by yann.col...@gmail.com
on 6 Dec 2012 at 1:34
mmmh, this one is less successfull.
No gain with VS2008.
No loss either. It's really equivalent.
I will test more configurations to see if it gets better or worse on different
compilers / OS.
Original comment by yann.col...@gmail.com
on 7 Dec 2012 at 9:46
VS 2010 : gain of about 0.5%.
This is very small, almost within margin error.
Still, no loss either.
Original comment by yann.col...@gmail.com
on 7 Dec 2012 at 10:09
OK, i checked the Assembler output,
and both codes are translated the same on VS2010.
They are in fact identical.
Your version is closer to the assembler result :
000eb 0f b6 10 movzx edx, BYTE PTR [eax]
000ee 88 11 mov BYTE PTR [ecx], dl
000f0 0f b6 50 01 movzx edx, BYTE PTR [eax+1]
000f4 88 51 01 mov BYTE PTR [ecx+1], dl
000f7 0f b6 50 02 movzx edx, BYTE PTR [eax+2]
000fb 88 51 02 mov BYTE PTR [ecx+2], dl
000fe 0f b6 50 03 movzx edx, BYTE PTR [eax+3]
00102 88 51 03 mov BYTE PTR [ecx+3], dl
This picked my curiosity, and i tried to reduce the number of instructions,
by merging the final addition into the table.
Hence, instead of doing :
op[3] = ref[3];
op += 4, ref += 4;
ref -= dec[op-ref];
the code would become :
op[3] = ref[3];
ref += inc32[op-ref];
op += 4;
I checked the assembler output, and indeed, the second version is much shorter :
Version 1 :
000fe 0f b6 50 03 movzx edx, BYTE PTR [eax+3]
00102 88 51 03 mov BYTE PTR [ecx+3], dl
00105 83 c1 04 add ecx, 4
00108 83 c0 04 add eax, 4
0010b 8b d1 mov edx, ecx
0010d 2b d0 sub edx, eax
0010f 2b 44 95 dc sub eax, DWORD PTR _dec$[ebp+edx*4]
00113 eb 0a jmp SHORT $LN12@LZ4_uncomp@2
Version 2 :
00101 0f b6 58 03 movzx ebx, BYTE PTR [eax+3]
00105 03 44 bd dc add eax, DWORD PTR _inc32$[ebp+edi*4]
00109 88 59 03 mov BYTE PTR [ecx+3], bl
0010c 83 c1 04 add ecx, 4
0010f eb 0a jmp SHORT $LN12@LZ4_uncomp@2
In spite of the much reduced instruction sequence, version 2 is in fact slower
than v1, by a sizable amount (~3%).
Well, sometimes, this is almost mystery...
Original comment by yann.col...@gmail.com
on 8 Dec 2012 at 12:11
I use VS2012, it is not the same for me. With postincrement:
*op++ = *ref++;
004021FA mov al,byte ptr [edx]
004021FC mov byte ptr [ecx],al
004021FE lea eax,[ecx+1]
*op++ = *ref++;
00402201 movzx ecx,byte ptr [edx+1]
00402205 mov byte ptr [eax],cl
*op++ = *ref++;
00402207 movzx ecx,byte ptr [edx+2]
0040220B mov byte ptr [eax+1],cl
*op++ = *ref++;
0040220E movzx ecx,byte ptr [edx+3]
00402212 mov byte ptr [eax+2],cl
00402215 add eax,3
00402218 add edx,4
Without postincrement:
op[0] = ref[0];
004021FA movzx eax,byte ptr [edx]
004021FD mov byte ptr [ecx],al
op[1] = ref[1];
004021FF movzx eax,byte ptr [edx+1]
00402203 mov byte ptr [ecx+1],al
op[2] = ref[2];
00402206 movzx eax,byte ptr [edx+2]
0040220A mov byte ptr [ecx+2],al
op[3] = ref[3];
0040220D movzx eax,byte ptr [edx+3]
op[3] = ref[3];
00402211 mov byte ptr [ecx+3],al
op += 4, ref += 4;
00402214 lea eax,[ecx+4]
00402217 add edx,4
Without postincrement has one less instruction.
Original comment by strig...@gmail.com
on 8 Dec 2012 at 12:14
I agree it's hard to tell a difference in speed.
Original comment by strig...@gmail.com
on 8 Dec 2012 at 12:20
Well, since your version seems at least equivalent, if not sometimes better,
there's no problem in using it within LZ4.
I'm still puzzled as to why the tested "second version" is actually slower,
despite of it being shorter.
I also note that the reduced number of instructions partly comes because v1
needs to recalculate "op-ref" (within eax) :
0010b 8b d1 mov edx, ecx
0010d 2b d0 sub edx, eax
0010f 2b 44 95 dc sub eax, DWORD PTR _dec$[ebp+edx*4]
While for v2, the result seems "ready to use" (within edi):
00105 03 44 bd dc add eax, DWORD PTR _inc32$[ebp+edi*4]
Original comment by yann.col...@gmail.com
on 8 Dec 2012 at 12:28
Enhancement integrated into r85.
Original comment by yann.col...@gmail.com
on 11 Dec 2012 at 1:41
Original issue reported on code.google.com by
strig...@gmail.com
on 5 Dec 2012 at 7:21