cynthia / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

[patch] Use intrinsic InterlockedXxx funcs #353

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

We can use intrinsic functions for all InterlockedXxx funcs in 
atomicops-internals-x86-msvc.h.
intrinsic Interlock functions are about 30% faster than non intrinsic version.

Yoni,

Original issue reported on code.google.com by jontra@gmail.com on 28 Jun 2011 at 12:12

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for the suggestion!  But I'm confused; in my cursory research, I saw 
this:
   http://msdn.microsoft.com/en-us/library/ms683609%28v=vs.85%29.aspx
   "This function is implemented using a compiler intrinsic where possible. For more information, see the Winbase.h header file and _InterlockedExchangePointer."

That's what I would expect: in places where the intrinsic could be used, a 
windows header file would use it automagically.  Is that not happening here?  
If not, why not?

Original comment by csilv...@gmail.com on 28 Jun 2011 at 8:10

GoogleCodeExporter commented 9 years ago
Hi,

Well, with Microsoft, you don't always get what you expect.
Read winbase.h and look for X86 InterlockXxx definitions (line 2254nnnnnn).
Another way is to look at the generated assembly - I did that, and the call is 
NOT intrinsic.
My patch solve it for all architectures.

Yoni.

Original comment by jontra@gmail.com on 29 Jun 2011 at 5:09

GoogleCodeExporter commented 9 years ago
I looked at winbase.h, and it looks like it uses a compiler intrinsic for 
64-bit architectures but not 32-bit.  I guess what I'm wondering is: does this 
mean the intrinsic is not safe for 32-bit architectures (perhaps just 
sometimes, or in rare situations)?  I'd like to understand more why winbase.h 
is written the way it is, and why it's safe to do something different from it 
here.

Original comment by csilv...@gmail.com on 29 Jun 2011 at 5:31

GoogleCodeExporter commented 9 years ago
it is perfectly safe to call intrinsic functions on x86.
InterlockXxx implementation does eventually "lock xadd ...", but we have the 
overhead of calling a function (push param, call func, push ebp, get param, 
ret...) - I counted 13 assembly directives.
When we use the intrinsic version, it just prepare the register and do "lock 
xadd..." (3 directives).
Same call - less work.
Look at the assembly and see for yourself.

Original comment by jontra@gmail.com on 29 Jun 2011 at 5:53

GoogleCodeExporter commented 9 years ago
} Look at the assembly and see for yourself.

I'm not a windows programmer and don't have interest in becoming one.  I know 
it seems self-evident to you that this patch is safe, but it's not to me.  Can 
you point me to the InterlockXxx source code where it does a lock xadd?  Are 
you confident that's always what it does in all situations?

What I'm getting at is: what is the reason that winbase.h doesn't use the 
intrinsic on x86?  Maybe it's a bad reason and we can ignore it.  But I want to 
at least know what the reason is before I decide to sidestep what the winbase.h 
authors decided was safe.

Original comment by csilv...@gmail.com on 29 Jun 2011 at 6:09

GoogleCodeExporter commented 9 years ago
In earlier versions of the CL compiler they did not have an intrinsic 
implementation for the InterlockedXXX functions. So instead of use asm(), then 
called a function in kernel32.dll. Later on when they added it to CL, they 
didn't update winbase.h accordingly for 32bit.

here is InterlockedIncrement impl in kernel32.dll:
75A0BBC0  mov         edi,edi 
75A0BBC2  push        ebp  
75A0BBC3  mov         ebp,esp 
75A0BBC5  pop         ebp  
75A0BBC6  jmp         75A0BB10 
75A0BB10  ecx,dword ptr [esp+4] 
75A0BB14  mov         eax,1 
75A0BB19  lock xadd   dword ptr [ecx],eax 
75A0BB1D  inc         eax  
75A0BB1E  ret         4    

and this is the intrinsic version:
009D19FB  lea         eax,[i] 
009D19FE  mov         ecx,1 
009D1A03  lock xadd   dword ptr [eax],ecx 

As you can see, they are identical, except for function call overhead.

Original comment by jontra@gmail.com on 29 Jun 2011 at 7:40

GoogleCodeExporter commented 9 years ago
OK, thanks for the explanation.  This still seems weird to me, but I don't know 
much about windows development, so I'll just assume this is the kind of thing 
that happens.  I'll look to get this into the next release.

Original comment by csilv...@gmail.com on 29 Jun 2011 at 7:44

GoogleCodeExporter commented 9 years ago
> This still seems weird to me, but I don't know much about windows development

It's not weird, it's just regular Microsoft behavior :-)
Trust me, I have seen much weirder stuff on win32.

> I'll look to get this into the next release.

Great!

Original comment by jontra@gmail.com on 29 Jun 2011 at 8:10

GoogleCodeExporter commented 9 years ago
I looked to try this patch, and ran into a problem compiling on MSVC 7.1 (which 
we still try to support in perftools) -- it doesn't seem to have an intrin.h.  
I guess I'm happy to go with the slower routines on older compilers.  Is the 
patch you propose good for MSVC 8.0 and later, do you know?  Also, what's the 
proper _MSC_VER check for that?

Original comment by csilv...@gmail.com on 8 Jul 2011 at 5:06

GoogleCodeExporter commented 9 years ago
I looked it up: _MSC_VER should be >= 1400, I think.  I'll just add a check for 
that.

My next problem: #including <intrin.h> to lead to compile problems:
   http://connect.microsoft.com/VisualStudio/feedback/details/262047

I definitely have <windows.h> in the compilation units here, so I don't see a 
good way of avoiding this problem.  The way I've seen others handle this is to 
just forward-declare the intrinsics they use themselves.  This depends on the 
API being stable for these between compiler versions.  Do you know if that's 
the case?  Is this safe to rely on?

Original comment by csilv...@gmail.com on 8 Jul 2011 at 10:53

GoogleCodeExporter commented 9 years ago
I'll make a patch that only uses intrinsics where I can get it to work.  It 
should be most places, in practice.

Original comment by csilv...@gmail.com on 12 Jul 2011 at 1:06

GoogleCodeExporter commented 9 years ago
Great :-)

Original comment by jontra@gmail.com on 12 Jul 2011 at 8:19

GoogleCodeExporter commented 9 years ago
Should be done in perftools 1.8, just released.

Original comment by csilv...@gmail.com on 16 Jul 2011 at 1:17