d-mozulyov / BrainMM

Extremely fast memory manager for Delphi
http://www.sql.ru/forum/1213139/ekstremalno-bystryy-menedzher-pamyati-brainmm
MIT License
96 stars, 16 forks

Code optimization #1

Closed - JBontes closed this issue 8 years ago

JBontes commented 8 years ago

Hi Dmitry,

I heard about your excellent work on Google+ Delphi.
I'd like to suggest a couple of tweaks, see below.

I've only touched the x64 code, except where otherwise stated.

Kindly let me know what you think.

Regards,

-- Johan Bontes

d-mozulyov commented 8 years ago

Hello, Johan!

It's great to see your pull request to my memory manager project! Sorry for my English - it is not very good yet :)

Right now I'm working on new Medium-piece management; the old one is too slow in many cases. It is a large block of new code, so I need a lot of time, hopefully a few weeks. I would like to see your optimization pull request again after that, when you can look over and tune the new Medium management code too.

But I want to comment and ask you about a few things.

I heard about your excellent work on Google+ Delphi.

May I see that page? :)

fixed jumps to naked ret: causes pipeline to be emptied on AMD wasting 25! cycles.

Do you mean ret --> rep ret? Where can I read about this feature?

Added FlushInstructionCache to secure self modifying code segments, see:

From the FlushInstructionCache function documentation: "FlushInstructionCache is not necessary on x86 or x64 CPU architectures as these have a transparent cache".

Changed bsf to rep bsf (aka TZCNT) which is much faster on processors that support it and reverts to plain bsf on processors that don't.

How can I detect whether the feature is supported by my CPU? I want to test a few cases and be sure it's a really great replacement.

Reordered some code to speed up on processors (ATOM) that do not support OoO execution.

I need to read about it. Please point me to a good article :)

Complex LEA takes 2 cycles, replaced it with a single cycle shl where applicable.

I thought a complex LEA is the full form: base, scale and offset - not base+scale / scale+offset / base+offset.

Added a faster version of ZeroMem than STOSQ. STOSx is rather slow except for very large fills.

I will need to refactor the AllocMem function later, because big/large allocations will take fresh (already zero-filled) Windows pages, so there will be no reason to zero the allocated memory again. Can I see some x86/x64 benchmarks of ZeroMem16? At what sizes and on which CPUs does it really pay off?

replaced a few repeated nops with multi-byte nops for the same reason.

You have replaced two nops right after a jmp instruction, so they are never executed anyway. But no problem :)

P.S. I hope that by the end of next week there will be a README.md in English.

JBontes commented 8 years ago

I heard about your excellent work on Google+ Delphi.

May I see that page? :)

https://plus.google.com/communities/103113685381486591754
More specifically: https://plus.google.com/107597312175322360727/posts/4y74MCr6esZ

fixed jumps to naked ret: causes pipeline to be emptied on AMD wasting 25! cycles.

Do you mean ret --> rep ret? Where can I read about this feature?

http://stackoverflow.com/questions/20526361/what-does-rep-ret-mean
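
A minimal, hypothetical illustration of the pattern (not code taken from BrainMM): the problem case is a conditional jump whose target is a bare one-byte RET, and the workaround is to pad the return to the two-byte encoding F3 C3, which the affected AMD branch predictors handle correctly and every other x86/x64 CPU treats as a plain RET. Routine name and shape are made up for the example.

```pascal
{ x64-only sketch; ".NOFRAME" plus the explicit RET keep the F3 prefix
  glued to the RET byte. }
procedure StoreIfAssigned(P: PPointer; Value: Pointer);
asm
  .NOFRAME
  { Win64 convention: P in RCX, Value in RDX }
  test rcx, rcx
  jz   @Exit            // the jump target below would otherwise be a bare RET
  mov  [rcx], rdx
@Exit:
  db   $F3              // REP prefix in front of the RET ...
  ret                   // ... together they encode "rep ret" (F3 C3)
end;
```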

Changed bsf to rep bsf (aka TZCNT) which is much faster on processors that support it and reverts to plain bsf on processors that don't.

How can I detect whether the feature is supported by my CPU? I want to test a few cases and be sure it's a really great replacement.

You don't have to detect: the "rep" will be ignored on older CPUs and they will just see a BSF/BSR.

Reordered some code to speed up on processors (ATOM) that do not support OoO execution.

I need to read about it. Please point me to a good article :)

I don't really have one, but: https://en.wikipedia.org/wiki/Out-of-order_execution
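
For what it's worth, a hypothetical before/after (not taken from the BrainMM sources) of the kind of reordering that matters on an in-order core; the routine and its name are invented purely for illustration.

```pascal
{ x64-only sketch.  The naive order
    mov rax, [rcx] / add rax, 1 / mov rdx, [rcx+8] / add rdx, 1
  stalls twice on an in-order Atom, because each ADD has to wait for the load
  just before it.  Starting both loads first overlaps their latency; an
  out-of-order CPU performs this reordering in hardware, an in-order one
  does not. }
function SumOfSuccessors(P: Pointer): NativeUInt;
asm
  { Win64 convention: P (pointing at two NativeUInt fields) in RCX, Result in RAX }
  mov rax, [rcx]        // start load #1
  mov rdx, [rcx + 8]    // start load #2 before load #1's value is consumed
  add rax, 1
  add rdx, 1
  add rax, rdx          // Result := (P[0] + 1) + (P[1] + 1)
end;
```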

Complex LEA takes 2 cycles, replaced it with a single cycle shl where applicable.

I thought a complex LEA is the full form: base, scale and offset - not base+scale / scale+offset / base+offset.

A complex LEA is: lea reg, [reg+reg+const], lea reg, [reg_scale+reg], or lea reg, [reg+reg_scale+const].
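
As a purely illustrative example (a hypothetical routine, not a verbatim BrainMM diff), the substitution mentioned in the pull request looks like this when source and destination are the same register:

```pascal
{ x64-only sketch: converts an element index into a byte offset for
  8-byte elements. }
function CellsToBytes(Index: NativeUInt): NativeUInt;
asm
  { Win64 convention: Index in RCX, Result in RAX }
  mov rax, rcx
  // was:  lea rax, [rax * 8]   (the scaled LEA form being discussed)
  shl rax, 3            // same result, a single-cycle simple shift
end;
```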

Added a faster version of ZeroMem than STOSQ. STOSx is rather slow except for very large fills.

I will need to refactor the AllocMem function later, because big/large allocations will take fresh (already zero-filled) Windows pages, so there will be no reason to zero the allocated memory again. Can I see some x86/x64 benchmarks of ZeroMem16? At what sizes and on which CPUs does it really pay off?

On my AMD K10 it runs about 50% faster than STOSQ, although for really large blocks the difference asymptotically goes to zero.
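
Roughly, the idea is wide aligned stores instead of the REP string machinery. A simplified sketch (hypothetical code, assuming the pointer is 16-byte aligned and the size is a non-zero multiple of 16, which the real ZeroMem16 does not require):

```pascal
{ x64-only sketch of a ZeroMem16-style fill; the real routine also handles
  alignment and small tails. }
procedure ZeroMem16Sketch(P: Pointer; Size: NativeUInt);
asm
  { Win64 convention: P in RCX, Size in RDX }
  pxor xmm0, xmm0            // 16 zero bytes
  add  rcx, rdx              // point one past the end ...
  neg  rdx                   // ... and count up from -Size to 0
@Loop:
  movdqa [rcx + rdx], xmm0   // one aligned 16-byte store per iteration
  add  rdx, 16
  jnz  @Loop
end;
```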

P.S. I hope that by the end of next week there will be a README.md in English.

Drop me a line, I'd be happy to proofread it.

d-mozulyov commented 8 years ago

You don't have to detect: the "rep" will be ignored on older CPUs and they will just see a BSF/BSR.

I mean I want to test the TZCNT instruction on my CPU, so I have to know whether the feature is supported on my CPU.
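
For reference, a hypothetical helper (not part of BrainMM) that answers the "is it supported?" part: TZCNT is advertised by the BMI1 flag, CPUID leaf 7 (sub-leaf 0), EBX bit 3. The sketch assumes the CPU reports at least CPUID leaf 7, which anything recent does; a production check would first query leaf 0 for the maximum supported leaf.

```pascal
function GetCpuid7Ebx: Cardinal;
{$IFDEF CPUX64}
asm
  mov r8, rbx          // RBX is non-volatile on Win64, preserve it
  mov eax, 7
  xor ecx, ecx         // leaf 7, sub-leaf 0: structured extended features
  cpuid
  mov eax, ebx         // feature bits; the function result is returned in EAX
  mov rbx, r8
end;
{$ELSE}
asm
  push ebx
  mov eax, 7
  xor ecx, ecx
  cpuid
  mov eax, ebx
  pop ebx
end;
{$ENDIF}

function SupportsTZCNT: Boolean;
begin
  // BMI1 = CPUID.(EAX=07H, ECX=0):EBX.bit 3; TZCNT is part of BMI1
  Result := (GetCpuid7Ebx and (1 shl 3)) <> 0;
end;
```

With a check like this in place, one could time a rep-prefixed BSF loop against a plain BSF loop only on machines where the flag is actually set.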

A complex LEA is: lea reg, [reg+reg+const], lea reg, [reg_scale+reg], or lea reg, [reg+reg_scale+const].

But not lea rcx, [rcx * 8], is it?

(ATOM) that do not support OoO execution... I don't really have one, but:

I meant specifically the ATOM and OoO - and why your reordering will be faster.

On my AMD K10 it runs about 50% faster than STOSQ.

Do you have something like that for x86?

d-mozulyov commented 8 years ago

ENG: https://github.com/d-mozulyov/BrainMM/blob/development/README.md

d-mozulyov commented 8 years ago

Hello :) Here is the first official stable release: https://github.com/d-mozulyov/BrainMM - waiting for your pull requests!