d-mozulyov / BrainMM

Extremely fast memory manager for Delphi
http://www.sql.ru/forum/1213139/ekstremalno-bystryy-menedzher-pamyati-brainmm
MIT License
96 stars, 16 forks

Code optimization #1

Closed - JBontes closed this issue 8 years ago

JBontes commented 8 years ago

Hi Dmitry,

I heard about your excellent work on Google+ Delphi.
I'd like to suggest a couple of tweaks, see below.

I've only touched the x64 code, except where otherwise stated.

Kindly let me know what you think.

Regards,

-- Johan Bontes

d-mozulyov commented 8 years ago

Hello, Johan!

It's great to see your pull request to my memory manager project! Sorry for my English - it is not very good yet :)

Right now I'm working on new Medium-piece management; the old one is too slow in many cases. It is a large block of new code, so I need a lot of time, hopefully a few weeks. I would like to see your optimization pull request again after that, when you can look over and tune the new Medium management code too.

But I want to comment and ask you about a few things.

I heard about your excellent work on Google+ Delphi.

May I see that page? :)

fixed jumps to naked ret: causes pipeline to be emptied on AMD wasting 25! cycles.

Do you mean ret --> rep ret? Where can I read about this feature?

Added FlushInstructionCache to secure self modifying code segments, see:

From the FlushInstructionCache function documentation: "FlushInstructionCache is not necessary on x86 or x64 CPU architectures as these have a transparent cache".

Changed bsf to rep bsf (aka TZCNT) which is much faster on processors that support it and reverts to plain bsf on processors that don't.

How can I detect whether the feature is supported by my CPU? I want to test a few cases and be sure it's a really great replacement.

Reordered some code to speed up on processors (ATOM) that do not support OoO execution.

I need to read about it. Please point me to a good article :)

Complex LEA takes 2 cycles, replaced it with a single cycle shl where applicable.

I thought a complex LEA is the full form: base, scale and offset - not base+scale / scale+offset / base+offset.

Added a faster version of ZeroMem than STOSQ. STOSx is rather slow except for very large fills.

I will need to refactor the AllocMem function later, because big/large allocations will take fresh (already zero-filled) Windows pages, so there will be no reason to zero the allocated memory again. Can I see some x86/x64 benchmarks of ZeroMem16? At what sizes and on which CPUs does it really pay off?

replaced a few repeated nops with multi-byte nops for the same reason.

You have replaced two nops right after a jmp instruction, so they are never executed anyway. But no problem :)

P.S. I hope that by the end of next week there will be a README.md in English.

JBontes commented 8 years ago

I heard about your excellent work on Google+ Delphi.

May I see that page? :)

https://plus.google.com/communities/103113685381486591754
More specifically: https://plus.google.com/107597312175322360727/posts/4y74MCr6esZ

fixed jumps to naked ret: causes pipeline to be emptied on AMD wasting 25! cycles.

Do you mean ret --> rep ret? Where can I read about this feature?

http://stackoverflow.com/questions/20526361/what-does-rep-ret-mean
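
A minimal, hypothetical illustration of the pattern (not code taken from BrainMM): the problem case is a conditional jump whose target is a bare one-byte RET, and the workaround is to pad the return to the two-byte encoding F3 C3, which the affected AMD branch predictors handle correctly and every other x86/x64 CPU treats as a plain RET. Routine name and shape are made up for the example.

```pascal
{ x64-only sketch; ".NOFRAME" plus the explicit RET keep the F3 prefix
  glued to the RET byte. }
procedure StoreIfAssigned(P: PPointer; Value: Pointer);
asm
  .NOFRAME
  { Win64 convention: P in RCX, Value in RDX }
  test rcx, rcx
  jz   @Exit            // the jump target below would otherwise be a bare RET
  mov  [rcx], rdx
@Exit:
  db   $F3              // REP prefix in front of the RET ...
  ret                   // ... together they encode "rep ret" (F3 C3)
end;
```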

Changed bsf to rep bsf (aka TZCNT) which is much faster on processors that support it and reverts to plain bsf on processors that don't.

How can I detect whether the feature is supported by my CPU? I want to test a few cases and be sure it's a really great replacement.

You don't have to detect: the "rep" will be ignored on older CPUs and they will just see a BSF/BSR.

Reordered some code to speed up on processors (ATOM) that do not support OoO execution.

I need to read about it. Please point me to a good article :)

I don't really have one, but: https://en.wikipedia.org/wiki/Out-of-order_execution
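
For what it's worth, a hypothetical before/after (not taken from the BrainMM sources) of the kind of reordering that matters on an in-order core; the routine and its name are invented purely for illustration.

```pascal
{ x64-only sketch.  The naive order
    mov rax, [rcx] / add rax, 1 / mov rdx, [rcx+8] / add rdx, 1
  stalls twice on an in-order Atom, because each ADD has to wait for the load
  just before it.  Starting both loads first overlaps their latency; an
  out-of-order CPU performs this reordering in hardware, an in-order one
  does not. }
function SumOfSuccessors(P: Pointer): NativeUInt;
asm
  { Win64 convention: P (pointing at two NativeUInt fields) in RCX, Result in RAX }
  mov rax, [rcx]        // start load #1
  mov rdx, [rcx + 8]    // start load #2 before load #1's value is consumed
  add rax, 1
  add rdx, 1
  add rax, rdx          // Result := (P[0] + 1) + (P[1] + 1)
end;
```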

Complex LEA takes 2 cycles, replaced it with a single cycle shl where applicable.

I thought a complex LEA is the full form: base, scale and offset - not base+scale / scale+offset / base+offset.

A complex LEA is: lea reg, [reg+reg+const], lea reg, [reg_scale+reg], or lea reg, [reg+reg_scale+const].
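
As a purely illustrative example (a hypothetical routine, not a verbatim BrainMM diff), the substitution mentioned in the pull request looks like this when source and destination are the same register:

```pascal
{ x64-only sketch: converts an element index into a byte offset for
  8-byte elements. }
function CellsToBytes(Index: NativeUInt): NativeUInt;
asm
  { Win64 convention: Index in RCX, Result in RAX }
  mov rax, rcx
  // was:  lea rax, [rax * 8]   (the scaled LEA form being discussed)
  shl rax, 3            // same result, a single-cycle simple shift
end;
```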

Added a faster version of ZeroMem than STOSQ. STOSx is rather slow except for very large fills.

I will need to refactor the AllocMem function later, because big/large allocations will take fresh (already zero-filled) Windows pages, so there will be no reason to zero the allocated memory again. Can I see some x86/x64 benchmarks of ZeroMem16? At what sizes and on which CPUs does it really pay off?

On my AMD K10 it runs about 50% faster than STOSQ, although for really large blocks the difference asymptotically goes to zero.
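
Roughly, the idea is wide aligned stores instead of the REP string machinery. A simplified sketch (hypothetical code, assuming the pointer is 16-byte aligned and the size is a non-zero multiple of 16, which the real ZeroMem16 does not require):

```pascal
{ x64-only sketch of a ZeroMem16-style fill; the real routine also handles
  alignment and small tails. }
procedure ZeroMem16Sketch(P: Pointer; Size: NativeUInt);
asm
  { Win64 convention: P in RCX, Size in RDX }
  pxor xmm0, xmm0            // 16 zero bytes
  add  rcx, rdx              // point one past the end ...
  neg  rdx                   // ... and count up from -Size to 0
@Loop:
  movdqa [rcx + rdx], xmm0   // one aligned 16-byte store per iteration
  add  rdx, 16
  jnz  @Loop
end;
```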

P.S. I hope that by the end of next week there will be a README.md in English.

Drop me a line, I'd be happy to proofread it.

d-mozulyov commented 8 years ago

You don't have to detect: the "rep" will be ignored on older CPUs and they will just see a BSF/BSR.

I mean I want to test the TZCNT instruction on my CPU, so I have to know whether the feature is supported on my CPU.
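
For reference, a hypothetical helper (not part of BrainMM) that answers the "is it supported?" part: TZCNT is advertised by the BMI1 flag, CPUID leaf 7 (sub-leaf 0), EBX bit 3. The sketch assumes the CPU reports at least CPUID leaf 7, which anything recent does; a production check would first query leaf 0 for the maximum supported leaf.

```pascal
function GetCpuid7Ebx: Cardinal;
{$IFDEF CPUX64}
asm
  mov r8, rbx          // RBX is non-volatile on Win64, preserve it
  mov eax, 7
  xor ecx, ecx         // leaf 7, sub-leaf 0: structured extended features
  cpuid
  mov eax, ebx         // feature bits; the function result is returned in EAX
  mov rbx, r8
end;
{$ELSE}
asm
  push ebx
  mov eax, 7
  xor ecx, ecx
  cpuid
  mov eax, ebx
  pop ebx
end;
{$ENDIF}

function SupportsTZCNT: Boolean;
begin
  // BMI1 = CPUID.(EAX=07H, ECX=0):EBX.bit 3; TZCNT is part of BMI1
  Result := (GetCpuid7Ebx and (1 shl 3)) <> 0;
end;
```

With a check like this in place, one could time a rep-prefixed BSF loop against a plain BSF loop only on machines where the flag is actually set.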

A complex LEA is: lea reg, [reg+reg+const], lea reg, [reg_scale+reg], or lea reg, [reg+reg_scale+const].

But not lea rcx, [rcx * 8], is it?

(ATOM) that do not support OoO execution... I don't really have one, but:

I meant specifically the ATOM and OoO - and why your reordering will be faster.

On my AMD K10 it runs about 50% faster than STOSQ.

Do you have something like that for x86?

d-mozulyov commented 8 years ago

ENG: https://github.com/d-mozulyov/BrainMM/blob/development/README.md

d-mozulyov commented 8 years ago

Hello :) Here is the first official stable release: https://github.com/d-mozulyov/BrainMM - waiting for your pull requests!