Allocation of non-zeroed memory for performance boost

straight-shoota commented 2 weeks ago

The default malloc implementation GC.malloc is supposed to zero the allocated memory. This is a sane default choice because it ensures a clean starting field. The GC won't risk incorrectly assuming uninitialized memory to contain live pointers.

But zeroing out memory has a runtime cost. And it's often useless because the memory gets fully initialized and all zeroed bits are overridden right away (e.g: https://github.com/crystal-lang/crystal/pull/14671#discussion_r1632103195).

Performance could be improved if there was a way to avoid zeroing when not necessary.

A potential implementation is proposed in #14679

ysbaddaden commented 2 weeks ago

As stated in #14679:

I'd like to avoid introducing calloc(size) when the C function is calloc(num, size). I understand we don't care about the num arg, but that's introducing a discrepancy over a weighted name, and I'd prefer a self explanatory name for the "memory is set to zero" guarantee (for non C programmers).

I'm all in for a single GC.malloc(size, *, atomic = false, clear = true) method (and an new __crystal_malloc3(size, atomic, clear) fun). I mean, atomic is just a flag in my immix GC and it never sets memory to zero because crystal always does (when it matters).

ysbaddaden commented 2 weeks ago

Anyway, having a mean to say "allocate and set to zero" and "allocate and don't set to zero" is needed for Crystal to tell it's intent to the allocator's interface, since each allocator has different guarantees:

C: never zeroes (must use calloc);
Boehm GC: GC_malloc always zeroes, GC_malloc_atomic never;
Immix GC: never zeroes (DIY) but I could add an arg to do it at the most opportune moment.

HertzDevil commented 2 weeks ago

If we bypass the CRT and use Win32's HeapAlloc then there is also an optional argument specifying whether the memory should be cleared or not. On Windows LibC.malloc is only used in -Dgc_none and in the wmain entry point though

beta-ziliani commented 2 weeks ago

I'm all in for a single GC.malloc(size, *, atomic = false, clear = true) method (and an new __crystal_malloc3(size, atomic, clear) fun)

We can't really make it just one function, because that will break backwards compatibility. But yes, we can have this generic version and keep the "atomic" version deprecated.

straight-shoota commented 2 weeks ago

@beta-ziliani Backwards compatibility is trivial. The compiler knows which functions (or methods) are defined and can make calls appropriately.

BlobCodes commented 2 weeks ago

I'm all in for a single GC.malloc(size, *, atomic = false, clear = true) method (and an new __crystal_malloc3(size, atomic, clear) fun).

Ideally, as a follow-up PR to #14679, I wanted to further optimize the allocation methods using LLVM's allockind function attribute. Using this attribute, we can for example tell LLVM that the memory returned from an allocation function is always zeroed out - which allows LLVM to optimize most default ivar values away (@x = nil, @y = 0).

If we only have one allocation function with lots of parameters, we can't fully specify the allocation behaviour to LLVM and can't fully optimize the code.

I'd like to avoid introducing calloc(size) when the C function is calloc(num, size).

We could also make the default malloc methods always clean their memory and add new malloc_dirty methods (or something similar).

ysbaddaden commented 2 weeks ago

@BlobCodes oh, allockind looks very nice :+1:

ysbaddaden commented 2 weeks ago

I investigated memory allocations in codegen, and there are only a few calls:

Codegen#malloc is called by Codegen#malloc_closure and Codegen#malloc_aggregate.
Codegen#malloc_atomic is only called from Codegen#allocate_aggregate.
Codegen#array_malloc and Codegen#array_malloc_atomic are only used to implement Pointer.malloc.

Of these use cases:

Codegen#malloc_closure: I don't know if it needs the memory to be initialized to zero (it currently does or doesn't because of #14678) => I believe this is part of this issue to determine this;
Codegen#allocate_aggregate: it needs the memory to be initialized to zero (and may do it twice => :bug:).
Pointer(T).malloc: the memory must be initialized to zero (whatever if T has inner pointers)... ~~and here comes another bug: #malloc_atomic currently returns uninitialized memory => :bug:~~ (nope, not a bug).

So I don't think we have a need to distinguish a "please zero memory" vs "don't zero memory". Having the four Codegen#[array_]malloc[_atomic] methods always return initialized memory feels like it would be enough.

~~Then a check in #malloc_aggregate to avoid zeroing twice to fix this issue.~~

BlobCodes commented 2 weeks ago

and there's a bug since #malloc_atomic currently returns uninitialized memory!

https://github.com/crystal-lang/crystal/blob/835b0c7ec5945c48ddd65f856cd81338ccbe7f65/src/compiler/crystal/codegen/codegen.cr#L2174

Then a check in #malloc_aggregate to avoid zeroing twice to fix this issue.

That would fix #14677, but not this issue. This issue is about completely removing the zeroing in situations where all memory is overwritten anyway.

Also, we can't just add a check in malloc_aggregate and array_malloc to avoid the unnecessary zeroing since old stdlibs (<= 1.12.2) have an invalid GC.malloc implementation which doesn't clear the memory when compiling with -Dgc_none. This is also the reason I chose to add completely separate internal allocation functions which actually guarantee the memory is cleared, manually clearing it in codegen if unavailable.

ysbaddaden commented 2 weeks ago

Urgh, sorry. I mixed up the issues. I'll move my comment :bow:

Note: I fail to see the older stdlib problem: a 1.12 or older compiler using the stdlib from 1.13+ (for example to compile Crystal 1.13 itself) will use the __crystal_malloc64 fun and GC.malloc def from the 1.13+ stdlib that would now zero the memory. The worst to happen is the same as introducing new funs: the first time we compile Crystal 1.13 using Crystal 1.12, the compiler itself will keep zeroing the memory twice (but not the programs it compiles), and the compiler should recompile itself.

It would be an issue if using a 1.13 compiler with an older stdlib, but I would be very surprised if we supported that. We'd have the same issue with new funs that would end up calling the fallback LibC functions instead of GC methods :fearful: Definitely not supported.

straight-shoota commented 1 week ago

It would be an issue if using a 1.13 compiler with an older stdlib, but I would be very surprised if we supported that. We'd have the same issue with new funs that would end up calling the fallback LibC functions instead of GC methods :fearful: Definitely not supported.

Yes, this is indeed a relevant use case. I suppose it's mostly for compiler/stdlib development. You can go back and build old versions with the current compiler. Of course it's not super important to ensure production-grade stability. But still worth considering. Currently we can go back and build Crystal 1.8 with a 1.12 compiler. Older versions fail due to incorrectly mangled intrinsics after https://github.com/crystal-lang/crystal/pull/13164

crystal-lang / crystal

Allocation of non-zeroed memory for performance boost #14687