MasonProtter / Bumper.jl

Bring Your Own Stack
MIT License

Fix typos and less cautious language #2

Closed PallHaraldsson closed 1 year ago

PallHaraldsson commented 1 year ago

The new, less "cautious" language (I had a typo) is just a proposal (you could redo the PR without the typo, or with better wording of your own). I looked at the code and it seemed absolutely safe for single-threaded use. At first it seemed absolutely unsafe for multi-threaded use, but then I noticed:

Every task has its own independent default buffer which inherit the size of their parent’s task buffer.

I don't know, does that mean it's safe for all multi-threaded use? Also note: I see on master (I think not yet merged) that you can add (C, foreign) threads at runtime.

I believe the "no throw" version is also safe, or at least it's just on you to make sure. And there's not really a good reason to scare people too much, since it's almost never needed: only (currently) for StaticCompiler.jl, and at least there (currently) there are no threads, so it's not a worry there, if at all.

PallHaraldsson commented 1 year ago

Also, it seems you do not allocate on the CPU stack (one option). It wouldn't actually be faster, I think, nor really safer, but it would mean you're not limited by one global size (or rather by the CPU stack, which might be smaller).

But it got me thinking: would you rather extend YOUR stack than throw? Julia is very good at growing arrays. It could be the default, with an opt-in to disallow it, or vice versa. And might it be better to start with a non-zero size?

MasonProtter commented 1 year ago

Thanks Pall! I'm overly cautious about the language mostly because Julia programmers are not used to having to reason about whether or not a pointer escapes a block, and I also don't yet have a great test suite to give me a lot of confidence that there isn't some glaring issue somewhere.

I don't know, does that mean it's safe for all multi-threaded use? Also note: I see on master (I think not yet merged) that you can add (C, foreign) threads at runtime.

At least currently, all multithreading involves the creation of a new Task. I think (but I'm not sure) that even if we had foreign thread support, we'd still be creating Tasks on those threads.

Also, it seems you do not allocate on the CPU stack (one option). It wouldn't actually be faster, I think, nor really safer, but it would mean you're not limited by one global size (or rather by the CPU stack, which might be smaller).

Yeah, I more or less made this because of how frustrating it is to dynamically allocate on the stack in Julia, and I figured this was a good alternative.

But it got me thinking: would you rather extend YOUR stack than throw? Julia is very good at growing arrays. It could be the default, with an opt-in to disallow it, or vice versa.

Good point! Originally, I was using statically sized AllocBuffers, but since I've switched to Vector, I might as well just support auto-resizing.

And might it be better to start with a non-zero size?

I'm not sure. Maybe that's the right thing to do. I'm not yet sure what a good default size would be, but I guess it matters less if we auto-resize.

PallHaraldsson commented 1 year ago

I reviewed the new code; it's good to see you took my suggestion. Starting with 1 MB seems ok if it's just the global buffer, but I'm not sure whether that also applies per Task; if so, maybe it's excessive?

I saw:

length(b.buf) + max(2sz, 1_000)

This seems too complex, and doubling is too much. You're never going to add only 1000 bytes to 1 MB anyway, because the lower level will resize by more. I think you could do: length(b.buf) + sz + sz>>2. You only need to grow by a (small) fraction. I think the point of your code is that this should rarely if ever happen anyway, and if the constant is a bit too small it doesn't matter much. Julia does something similar (less than 2x except at first), and you may issue some redundant resize! calls, which will be fast since the lower level already allocated extra.
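As a hypothetical side-by-side of the two heuristics being compared (the helper names here are made up for illustration, not part of Bumper.jl):

```julia
# Illustrative comparison of the two growth heuristics discussed above;
# `grow_current` / `grow_proposed` are made-up names, not Bumper.jl's API.
grow_current(buflen, sz)  = buflen + max(2sz, 1_000)   # existing heuristic: add 2x the request
grow_proposed(buflen, sz) = buflen + sz + (sz >> 2)    # proposal: add ~1.25x the request

# Growing a 1 MB buffer to satisfy a 100 kB allocation:
a = grow_current(1_000_000, 100_000)   # 1_200_000
b = grow_proposed(1_000_000, 100_000)  # 1_125_000
```

(Note that in Julia `>>` binds tighter than `+`, so `sz + sz>>2` already parses as `sz + (sz >> 2)`; the parentheses are just for readability.)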

I'm also not sure, but I think your code would work if you used it recursively. But might it then matter to not grow too fast?

MasonProtter commented 1 year ago

Yeah, I wasn't really sure what would be a good resize! heuristic. My concern is that people hitting resize! too often would suffer performance penalties, but I suppose people can always take the resize!ing into their own hands if they really want to avoid that penalty... I got the 2x from my recollection about how push! works, but I didn't realize it only doubled the first time.

Maybe I'll do sz + 2 * (sz>>2) or even sz + 4 * (sz>>2) so that many small resizes don't get penalized too badly, but we don't blow it up to huge sizes for bigger resizes. Any thoughts on that?

PallHaraldsson commented 1 year ago

Sorry, I meant it doubles the first few times (or rather, up to some size); I could dig it up.

sz + 2 * (sz>>2) would defeat the purpose of >>2. Python does >>1 if I recall, i.e. growing 1.5x, and that was considered too large a factor for Julia.

MasonProtter commented 1 year ago

Bleh, yeah of course. For some reason I wasn't thinking clearly and thought that x >> 2 was like sqrt(x), not x/4. Yeah, that's probably fine.

PallHaraldsson commented 1 year ago

Would you be opposed to having this package become part of Julia as a stdlib? It's of course not up to me...

I realize it's going to be registered soon, but I think that's not contradictory; it's likely a good step to get some adoption first and prove it's really helpful to many people/packages.

I want some other packages added too. I see you have a pattern matching package; there are many of those, and it's unclear to me which is best to add as the default one, or maybe with new keyword support. For this package, probably no changes would be needed: macros are best, and I think we couldn't rely on the compiler doing similar work (we don't want the CPU stack in general), nor are new keywords needed or preferred.

The criterion (one of them, or the main one) for inclusion is: is it useful for Julia itself? You often see allocations, e.g. when compiling code for the first time, and I have a feeling it would actually help there; I just haven't yet looked into where. Do you know if Julia shows all allocations? I suppose the optimizer, LLVM, might do some.

I suppose Julia COULD also use Match.jl, or yours, or: https://github.com/RelationalAI-oss/Rematch.jl

Do you know if the last one is the major one? Yours hasn't been updated in a while; does that simply mean it's very stable rather than outdated?

PallHaraldsson commented 1 year ago

And FYI, I don't regret the expansion idea if it actually works, but I'm thinking it might not...

What I have in mind is: some function uses the Bumper, it allocates, then calls another function which also allocates, in a way that actually requires the underlying buffer to be enlarged (which isn't always the case, so testing wouldn't uncover this bug reliably). I didn't review your code too carefully, but I see you used pointers. I'm thinking the actual array will be moved when it's enlarged, which might be ok for the inner function, but when it returns, I foresee the original function writing over where the array originally was, even though that memory is no longer in use by it (or worse, is in use by something new).

I would hate it if I induced you to ship a new buggy feature. If you're unsure, you could undo the change, at least for now, to buy some time, since it will be registered soon (and might already have users). The feature can be added later; it wouldn't be a breaking change (except for the amortized complexity, which you could document right away in that case).

I think it might still be possible to make this work, or it may already work if you thought about it. Can you at least confirm whether you did and it works? If you're unsure, I'll expand on how it might need fixing.

Another, minor issue is that allocation was O(1) and is now O(1) amortized, which is good enough for most uses (e.g. for Julia, in case it's adopted), but not for hard real-time. I just think the amortization may need to be documented. It's not like hard real-time is a big use-case of Julia (it also strictly needs a real-time OS, or such additional support for e.g. Linux, which I believe is available, just not the default). Those who want non-amortized behavior just have to plan ahead and allocate a large enough buffer (though it's not even clear that memory would be mapped in). Such users might prefer no resize! and instead an exception (though likely not in production, only while developing, to learn how large a buffer is needed; so the warnings you get might be enough).

Your pending registration is for a certain commit (and "Version: v0.2.0"), i.e. after the change I suggested. Is it too late to fix that somehow, in case it's needed? Since it's the first version to be registered, does that mean users can't get the older v0.1.0? I'm not sure; if that's the case then you could do a safe v0.3.0. I think there's always the loophole of checking out a specific commit (and while it's possible to yank versions, that loophole likely still applies).

MasonProtter commented 1 year ago

What I have in mind is: some function uses the Bumper, it allocates, then calls another function which also allocates, in a way that actually requires the underlying buffer to be enlarged (which isn't always the case, so testing wouldn't uncover this bug reliably). I didn't review your code too carefully, but I see you used pointers. I'm thinking the actual array will be moved when it's enlarged, which might be ok for the inner function, but when it returns, I foresee the original function writing over where the array originally was, even though that memory is no longer in use by it (or worse, is in use by something new).

Oops! I should have checked this, I thought that resize! would respect pointer identity but it doesn't! This is why I need to write more tests :sweat_smile:, good catch! I'll remove the misfeature now, and see if there's a way we can make it work reliably. Here's the crux of the problem:

julia> let v = rand(10)
           p1 = pointer(v)
           resize!(v, 100000)
           p1 === pointer(v)
       end
false

I don't see any easy ways around this, so for now I'll just undo it and forward the undone version for registration.

MasonProtter commented 1 year ago

Would you be opposed to having this package become part of Julia as a stdlib? It's of course not up to me...

I certainly wouldn't be opposed, but I don't see that happening. Probably instead we'll someday get better utilities for the language to automatically stack allocate non-escaping Vectors so that this isn't necessary at all.

The package would have to mature a lot and be battle tested before I even considered asking for it to be included in the language proper.

PallHaraldsson commented 1 year ago

I thought that resize! would respect pointer identity but it doesn't!

Exactly. I see two or three ways around this. a) Do you really need to use pointers? Could you use indices into the array instead? That seems good enough for Julia... and it's the solution here. The same goes for C++'s std::vector, which can also grow... unless, in either case, you take a pointer and hold it for too long.

b) Another possible way I see: if pointers are really needed, then keep the original allocations around. Say your enlargement policy is 2x: you keep your 1x + 2x = 3x, and next time, when you double the new 2x, you have 2x + 4x = 6x plus the original 1x, etc. So 7x instead of 4x, or 75% overhead. I believe the limit of this series is 2x total memory, which is not too bad; it's not 2x slower, just memory sitting around. If you start with 1 MB (or some other carefully chosen fixed value) you may never reach that overhead in many cases (though 1 MB is itself an overhead in many cases, likely largely unused). I think this also means a lower growth factor, e.g. 1.5x, will have more overhead, and conversely a 4x factor less.

c) On 64-bit CPUs, you should have plenty of address space. Just as the stack is mapped somewhere it can grow large without having to be moved, I suppose something similar could be done for a few arrays. I'm just not sure there's a portable way.
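The overhead claim in (b) can be checked with a toy calculation (this is just arithmetic for the discussion, not Bumper.jl code):

```julia
# Toy model of option (b): every outgrown buffer is kept alive forever.
# With growth factor f, retained dead memory / live memory tends to 1/(f - 1).
function retained_overhead(factor, steps)
    live, dead = 1.0, 0.0
    for _ in 1:steps
        dead += live      # the old buffer is kept, not freed
        live *= factor    # a new, larger buffer becomes the live one
    end
    return dead / live
end

retained_overhead(2.0, 30)   # ≈ 1.0: total memory is about 2x the live buffer
retained_overhead(1.5, 30)   # ≈ 2.0: a smaller factor retains relatively more
```

This matches the comment above: doubling converges to 2x total memory, while a 1.5x policy converges to 3x total (1/(1.5 - 1) = 2x dead on top of the live buffer).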

Whatever you do, even if you don't move the buffer, you do "allocate" within your array; don't forget that alignment is important, which I also mentioned in the other (still open) issue.
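Option (a) might look roughly like this (a hypothetical sketch, not Bumper.jl's actual design; all names here are made up):

```julia
# Hypothetical index-based bump allocator: handing out index ranges instead
# of raw pointers means a resize! (which may move the memory) cannot
# invalidate what callers hold.
mutable struct IndexBuffer
    buf::Vector{UInt8}
    used::Int
end
IndexBuffer(n::Integer) = IndexBuffer(Vector{UInt8}(undef, n), 0)

function alloc_range!(b::IndexBuffer, nbytes::Integer)
    start = b.used + 1
    stop  = b.used + nbytes
    # Safe to resize here precisely because callers hold indices, not pointers:
    stop > length(b.buf) && resize!(b.buf, max(2 * length(b.buf), stop))
    b.used = stop
    return start:stop
end

b = IndexBuffer(16)
r = alloc_range!(b, 32)   # forces a resize; the returned range stays valid
v = view(b.buf, r)        # a window into the (possibly moved) buffer
```

The cost, as noted later in the thread, is that every access goes through an index or view rather than a raw pointer load.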

PallHaraldsson commented 1 year ago

Probably instead we'll someday get better utilities for the language to automatically stack allocate non-escaping Vectors so that this isn't necessary at all.

I doubt that (not for Vectors, only structs).

I would really like to see alloca (and I know @elrod would too), but it's only helpful for smaller allocations.

"The default Linux stack size per thread is 10MB, where as, on Windows the default is 1MB."

Historically, CPU stacks were smaller (I recall 256 bytes...), or non-existent... The default size (for new operating systems or versions) may grow over time, who knows, but you must not depend on more than the lowest common denominator.

I know that at least on Linux you can ask for a larger stack, but you don't want to make the user do that. Or maybe Julia could do it at runtime? Anyway, I would think such a thing is system-dependent, if possible at all.

Plus, on the CPU stack you would have the overhead from b) above, e.g. 2x if growable (and if not growable, it might as well be on the heap).

I see on Wikipedia:

A safer version of alloca called _malloca, which reports errors, exists on Microsoft Windows. It requires the use of _freea.[6] [..] In addition, since the C version C99 (optional since C11), it is possible to create an array on the stack within a function, automatically, known as an auto VLA (variable length array).[9]

There's presumably some reason VLAs were demoted to optional in the standard as of C11.

MasonProtter commented 1 year ago

Could you rather use indices into the array?

I've tried before with a design that uses reinterpret and view over an array instead of pointer loading, but it's significantly slower than the current design.
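For illustration, the index-based alternative being referred to could look something like this (an assumption about the design for the sake of the example, not the actual Bumper.jl code):

```julia
# Illustrative only: reading a Float64 out of a byte buffer two ways.
buf = zeros(UInt8, 64)

# Index-based: reinterpret + view stays in safe, bounds-checked code,
# but adds wrapper overhead on every load/store.
reinterpret(Float64, view(buf, 1:8))[1] = 3.5

# Pointer-based (the current design, per the discussion): a fast raw load,
# but the pointer is invalidated if buf is ever resized and moved.
x = GC.@preserve buf unsafe_load(convert(Ptr{Float64}, pointer(buf)))
```

Here `x` reads back the 3.5 written through the reinterpreted view; the trade-off is speed (raw pointer) versus safety under `resize!` (indices/views).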