dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Proposal: TryForSufficientStack method to support stackalloc usage #24963

Closed jkotas closed 4 years ago

jkotas commented 6 years ago

From @kkokosa on February 8, 2018 12:08

Due to changes in C# 7.2 and Span<T>, stackalloc usages like the following may become more and more popular:

Span<byte> span = stackalloc byte[1000];

However, this call will end up with an unrecoverable StackOverflowException if there is not enough stack space left. We can do nothing in such a situation, which makes this approach not useful at some point. Right now it is pure guesswork which stackalloc size may end up in such a catastrophe.

@jkotas pointed out in dotnet/corefx#14675 that RuntimeHelpers.EnsureSufficientExecutionStack is a reliable solution for handling stack overflow in general, but as MSDN says, this method "ensures that the remaining stack space is large enough to execute the average .NET Framework function". Probing for an average .NET Framework function is not very helpful here, as a stackalloc makes the function decidedly not average.

I propose to add a new helper method which gives at least some clue whether our stackalloc may end up with a StackOverflowException:

public static bool RuntimeHelpers.TryForSufficientStack(long size)

I believe returning a bool instead of throwing an exception (like the InsufficientExecutionStackException thrown by the above method) is better, because stackalloc is most probably used on hot paths already and adding exception handling there is rather undesirable.
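
A hypothetical call site might look like the sketch below. TryForSufficientStack is the API proposed above (it does not exist today), and the heap fallback plus the Checksum/Sum names are purely illustrative:

using System;
using System.Runtime.CompilerServices;

static class Example
{
    public static int Checksum()
    {
        // Hypothetical: TryForSufficientStack is the API proposed in this issue.
        if (RuntimeHelpers.TryForSufficientStack(1000))
        {
            Span<byte> buffer = stackalloc byte[1000];
            return Sum(buffer);
        }
        else
        {
            Span<byte> buffer = new byte[1000]; // heap fallback when the stack is too shallow
            return Sum(buffer);
        }
    }

    private static int Sum(ReadOnlySpan<byte> buffer)
    {
        int total = 0;
        foreach (byte b in buffer)
            total += b;
        return total;
    }
}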

As far as I understand, this method seems to be quite simple in terms of implementation, as all the necessary data is already there. My naive implementation proposal:

FCIMPL1(FC_BOOL_RET, ReflectionInvocation::TryForSufficientStack, INT64 size)
{
    FCALL_CONTRACT;

    Thread *pThread = GetThread();

    // Approximate the current stack pointer with the address of a local variable
    // and compare it against the thread's cached stack limit plus the requested size.
    UINT_PTR current = reinterpret_cast<UINT_PTR>(&pThread);
    UINT_PTR limit = reinterpret_cast<UINT_PTR>(pThread->GetCachedStackLimit());

    FC_RETURN_BOOL(current >= (limit + size));
}
FCIMPLEND

PS. I am not sure whether the stack guard size should be taken into consideration here or not...

Copied from original issue: dotnet/coreclr#16277

svick commented 6 years ago

Should there be a way to use the new method to check for stack space for that "average function" without throwing?

stephentoub commented 6 years ago

FYI, there already is a public RuntimeHelpers.TryEnsureSufficientExecutionStack() exposed in .NET Core.

svick commented 6 years ago

there already is a public RuntimeHelpers.TryEnsureSufficientExecutionStack() exposed in .NET Core.

Then I think the method proposed here should be added as an overload of it.

kkokosa commented 6 years ago

there already is a public RuntimeHelpers.TryEnsureSufficientExecutionStack() exposed in .NET Core.

Then I think the method proposed here should be added as an overload of it.

Then probably a complementary EnsureSufficientExecutionStack overload (throwing InsufficientExecutionStackException) should be added as well?

jkotas commented 6 years ago

If the proposed EnsureSufficientExecutionStack is added, what is the typical usage going to look like?

Note that stackalloc should only be used for relatively small buffers (say, up to 1 kB). The array pool is more appropriate above this limit. The framework uses stackalloc in a number of places, and the typical pattern looks like this:

https://github.com/dotnet/coreclr/blob/efebb38f3c18425c57f94ff910a50e038d13c848/src/mscorlib/shared/System/Text/ValueStringBuilder.cs https://github.com/dotnet/coreclr/blob/efebb38f3c18425c57f94ff910a50e038d13c848/src/mscorlib/shared/System/Number.Formatting.cs#L297
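
For reference, here is a trimmed-down sketch of that shape; the ValueBuffer type below is an illustrative stand-in for the linked ValueStringBuilder, not the actual source. The caller stackallocs a small, fixed-size buffer, and the builder spills into the ArrayPool only when that buffer turns out to be too small:

using System;
using System.Buffers;

// Illustrative stand-in for the ValueStringBuilder pattern linked above.
ref struct ValueBuffer
{
    private char[] _arrayFromPool; // null while we are still using the caller's stack buffer
    private Span<char> _chars;
    private int _pos;

    public ValueBuffer(Span<char> initialBuffer)
    {
        _arrayFromPool = null;
        _chars = initialBuffer;
        _pos = 0;
    }

    public void Append(char c)
    {
        if (_pos == _chars.Length)
        {
            // Outgrew the initial buffer: spill into a pooled array.
            char[] newArray = ArrayPool<char>.Shared.Rent(Math.Max(_chars.Length * 2, 16));
            _chars.CopyTo(newArray);
            if (_arrayFromPool != null)
                ArrayPool<char>.Shared.Return(_arrayFromPool);
            _arrayFromPool = newArray;
            _chars = newArray;
        }
        _chars[_pos++] = c;
    }

    public string ToStringAndDispose()
    {
        string result = _chars.Slice(0, _pos).ToString();
        if (_arrayFromPool != null)
        {
            ArrayPool<char>.Shared.Return(_arrayFromPool);
            _arrayFromPool = null;
        }
        return result;
    }
}

static class Caller
{
    public static string Format(ReadOnlySpan<byte> digits)
    {
        // Small, constant-size stackalloc: safe regardless of the input size.
        Span<char> initialBuffer = stackalloc char[128];
        var sb = new ValueBuffer(initialBuffer);
        foreach (byte d in digits)
            sb.Append((char)('0' + d));
        return sb.ToStringAndDispose();
    }
}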

Would it be better to work on making this pattern easier instead?

kkokosa commented 6 years ago

For example, I was thinking about various scenarios where some temporary data collections have to be calculated. Instead of using an array or even the ArrayPool, we could utilize the stack with almost no time overhead:

public double Process(List<BigData> list)
{
    Span<DataStruct> preprocessedList = stackalloc DataStruct[list.Count];
    for (int i = 0; i < list.Count; ++i)
    {
        // Calculate preprocessedList[i] fields from list items
    }
    return ProcessData(preprocessedList); // implicit Span -> ReadOnlySpan conversion
}

But currently every usage of stackalloc above MinExecutionStackSize (from the Thread::SetStackLimits method) - which is 128 kB (or 64 kB on 32-bit) - may end up killing the entire app with a StackOverflowException.

I mean, I know quite a big size is already guaranteed by the EnsureSufficientExecutionStack call, but... wouldn't it be nice to have some more size-aware way of checking this? Currently, people using stackalloc may not even be aware of those deeply hidden, implementation-related limitations. In other words, deciding whether we are using "relatively small buffers" may be too subjective when writing a general-purpose library, and there is currently no way to ensure that stackalloc won't fail.

jkotas commented 6 years ago

So what would your code example look like if you had EnsureSufficientExecutionStack API?

jkotas commented 6 years ago

Instead of using an array or even ArrayPool, we could utilize stack with almost no time overhead:

Large stack buffers do have overhead: The stack memory has to be faulted in (one page at a time typically), and then it tends to be stuck in the working set.

kkokosa commented 6 years ago

So what would your code example look like if you had EnsureSufficientExecutionStack API?

Something like:

public double Process(List<BigData> list)
{
    if (RuntimeHelpers.TryEnsureSufficientExecutionStack(list.Count * PUT_HERE_YOUR_STRUCT_SIZE))
    {
        Span<DataStruct> preprocessedList = stackalloc DataStruct[list.Count];
        for (int i = 0; i < list.Count; ++i)
        {
            // Calculate preprocessedList[i] fields from list items
        }
        return ProcessData(preprocessedList);
    }
    else
    {
        // fallback to ArrayPool or other way of creating the preprocessed list
    }
}

Large stack buffers do have overhead: The stack memory has to be faulted in (one page at a time typically), and then it tends to be stuck in the working set.

Isn't it the same for faulting in memory from the ArrayPool? I mean, if Process is on a hot path and called multiple times, being stuck in the working set is some kind of overhead we should be aware of, but maybe it is worth it? If Process were a one-shot call, probably nobody would care about optimizing it in the first place.

Additionally, the point is that even with small buffers of 1 kB, TryEnsureSufficientExecutionStack says nothing about whether the remaining stack is sufficient for that 1 kB or only for some other amount.

jkotas commented 6 years ago

// fallback to ArrayPool or other way of creating preprocessed list

I do not think I would want to use this pattern, because it makes me duplicate the code.

Isn't the same for faulting in memory from ArrayPool?

No - if the buffer is reused on multiple threads, or different callers on the same thread.

There needs to be a best practice for allocating large buffers. If half of the folks are going to do it via the ArrayPool and the other half with large stack allocs, everybody loses in the end.

kkokosa commented 6 years ago

I do not think I would want to use this pattern, because it makes me duplicate the code.

OK, that was a poorly prepared example. I was thinking more about something like a "generic temporary collection factory", without repeating any processing logic:

public double Process(List<BigData> list)
{
    Span<SomeStruct> span;
    if (RuntimeHelpers.TryEnsureSufficientExecutionStack(list.Count * PUT_HERE_YOUR_STRUCT_SIZE))
    {
        span = stackalloc SomeStruct[list.Count];
    }
    else
    {
        span = ArrayPool<SomeStruct>.Shared.Rent(list.Count);
    }

    for (int i = 0; i < list.Count; ++i)
    {
        // Calculate span[i] fields from list items
    }
    return ProcessData(span);
}

However, this currently does not compile, failing with the error "A result of a stackalloc expression of type 'Span' cannot be used in this context because it may be exposed outside of the containing method". Is it supposed to be OK? Because https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.2/span-safety.md says that "A stackalloc expression is an rvalue that is safe-to-escape to the top-level scope of the method (but not from the entire method itself)".

There needs to be a best practice for allocating large buffers. If half of the folks are going to do it via the ArrayPool and the other half with large stack allocs, everybody loses in the end.

Here I fully agree and I see your point. What still bothers me is the fact that stackalloc usage may kill any app with a stack overflow, and the defensive programmer has no way to proactively prevent this.

jkotas commented 6 years ago

Right, it does not compile. Also, what is missing from your example is the code to return the buffer to the pool once you are done. The pattern that works today looks like this:

https://github.com/dotnet/coreclr/blob/efebb38f3c18425c57f94ff910a50e038d13c848/src/mscorlib/shared/System/Text/ValueStringBuilder.cs https://github.com/dotnet/coreclr/blob/efebb38f3c18425c57f94ff910a50e038d13c848/src/mscorlib/shared/System/Number.Formatting.cs#L297

Or another example:

https://github.com/dotnet/corefx/pull/26942/files#diff-a07e741d448a2771beaea007ab99a266 https://github.com/dotnet/corefx/pull/26942/files#diff-00816d3a470b2701102e6be001c6e4d3R23
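
A compact sketch of the shape used in those files; the names and the 256-byte threshold are illustrative, and writing the stackalloc inside a conditional expression like this relies on later C# support for stackalloc in nested contexts:

using System;
using System.Buffers;
using System.Text;

static class Utf8Helper
{
    private const int StackAllocThreshold = 256; // illustrative small, fixed limit

    public static byte[] GetUtf8Bytes(string value)
    {
        int byteCount = Encoding.UTF8.GetByteCount(value);

        byte[] rented = null;
        // Stackalloc only up to a small fixed size; otherwise rent from the pool.
        Span<byte> buffer = byteCount <= StackAllocThreshold
            ? stackalloc byte[StackAllocThreshold]
            : (rented = ArrayPool<byte>.Shared.Rent(byteCount));
        try
        {
            int written = Encoding.UTF8.GetBytes(value.AsSpan(), buffer);
            return buffer.Slice(0, written).ToArray();
        }
        finally
        {
            if (rented != null)
                ArrayPool<byte>.Shared.Return(rented);
        }
    }
}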

benaadams commented 6 years ago

Though it's probably not sensible to do this on a different thread than the one you are running on (unless it is in a wait?); how about some utility APIs for querying the stack size instead?

So currently there is a maxStackSize parameter when creating a thread:

public Thread (ParameterizedThreadStart start, int maxStackSize);

But you can't query it back to find out what it was set to, so maybe something like the following would be better?

public partial class Thread
{
    public int TotalStackSize { get; }
    public int AvailableStackSize { get; }
    public int UsedStackSize { get; }
}

It's also less ambiguous than the guarantee that xSufficientStack would suggest.
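
A hypothetical call site for such a property might look like this; AvailableStackSize does not exist today, and the headroom constant and helper names are illustrative:

using System;
using System.Threading;

static class Example
{
    private const int StackHeadroom = 64 * 1024; // illustrative reserve for callees and runtime services

    public static void Process(int count)
    {
        // Hypothetical: AvailableStackSize is the property proposed above, not an existing Thread member.
        if (count <= 1024 &&
            Thread.CurrentThread.AvailableStackSize > count * sizeof(int) + StackHeadroom)
        {
            Span<int> buffer = stackalloc int[count];
            Work(buffer);
        }
        else
        {
            Work(new int[count]); // heap fallback
        }
    }

    private static void Work(Span<int> buffer)
    {
        // ... fill and consume the buffer ...
    }
}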

benaadams commented 6 years ago

Or perhaps in System.Diagnostics?

namespace System.Diagnostics.ThreadInfo
{
    public static class ThreadInfoExtensions
    {
        public static int TotalStackSize(this Thread thread);
        public static int AvailableStackSize(this Thread thread);
        public static int UsedStackSize(this Thread thread);
    }
}

kkokosa commented 6 years ago

Or perhaps in System.Diagnostics?

Moving it away from "mainstream" APIs like Thread seems like a good idea. And it seems to be a better place than the quite exotic System.Runtime.CompilerServices.

jkotas commented 6 years ago

It does not make sense for these methods to take a Thread, because they are applicable to the current thread only. For historic reasons, there are a number of methods like that on Thread that throw when called on anything but the current thread. We do not want to keep growing this set.

svick commented 6 years ago

Would it be more convenient if the user didn't have to compute the required size in bytes, but could instead just specify the type and the count of items of that type to allocate? So instead of:

RuntimeHelpers.TryEnsureSufficientExecutionStack(list.Count * PUT_HERE_YOUR_STRUCT_SIZE)

You would have:

RuntimeHelpers.TryEnsureSufficientExecutionStack<SomeStruct>(list.Count)

Though being able to specify the size in bytes is probably still useful, so this could be just a convenience overload (possibly added later than the one proposed here).
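
A sketch of how such a convenience overload could be layered on top of a size-in-bytes overload; both methods here are hypothetical stand-ins for the proposed APIs, placed in a helper class rather than on RuntimeHelpers itself:

using System;
using System.Runtime.CompilerServices;

public static class StackHelpers
{
    // Hypothetical convenience overload: takes an element type and count instead of a byte size.
    public static bool TryEnsureSufficientExecutionStack<T>(int count) where T : unmanaged
    {
        long sizeInBytes = (long)count * Unsafe.SizeOf<T>();
        return TryEnsureSufficientExecutionStack(sizeInBytes);
    }

    // Hypothetical size-in-bytes overload proposed earlier in this issue.
    public static bool TryEnsureSufficientExecutionStack(long sizeInBytes)
    {
        throw new NotImplementedException("Would be implemented by the runtime.");
    }
}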

ygc369 commented 6 years ago

I like this idea.

airbreather commented 6 years ago

I do appreciate this idea, but my intuition suggests that it's a lot less useful than it seems on its face.

Thinking about when you'd need a really big buffer (bigger than a 4K OS page, say) that could legally fit on the stack, allocating that buffer on the stack feels like it could very often cause a page fault when trying to return to the caller. Since we're probably talking about situations where these scratch buffers have input-dependent lengths, it seems to follow that larger buffers will correlate with more time spent in the method, which means I see the potential for a bit of a "perfect storm".

I'd imagine that someone could produce a contrived microbenchmark showing this as even worse than unconditionally allocating on the heap; more likely, each method will have its own threshold where it starts making sense to use ArrayPool<T>, and I imagine that this point will usually be well below "however much stack space there currently is".

Furthermore, it feels like as soon as we write a TryEnsureSufficientExecutionStack-guarded stackalloc, everything after that becomes "serious business" until the stack-allocated buffer goes out of scope.

I only bring up the above because several examples in the thread use TryEnsureSufficientExecutionStack as the only thing that determines stack vs. heap.

Finally, what are some actual situations that we're worried about here?

Again, though, I do appreciate the idea, and the sample in the original issue description seems to suggest that it's easy to implement. I just wanted to make sure to share my thoughts on it (it's been on my mind a ton since @svick linked it from dotnet/roslyn#25118), even if all it does is inspire a "Remarks" section in the docs.

jkotas commented 6 years ago

Completely agree with @airbreather. Similar discussion at https://github.com/dotnet/csharplang/issues/1344

kkokosa commented 6 years ago

I really like the direction of this discussion. My primary reasons for this proposal were twofold:

At the moment, I fully agree with the arguments you have presented. However, the fact that such stackalloc usage is most probably, in most cases, not optimal does not mean that people will not do it. Maybe a good remark in the documentation, stating clearly what this magic number is, will be enough. Currently, I would be a little afraid to advise a client to use stackalloc: when asked what size won't kill their application, the only answer is "well... probably something around 1 KiB".

I like @jkotas's idea of LocalBuffer<T> a lot as a way of addressing this problem. It would leave people who stackalloc on their own, but at least they would have a safer alternative.

svick commented 6 years ago

@airbreather

This means that the code in question needs to consider not just the size of their desired scratch buffer, but also the size of everything else that will be pushed onto the stack after it (including other methods being called, and their own calls, etc.) and encode all of that knowledge into a number that goes into the TryEnsureSufficientExecutionStack call.

EnsureSufficientExecutionStack already has an opinion about how much stack is required to execute "average .NET Framework function". Maybe you could tell TryEnsureSufficientExecutionStack how many functions you need to fit on the stack after this allocation? That way, you wouldn't be encoding it yourself and any magic numbers would be hidden from you. The code could look like:

RuntimeHelpers.TryEnsureSufficientExecutionStack(
    size: list.Count * STRUCT_SIZE, calledMethods: 10)

Drawaes commented 6 years ago

There are a couple of issues that feed into this. One is that the stack size now varies widely across platforms (for instance, 2 MB on Windows and a default of 8 MB on Ubuntu). Also, I am not sure about allocating large chunks of stack until we get the "don't zero the stack" feature. Without that, it's faster to pull from the array pool, because zeroing a large array will cost you more and, as said, it will need to be faulted in anyway.

airbreather commented 6 years ago

EnsureSufficientExecutionStack already has an opinion about how much stack is required to execute "average .NET Framework function". Maybe you could tell TryEnsureSufficientExecutionStack how many functions you need to fit on the stack after this allocation? That way, you wouldn't be encoding it yourself and any magic numbers would be hidden from you.

@svick looking through the code, the rule-of-thumb used by EnsureSufficientExecutionStack seems like it's already accounting for methods calling other methods at a "typical" rate (plus the overhead of CLR services), so I'm guessing that the implementation would be more like "return true iff the stack budget burned so far, plus the directly desired size, plus that rule-of-thumb value for the current thread, would be less than the current thread's max stack size".

On the one hand, that certainly feels like it would be safe enough, but on the other hand, is it maybe a bit too safe? If I'm a leaf method that's 50 KiB away from overflowing the stack (which might actually start happening because callers over-eagerly stack-allocate thanks to this guard), and this guard stops me from allocating 400 B, then I'm not happy.

Currently, I would be a little afraid to advise a client to use stackalloc: when asked what size won't kill their application, the only answer is "well... probably something around 1 KiB".

@kkokosa Without further context, if all you're given is a question phrased like "how much can I stackalloc before my application explodes?", then the perfect answer seems to depend both on how much stack is already burned by the time your method is called and how much stack the method will burn after the stackalloc (including any of the CLR services that might get invoked automatically like the GC).

Anyway, without that context, isn't the best answer still going to be something like that, "probably something around 1 KiB"? The helper utility would only help remove the "stack burned so far" part, not the "stack burned after the stackalloc" part.

It's almost like the more useful helper would be to just give a utility method that answers "how many more bytes can I push onto the stack before I hit the guard page?" (edit: or was it the page that comes immediately "before" the true guard page?) and just leave it up to the caller to interpret that according to their specific knowledge of what they intend to do in the future.
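
For what it's worth, here is a Windows-only sketch of what such a "bytes remaining" helper could measure, using the Win32 GetCurrentThreadStackLimits API; the class and method names are illustrative, and the result ignores the guard page and any stack guarantee region, so treat it only as a rough upper bound:

using System;
using System.Runtime.InteropServices;

static class StackProbe
{
    [DllImport("kernel32.dll")]
    private static extern void GetCurrentThreadStackLimits(out UIntPtr lowLimit, out UIntPtr highLimit);

    // Rough, Windows-only estimate of the bytes between the current stack pointer
    // and the low limit of the thread's stack region (guard pages not accounted for).
    public static unsafe ulong ApproximateBytesRemaining()
    {
        GetCurrentThreadStackLimits(out UIntPtr low, out _);
        byte marker = 0; // the address of a local approximates the current stack pointer
        return (ulong)(&marker) - (ulong)low;
    }
}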

iSazonov commented 6 years ago

In the PowerShell repo we got our first experience of using Span and stackalloc. We found that we use one pattern:

const int stackallocThreshold = 120;

Span<int> array;

int size = GetArraySize();
if (size < stackallocThreshold)
{
    array = stackalloc int[size];
}
else
{
    array = new int[size];
}

Based on the discussion, we would have to implement:

const int RedLineStackallocThreshold = 1024; // 1 kB ???
const int StackallocThreshold = 120;

Span<int> array;

int size = GetArraySize();
if (size < StackallocThreshold && AvailableStackSize() > RedLineStackallocThreshold)
{
    array = stackalloc int[size];
}
else
{
    array = new int[size];
}

RedLineStackallocThreshold protects from stack overflow. StackallocThreshold is important - it guarantees optimal use of the stack, because we use stackalloc in nested functions. This value can only be selected based on the properties of the application itself. In the PowerShell repo we use 120 - the default maximum width of the screen - and we are sure that in most scenarios the application will use the stack, but if the user sets extremely large parameters, the application will continue to work using the heap.

A fallback to the heap could be an interesting proposal for the discussion.

GSPP commented 5 years ago

Maybe the method should return the number of bytes left. That way the caller can apply any logic they like.

This would simplify the API as well. No need for a size parameter and comparison.

jkotas commented 4 years ago

https://github.com/dotnet/runtime/issues/25423 is a better proposal solving the same problem.