Expose Thread.GetCurrentProcessorId() as a public API

danmoseley commented 7 years ago

@VSadov commented on Thu Feb 16 2017

Environment.CurrentExecutionId seems to be a generally useful API, that can be used to correlate accesses to striped data in order to minimize unwanted cross core sharing.

Currently the best correlation option is the CurrentManagedThreadId which helps with continuity of accesses to the same stripe within a thread, but does not help much with preventing unwanted stripe sharing from threads running on separate cores.

@VSadov commented on Thu Feb 16 2017

Re: https://github.com/dotnet/coreclr/blob/e5f2df239b546ac9ed2417506c90af222eaf1013/src/mscorlib/src/System/Environment.cs#L709

@jkotas commented on Fri Feb 17 2017

Agree that it looks useful.

@VSadov Process for adding new public APIs https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md

Close the issue here, and open a new one in corefx with the API proposal.

@gkhanna79 commented on Sat Mar 04 2017

@danmosemsft Can you please move this to CoreFX?

joperezr commented 7 years ago

@VSadov do you want to make a formal Api Proposal for it so that it goes to review?

VSadov commented 7 years ago

@joperezr - yes, I want to write a formal proposal, just did not have time for that lately.

VSadov commented 7 years ago

It is often useful to associate separate portions of shared data with workers running on a different CPU cores. The goal is to minimize the penalty of false sharing.

A typical example would be a striped counter where workers would increment an instance of a sub-counter that is specific to the given execution context. A sub-counter would still use Interlocked.Add to ensure that increments are not dropped, but as long as the writes are not observed from other cores this would be very cheap. At the end of the entire job sub-counts are aggregated together into the final value - that will require cross-core memory accesses, but we have managed to avoid that expense while actively counting.

A practice of striping the scratch working set can have a dramatic impact on performance of concurrent code.

The Problem

The workers need to be associated with stripes.

Currently the best correlation option is the CurrentManagedThreadId. That does help with continuity of accesses to the same stripe while running on the same core, mostly because migration of a given thread to a different CPU is statistically rare. In fact the OS, tries to keep same threads running on same cores, specifically to maintain the locality of the memory references.

However this option is not ideal when there are more workers than stripes or when the workers do get rescheduled.

In theory one does not need more stripes than the number of cores, while the number of workers is often larger due to pooling or intentional oversubscription (a common practice to hide latencies due to syncronization).

Another, bigger, problem is the initial associating of a thread to a stripe. The thread ID is just a number with an open range. Reducing that to a number between 0 and ProcessorCount is going to be lossy and prone to collisions.
Lets say I have a thread with ID 42 and another thread with ID 50. If I derive the stripe ID from that by simply masking lowest 3 bits (makes sense on 8 core system), I will be matching both threads to the same stripe. If both threads get to run at the same time, I will have a serious sharing problem.

A correlation ID that is specific to the CPU core that the thread is running on at the time of the request would allow much more precise matching of workers to the nearest stripes (in terms of CPU cache).

Proposed APIs

 int Environment.GetCurrentProcessorNumber()

The API simply returns the ID of the currently executing CPU core, which is a number between 0 and Environment.ProcessorCount.

In a case where underlying OS does not support the necessary API (such as sched_getcpu), the result should be -1 indicating that the ID is not available and the caller needs to resort to other ways of correlating, like the thread ID or forgo the correlating altogether.

Implementation Details and other concerns.

In NUMA/hyperthreaded configurations not all cores are equally distanced and there could be scenarios sensitive to the core/cache topology so some users of GetCurrentProcessorNumber may want hat information as well. I suggest to wait until such a need arises and introduce an additional API for that purpose when/if needed.

jkotas commented 7 years ago

It is unlikely that any OS implementation of the underlying API is slow

The implementations of the underlying OS APIs tend to be slow. You pretty much always have to do caching to make it reasonable.

considering the purpose is synchronization.

You cannot ever use this API for synchronization. It is not reliable for synchronization. You can be running on a different core by the time you got the result. You can only ever use this API for performance optimizations.

the result should be -1 indicating that the ID is not available and the caller needs to resort to other ways of correlating

I do not that this is useful. I think that the runtime should approximate it (e.g. by assigning random numbers to different threads) if the underlying OS does not support it.

VSadov commented 7 years ago

CC: @joperezr @danmosemsft @jkotas

VSadov commented 7 years ago

@jkotas - i have misused "synchronization" there. What I meant is to introspect into things related to threading. - Basically the OS method would need to be reasonably fast to be useful at all.

VSadov commented 7 years ago

Also CC: @vancem

VSadov commented 7 years ago

My point for returning -1 as a failure is that I can see when user might want to know if numbers are fake. Providing a separate API to query whether the CurrentProcessorNumber is real, seems too heavy.

VSadov commented 7 years ago

I do not have a very strong opinion on the caching part. If the underlying OS implementations do tend to be slow, caching might be acceptable.

It would be less then ideal, but would still be better than CurrentManagedThreadId

jkotas commented 7 years ago

My point for returning -1 as a failure is that I can see when user might want to know if numbers are fake.

What would the user code do about this case? I believe the only reasonable option is to generate their own fake numbers. And so the API can just do it for them.

VSadov commented 7 years ago

@jkotas - there are multiple ways to fake the number. Make it 0always, mask low bits of ManagedThreadId, or perhaps hash/rotate/mix the ThreadId before masking to make it more random and remove possible bias. That would result in fewer collisions, but add extra cost...

I am just afraid we may not be able to provide a good-for-all solution.

jkotas commented 7 years ago

add extra cost...

The implementation of CurrentExecutionId has to have a cache for the mainline case anyway because of calling the OS each time is too expensive. An extra one-time cost for the fallback case does not matter.

provide a good-for-all solution.

The key value prop of .NET libraries is to provide good-enough-for-all solutions.

We can either have GetCurrentProcessorNumber() API that is slow, calls the underlying OS, returns -1 on failure.

And/or we can have CurrentExecutionId that caches the result, refreshes it once in a while; fakes the Id if the OS does not have the GetCurrentProcessorNumber.

Hybrid is not interesting.

VSadov commented 7 years ago

Would it be possible to have both GetCurrentProcessorNumber() and CurrentExecutionId ?

VSadov commented 7 years ago

CurrentExecutionId seems something that most people will want for CPU-specific data and we can implement it efficiently.

GetCurrentProcessorNumber() has pretty clear semantics and can be used for something other than managing CPU-specific data. Or can be used to implement alternative to CurrentExecutionId if user is unhappy with ours - does not like caching, but ok with OS costs, etc...

jkotas commented 7 years ago

Would it be possible to have both GetCurrentProcessorNumber() and CurrentExecutionId ?

Sounds fine to me.

vancem commented 7 years ago

This is OK, some feedback:

I am a big fan of NOT exposing extensibility points until we have real scenarios that need them. It is not clear that GetCurrentProcessorNumber() has a REAL scenario (someone who REALLY can't use CurrentExecutionId). What I have seen is that there is a theoretical case where CurrentExecutionId may not be enough and some user wants more control, but it is also definitely possible that no such user will ever show up.

My suggestion is that unless we REALLY know of sometone who needs is we hide GetCurrentProcessorNumber() for now, and expose it when it matters. It seems likely that benefit of them using GetCurrentProcessorNumber() over CurrentExecutionId won't be large (thus we really do need a real scenario and real perf measurements to decide, another reason to wait. (Note that we KNOW CurrentExecutionId is useful because we use it internally already and measured that it helps perf for the BufferPool logic).

My other suggestion is that CurrentExecutionId is a bad name, I know that we don't want to call it CurrentProcessorId, because it might be synthetic, but frankly that just makes it hard to understand what it is. My suggestion is to simply use 'CurrentProcessorId' (with documentation that indicates that it is approximate). If that seems too much like lying then call it 'ApproximateCurrentProcessorId'. My point is that you use it like a processor number, so you really should use 'Processor' in its name.

I am very supportive of exposing this API however. It will be very useful as we make our data structures scale to many-core processors.

VSadov commented 7 years ago

@vancem - if I have to pick one out of two proposed APIs, my choice would definitely be GetCurrentProcessorNumber().

I can implement CurrentExecutionId on top of that. Possibly better than the current implementation, because I have extra data relevant to my code. - My number of stripes is not always == ProcessCount, since it scales in response to contentions - I would have to mask ThreadID differently. I need to use timer anyways, so I could age the cache in terms of milliseconds, not in terms of accesses.

jkotas commented 7 years ago

@VSadov If you need to implement your own custom scheduling, you can PInvoke the underlying OS API yourself to get exactly what you need.

I agree with @vancem that exposing GetCurrentProcessorNumber has questionable value.

VSadov commented 7 years ago

Unfortunately GetCurrentProcessorNumber() is more likely to be what I need.

vancem commented 7 years ago

@VSadov - if you really know that you would use GetProcessorNumber() in an important scenario in CoreFX or similar important library (that is we really will ship it) and we have measured important performance/scalability value over CurrentExecutionId, then we should certainly include it.

I am not against GetProcessorNumber, I just know that we often create APIs that probably don't provide enough value (or are slightly the wrong thing). Thus I am simply suggesting that we do it in two pieces. Expose the part that we KNOW we need now and expose the other part when we know we need that.

I am suspicious that CurrentExeuctionId provides 90+? of the value in a very nice way, and so I am suspicious of the need for more AT THIS TIME. Is it so bad that we wait?

VSadov commented 7 years ago

I am often against redundant APIs, but GetCurrentProcessorNumber would simply expose something that every major OS provides.

The point is that CurrentExecutionId can be implemented on top of GetCurrentProcessorNumber and possibly better than the default implementation since user can use extra data from his app. The other way does not work, because CurrentExecutionId is a lossy API. And the lossy part is tuned for a very specific task - the buffer pool in CoreFX.

Basically, I do not think GetCurrentProcessorNumber is one of those strange/exotic/unsafe APIs that we will later regret exposing.

vancem commented 7 years ago

As I mentioned, I am not against GetCurrentProcessorNumber() if will have real users. It sounds like you might be one already. If you

Need this in code you are making that we will ship
The straightforward solution using CurrentExecutionId leads to either notably poorer scalability.

Then we are done, we should add the GetCurrentProcessorNumber() API. Are we there? If not, I am STILL not against it, I am just arguing that since we ship every 3 months, it would not be hard to add this when we CAN validate 1 and 2 above.

vancem commented 6 years ago

I happen to stumble across this issue in a search and wanted to do a update.

We have added System.Environment.CurrnetExecutionId to the System.Environment class. It gives you a small integer number that represents a guess as to what CPU processor you are running on. Because fetching the current processor number is expensive, and can't be perfectly accurate anyway (since the underlying OS scheduler can migrate the thread at any time), it caches the result and updates it a a relatively slow pace (but likely to be accurate most of the time since the OS doe not migrate threads often).

The intended use of this is for caches (you cache per CPU). This covers the most common use of GetProcessorNumber().

As mentioned above, we can add the GetProcessorNumber() but we should NOT do so unless we find compelling scenarios where it is useful and GetExecutionId can't cover the case.

We may wish to close this issue until such a scenario surfaces.

jkotas commented 6 years ago

We have added System.Environment.CurrentExecutionId to the System.Environment class

This method is internal. We have not added it to the public surface yet.

Drawaes commented 6 years ago

Can we have this please? It seems a little uncool to presume that MS is the only one that would ever need to partition data like this and not allow others access to it doing interop to the OS is very slow and this solution would be perfect for a number of scenarios I have.

stephentoub commented 6 years ago

seems a little uncool

I'm not arguing against exposing it, but what prevents anyone else from implementing a similar solution? Nothing here requires integration into the runtime.

Drawaes commented 6 years ago

The get currentprocessornumber is the bit I am talking about. It's intrinsic and internal.

stephentoub commented 6 years ago

And gets its data from public OS APIs.

Again, not saying it shouldn't be exposed, but the lack of it being exposed shouldn't block you if it's that valuable.

Drawaes commented 6 years ago

Uncool was probably a bit harsh :) Anyway the thing I am talking about is

internal static extern int CurrentProcessorNumber
{
    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    get;
}

The rest of the code I am happy to sort out. But it would be nice to have that exposed.

(FYI I did, but then you have to deal with xplat)

So I used reflection to grab it out and then write my own caching function over the top. It's just a bit messy.

jkotas commented 6 years ago

then write my own caching function over the top

Is your caching strategy different from the caching strategy used by Environment.CurrentExecutionId ?

Drawaes commented 6 years ago

No, just copy pasta. I think the caching strategy is fine that is there. TBH if you only exposed the cached one I would be happy with that as well. I just don't want to have to implement the OS API calls for each OS.

Drawaes commented 6 years ago

As a note of reference. This has always been an issue for us as we use big multi chip machines that are heavily data (not cacheable sizes) read dependent for things like Monte Carlo simulations.

In the past it has been easiest to spin up multiple processes and lock them to numa nodes. Then use IPC to combine the results. This is one way of handling this problem.

So we have always handled it in that case. However its becoming more of a "general" issue and we want to use libraries that are numa aware or core groupings of some description. If you look at AMD ThreadRipper and the latest high core count Intel Xeons they are Numa inside the chips, let alone big multi chip machines.

I think this API is a classic case of the framework/runtime being able to provide a consistent view of a property of the underlying system without having to understand how each OS deals with it.

Just my 2c for what its worth

Drawaes commented 6 years ago

One question on the caching logic. If you did expose it and someone uses the Cache badly and in a tight loop you might end up causing the ArrayPool functions to get hit with a lot of slow api calls. So maybe its better not to expose that cache?

Just a thought

jkotas commented 6 years ago

If you did expose it and someone uses the Cache badly and in a tight loop you might end up causing the ArrayPool functions to get hit with a lot of slow api calls.

I do not think that this is a problem. I think it is a good thing to have a central cache because of then everybody gets more recent and accurate value, and the cost gets amortized over all users.

Drawaes commented 6 years ago

Yeah thats a good point too. So in that case exposing the cached one is a good idea (as well or only again I don't mind).

jkotas commented 6 years ago

Proposed API:

public static class Environment
{
    public int CurrentProcessorId { get; }
}

Rationale and Usage

The API returns approximate current processor id. It allows partitioning and affinitizing data to physical processors. https://github.com/dotnet/corefx/issues/16767#issuecomment-289932408 has more detailed description of the problem.

This API exists as internal in CoreLib today (the current name is Environment.CurrentExecutionId). It is used internally by ArrayPool and Timer. We just need to rename it and expose it as public API.

jkotas commented 6 years ago

This is be pretty simple. Can it make 2.1 ?

VSadov commented 6 years ago

I think this would be a nice thing to have. 👍

terrajobst commented 6 years ago

Video

Makes but we believe it (1) should be a method and (2) live on Thread.

namespace System.Threading
{
    public partial class Thread
    {
        public static int GetCurrentProcessorId();
    }
}

Drawaes commented 6 years ago

Is current processor id the cached one, or the one that calls directly into the OS API. I am happy to do my own caching, just @jkotas made a good point that the cost/frequency can be amortized across users

VSadov commented 6 years ago

Any chance this will be using RDTSCP x86 instruction where available? It may not need be cached then.

benaadams commented 6 years ago

RDTSCP is more expensive than LOCK CMPXCHG? Also can fault based on privilege and proc mode?

jkotas commented 6 years ago

Right, RDTSCP is tricky:

It is serializing instruction, so it will be likely slower (though more accurate) than the cached version.
I am not able to find official OS documentation about useful processor ID guaranteed to be in the aux field
It suspect that it may have really bad performance in virtualized environments

Drawaes commented 6 years ago

I also think you need to be at a privileged level to even call it on most linux's (as Ben said) and I have seen a number of problems with VMs not getting it passed up correctly from the hypervisor so I think the cached version is the best middle ground for reliability and performance.

VSadov commented 6 years ago

Yeah. I wish OS guys made this info available for cheap. Could be just stashed in the TIB and updated on context switches..

Drawaes commented 6 years ago

Dreams are free ... :) But there is useful stuff like the GDI Brush in the TIB ... ;)

Drawaes commented 6 years ago

I am pretty sure the data FYI is in the KTHREAD struct (which of course is in the kernel space but still) so you would think it would be simple to put the cpu id in there when scheduling.... the TIB has plenty of space to burn.

If only there were people that worked at the OS's manufacturers company ;)

VSadov commented 6 years ago

Oh, the OS certainly knows what core the thread is running on, about to run on, or even used to run in the past. How else the soft affinity would be arranged...

It is just the APIs to fetch that from userspace are all slow.

When manycore becoming a norm, I am puzzled why there is no pressure to make this as fast as say GetTickCount.

Drawaes commented 6 years ago

Yeah its all in KTHREAD (next proc, etc) I think data partitioning, lock partitioning etc will become more important and I also was puzzled when looking into it as to why it was so expensive to find out where you are running?

VSadov commented 6 years ago

There is also RDPID instruction whose purpose seems to be specifically to read core ID fast. Perhaps that one will not need caching (when available).

dotnet / runtime