Open kimishpatel opened 2 years ago
@Maratyszcza would love to get your inputs here. Thanks!
Thanks @kimishpatel Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.
Is there a simple explanation?
Not sure what you mean. Are you saying that your intuitive understanding is that it should be a simpler change?
The changes in other places are required because we want to:
My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in pthreadpool, and then sends some of them to sleep right away. I.e., work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.
That is correct, but a more appropriate fix requires us to have a separate command buffer for each thread in the pool. However, that is a much larger change, so I refrained from it. Another reason is that it also complicates work stealing, although it should be doable.
My thought was to try such a change in a follow-up PR, but I am open to suggestions, and if you think you want to do that from the get-go, that's also OK.
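To make the trade-off concrete, here is a minimal sketch (function name and slicing scheme are illustrative, not pthreadpool's actual internals) of capping the active thread count at dispatch time: threads at or above the cap receive an empty slice, so they still wake up, find nothing to do, and go back to sleep, paying the wake-up latency anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: divide `range` work items across at most
 * `threads_count` of the pool's `max_threads_count` threads.
 * Threads with tid >= threads_count get an empty slice. */
static void thread_slice(size_t range, size_t threads_count,
                         size_t max_threads_count, size_t tid,
                         size_t* start, size_t* end) {
    assert(threads_count <= max_threads_count);
    if (tid >= threads_count) {
        /* Unused thread: wakes up, sees an empty slice, sleeps again. */
        *start = *end = range;
        return;
    }
    *start = range * tid / threads_count;
    *end = range * (tid + 1) / threads_count;
}
```

For example, with `threads_count = 2` on a 4-thread pool and `range = 10`, threads 0 and 1 get [0, 5) and [5, 10), while threads 2 and 3 get empty slices.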
Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.
Yes, my bad. I did not realize this. Will fix.
> My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in pthreadpool, and then sends some of them to sleep right away. I.e., work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.
Also, btw, regarding the behavior we observed where 4 threads can get mapped to 3 cores, which results in threads swapping each other out: this solution still works even though unused threads are spuriously woken up, since, I suppose, the latency of the compute tends to be longer than that of waking and going back to sleep.
@Maratyszcza can you please re-review this?
In the latest commit I have addressed two of your concerns:
I personally feel the second commit somewhat diminishes the value of what we are trying to achieve here. If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2. On the other hand, I do understand your concern about multiple threadpool objects being subject to the same constraint.
If you have a better suggestion, I am happy to hear.
Look forward to your comments.
Responding here:
Overall, the code needs substantial changes:
- Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.
Internal API changes are needed because we add per-thread command and wait logic.
- Please rename the existing `threads_count` to `max_threads_count` and use `threads_count` for the new functionality.
Good suggestion. Will follow up.
- Validate that `threads_count <= max_threads_count` once in the public API that sets this variable. Don't clip to `max_threads_count` in other functions; just assume this holds (optionally add an `assert` for this).
Makes sense.
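For illustration, the suggested split could look like the following sketch (the struct and function names are made up here, not pthreadpool's real API): validation happens once at the public boundary, while internal code only asserts the invariant.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct pool_sketch {
    size_t max_threads_count; /* pool size, fixed at creation */
    size_t threads_count;     /* threads actually used for dispatch */
};

/* Public setter: the only place the invariant is checked. */
int pool_sketch_set_threads_count(struct pool_sketch* pool, size_t n) {
    if (n == 0 || n > pool->max_threads_count) {
        return EINVAL;
    }
    pool->threads_count = n;
    return 0;
}

/* Internal code assumes the invariant instead of clipping. */
static size_t dispatch_width(const struct pool_sketch* pool) {
    assert(pool->threads_count <= pool->max_threads_count);
    return pool->threads_count;
}
```

Rejecting an out-of-range value at the boundary keeps every internal code path free of clamping logic.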
- Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.
Will do.
@Maratyszcza this leaves me with one high level question that I mentioned in my earlier message. I am copy pasting that here:
Should `thread_count` be set per threadpool object, or set as a thread-local variable?
If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
I do understand your concern that, with the thread-local pattern, multiple threadpool objects are subject to the same constraint.
> Should `thread_count` be set per threadpool object, or set as a thread-local variable?
pthreadpool is a low-level library, and for low-level libraries it is preferred to avoid global objects. If necessary, users can implement this functionality at a higher level on top of pthreadpool.
> If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.
> I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.
We usually use a singleton pthreadpool object; thus, multiple threads use a single threadpool. There are cases when multiple models can be running simultaneously. When the threads running individual models are not interleaved, the implementation of this PR is OK. However, we don't have any control over how these run in the software stack in which PyTorch is integrated. If two threads running two different models start using the new API, then we may run into weird performance issues that might be hard to debug, such as the last thread to set the thread count winning.
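The "last thread to set the count wins" hazard can be reduced to a tiny sketch (the struct and field names here are illustrative, not pthreadpool's): with a single per-pool field, interleaved set calls mean both callers dispatch with whatever value was written last.

```c
#include <stddef.h>

struct shared_pool_sketch {
    size_t threads_count; /* single shared setting for all callers */
};

/* Illustrative only: thread 1 sets its preferred cap, then thread 2
 * overwrites it before thread 1 dispatches. Thread 1 now runs with
 * thread 2's value. */
size_t interleaved_setting(struct shared_pool_sketch* pool,
                           size_t thread1_wants, size_t thread2_wants) {
    pool->threads_count = thread1_wants; /* thread 1 sets its cap...   */
    pool->threads_count = thread2_wants; /* ...then thread 2 overwrites */
    return pool->threads_count;          /* value thread 1 dispatches with */
}
```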
The issue is that once we expose this API to clients, they expect certain behavior that is not guaranteed. That is why the thread-local setting seems to make more sense to me.
But if you feel strongly about this, I understand.
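The thread-local alternative argued for above could take roughly this shape (purely a sketch: the function name and the 0-means-uncapped convention are assumptions, not part of this PR). Each calling thread caps only its own dispatches, so two threads sharing a global pool cannot clobber each other's setting.

```c
#include <stddef.h>

/* 0 means "no cap": use the pool's full thread count. */
static _Thread_local size_t tls_threads_cap = 0;

/* Hypothetical API name: caps dispatches made by the calling thread. */
void set_caller_threads_cap(size_t cap) {
    tls_threads_cap = cap;
}

/* Effective count for a dispatch issued by the calling thread. */
size_t effective_threads_count(size_t pool_threads_count) {
    if (tls_threads_cap != 0 && tls_threads_cap < pool_threads_count) {
        return tls_threads_cap;
    }
    return pool_threads_count;
}
```

The corresponding downside, noted earlier, is that the cap follows the thread, so it applies to every threadpool object that thread dispatches into.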
Tested on Windows and Mac.
Summary: This PR adds a way to use fewer threads than the number pthreadpool is configured with. Occasionally it has been observed that using # of threads = # of logical cores is not efficient. This can be due to system load and various other factors that lead to threads either being mapped to slower cores or being mapped to fewer cores than the number of logical cores (as actually seen). Thus this PR attempts to fix this.
Approach:
Test Plan: 4 tests are added to check this.
Reviewers:
Subscribers:
Tasks:
Tags: