Open kimishpatel opened 2 years ago
@Maratyszcza would love to get your inputs here. Thanks!
Thanks @kimishpatel Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.
Is there a simple explanation?
Not sure what you mean. Are you saying that your intuitive understanding is that it should be a simpler change?
The changes in other places are required because we want to:
My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in pthreadpool, and then sends some of them to sleep right away. I.e., work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.
That is correct, but a more appropriate fix requires us to have a separate command buffer for each thread in the pool. However, that is a much larger change, so I refrained from it. Another reason is that it also complicates work stealing, although it should be doable.
My thought was to try such a change in a follow-up PR, but I am open to suggestions, and if you think you want to do that from the get-go, that's also OK.
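To make the trade-off concrete, here is a minimal sketch (function name and slicing scheme are illustrative, not pthreadpool's actual internals) of capping the active thread count at dispatch time: threads at or above the cap receive an empty slice, so they still wake up, find nothing to do, and go back to sleep, paying the wake-up latency anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: divide `range` work items across at most
 * `threads_count` of the pool's `max_threads_count` threads.
 * Threads with tid >= threads_count get an empty slice. */
static void thread_slice(size_t range, size_t threads_count,
                         size_t max_threads_count, size_t tid,
                         size_t* start, size_t* end) {
    assert(threads_count <= max_threads_count);
    if (tid >= threads_count) {
        /* Unused thread: wakes up, sees an empty slice, sleeps again. */
        *start = *end = range;
        return;
    }
    *start = range * tid / threads_count;
    *end = range * (tid + 1) / threads_count;
}
```

For example, with `threads_count = 2` on a 4-thread pool and `range = 10`, threads 0 and 1 get [0, 5) and [5, 10), while threads 2 and 3 get empty slices.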
Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.
Yes, my bad. I did not realize this. Will fix.
> My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in pthreadpool, and then sends some of them to sleep right away. I.e., work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.
Also, btw, regarding the behavior we observed where 4 threads can get mapped to 3 cores, which results in threads swapping each other out: this solution still works even though unused threads are spuriously woken up, since, I suppose, the latency of the compute tends to be longer than that of waking and going back to sleep.
@Maratyszcza can you please re-review this?
In the latest commit I have addressed two of your concerns:
I personally feel the second commit somewhat diminishes the value of what we are trying to achieve here. If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2. On the other hand, I do understand your concern about multiple threadpool objects being subject to the same constraint.
If you have a better suggestion, I am happy to hear.
Look forward to your comments.
Responding here:
Overall, the code needs substantial changes:
- Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.
Internal API changes are needed because we add per-thread command and wait logic.
- Please rename the existing `threads_count` to `max_threads_count` and use `threads_count` for the new functionality.
Good suggestion. Will follow up.
- Validate that `threads_count <= max_threads_count` once in the public API that sets this variable. Don't clip to `max_threads_count` in other functions; just assume this holds (optionally add an `assert` for this).
Makes sense.
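For illustration, the suggested split could look like the following sketch (the struct and function names are made up here, not pthreadpool's real API): validation happens once at the public boundary, while internal code only asserts the invariant.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct pool_sketch {
    size_t max_threads_count; /* pool size, fixed at creation */
    size_t threads_count;     /* threads actually used for dispatch */
};

/* Public setter: the only place the invariant is checked. */
int pool_sketch_set_threads_count(struct pool_sketch* pool, size_t n) {
    if (n == 0 || n > pool->max_threads_count) {
        return EINVAL;
    }
    pool->threads_count = n;
    return 0;
}

/* Internal code assumes the invariant instead of clipping. */
static size_t dispatch_width(const struct pool_sketch* pool) {
    assert(pool->threads_count <= pool->max_threads_count);
    return pool->threads_count;
}
```

Rejecting an out-of-range value at the boundary keeps every internal code path free of clamping logic.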
- Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.
Will do.
@Maratyszcza this leaves me with one high level question that I mentioned in my earlier message. I am copy pasting that here:
Should `thread_count` be set per threadpool object, or set as a thread-local variable?
If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
I do understand your concern that, with the thread-local pattern, multiple threadpool objects are subject to the same constraint.
> Should `thread_count` be set per threadpool object, or set as a thread-local variable?
pthreadpool is a low-level library, and for low-level libraries it is preferred to avoid global objects. If necessary, users can implement this functionality at a higher level on top of pthreadpool.
> If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now, if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.
> I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.
We usually use a singleton pthreadpool object; thus, multiple threads use a single threadpool. There are cases when multiple models can be running simultaneously. When the threads running individual models are not interleaved, the implementation of this PR is OK. However, we don't have any control over how these run in the software stack in which PyTorch is integrated. If two threads running two different models start using the new API, then we may run into weird performance issues that might be hard to debug, such as the last thread to set the thread count winning.
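The "last thread to set the count wins" hazard can be reduced to a tiny sketch (the struct and field names here are illustrative, not pthreadpool's): with a single per-pool field, interleaved set calls mean both callers dispatch with whatever value was written last.

```c
#include <stddef.h>

struct shared_pool_sketch {
    size_t threads_count; /* single shared setting for all callers */
};

/* Illustrative only: thread 1 sets its preferred cap, then thread 2
 * overwrites it before thread 1 dispatches. Thread 1 now runs with
 * thread 2's value. */
size_t interleaved_setting(struct shared_pool_sketch* pool,
                           size_t thread1_wants, size_t thread2_wants) {
    pool->threads_count = thread1_wants; /* thread 1 sets its cap...   */
    pool->threads_count = thread2_wants; /* ...then thread 2 overwrites */
    return pool->threads_count;          /* value thread 1 dispatches with */
}
```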
The issue is that once we expose this API to clients, they expect certain behavior that is not guaranteed. That is why the thread-local setting seems to make more sense to me.
But if you feel strongly about this, I understand.
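The thread-local alternative argued for above could take roughly this shape (purely a sketch: the function name and the 0-means-uncapped convention are assumptions, not part of this PR). Each calling thread caps only its own dispatches, so two threads sharing a global pool cannot clobber each other's setting.

```c
#include <stddef.h>

/* 0 means "no cap": use the pool's full thread count. */
static _Thread_local size_t tls_threads_cap = 0;

/* Hypothetical API name: caps dispatches made by the calling thread. */
void set_caller_threads_cap(size_t cap) {
    tls_threads_cap = cap;
}

/* Effective count for a dispatch issued by the calling thread. */
size_t effective_threads_count(size_t pool_threads_count) {
    if (tls_threads_cap != 0 && tls_threads_cap < pool_threads_count) {
        return tls_threads_cap;
    }
    return pool_threads_count;
}
```

The corresponding downside, noted earlier, is that the cap follows the thread, so it applies to every threadpool object that thread dispatches into.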
Tested on Windows and Mac.
Summary: This PR adds a way to use fewer threads than the number pthreadpool is configured with. Occasionally it has been observed that using # of threads = # of logical cores is not efficient. This can be due to system load and various other factors that lead to threads either being mapped to slower cores or being mapped to fewer cores than the number of logical cores (as actually seen). Thus this PR attempts to fix this.
Approach:
Test Plan: 4 tests are added to check this.
Reviewers:
Subscribers:
Tasks:
Tags: