OpenEneaLinux / rt-tools

Collection of Linux tools for achieving real-time performance
BSD 3-Clause "New" or "Revised" License

Multithreaded RT process not running on multiple cores #6

Closed cinderblock closed 3 years ago

cinderblock commented 5 years ago

Hello, my apologies if this is not a great place to ask, but I thought it was worth a shot.

I'm using partrt to make my Node.js application the only thing running on 2 of my Raspberry Pi 3/4's 4 cores. The following seems to work, and jitter in many parts of my application is reduced as expected.

sudo partrt create 0xc
sudo partrt run -c 0x8 rt /usr/bin/node daemon
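(For reference, partrt takes hexadecimal CPU bitmasks: 0xc = 0b1100 selects CPUs 2 and 3 for the RT partition, and 0x8 = 0b1000 pins the command to CPU 3. A small sketch to decode such masks; the maskToCpus helper is my own, not part of partrt:)

```javascript
// Decode a CPU affinity bitmask into a list of CPU indices.
// Bit n set => CPU n is included in the mask.
function maskToCpus(mask) {
  const cpus = [];
  for (let cpu = 0; mask >> cpu; cpu++) {
    if ((mask >> cpu) & 1) cpus.push(cpu);
  }
  return cpus;
}

console.log(maskToCpus(0xc)); // [ 2, 3 ] -> RT partition: CPUs 2 and 3
console.log(maskToCpus(0x8)); // [ 3 ]    -> run the command on CPU 3 only
```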

This is also confirmed by mpstat -P ALL 5 2:

06:29:20 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:29:25 PM  all   35.01    0.00    3.02    0.10    0.00    0.72    0.00    0.00    0.00   61.16
06:29:25 PM    0   40.04    0.00    0.63    0.21    0.00    2.73    0.00    0.00    0.00   56.39
06:29:25 PM    1   10.65    0.00    1.25    0.21    0.00    0.21    0.00    0.00    0.00   87.68
06:29:25 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:29:25 PM    3   88.93    0.00   10.06    0.00    0.00    0.00    0.00    0.00    0.00    1.01

06:29:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:29:30 PM  all   18.81    0.00    7.95    0.05    0.00    4.69    0.00    0.00    0.00   68.50
06:29:30 PM    0    0.81    0.00    7.86    0.20    0.00   18.55    0.00    0.00    0.00   72.58
06:29:30 PM    1    2.50    0.00    3.54    0.00    0.00    0.00    0.00    0.00    0.00   93.96
06:29:30 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:29:30 PM    3   72.63    0.00   20.58    0.00    0.00    0.00    0.00    0.00    0.00    6.79

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   26.89    0.00    5.49    0.08    0.00    2.71    0.00    0.00    0.00   64.84
Average:       0   20.04    0.00    4.32    0.21    0.00   10.79    0.00    0.00    0.00   64.65
Average:       1    6.57    0.00    2.40    0.10    0.00    0.10    0.00    0.00    0.00   90.82
Average:       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       3   80.87    0.00   15.26    0.00    0.00    0.00    0.00    0.00    0.00    3.87

Unfortunately, this doesn't quite get me to where I need to be.

Other parts of my testing have pointed at certain less critical parts of the code blocking the more RT/jitter-sensitive aspects. Because of this, I'm working on moving those less critical parts into a child_process.fork(). As a first step, I want to test the worst case: what happens if the forked child uses too much CPU? My basic understanding is that a simple while (true) {} should eat all the CPU time and should be an effective test for this. However, when I look at the CPU usage, it seems both threads were put on the same core:

06:37:51 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:37:56 PM  all   27.10    0.00    3.49    0.10    0.00    0.71    0.00    0.00    0.00   68.60
06:37:56 PM    0   10.29    0.00    3.99    0.42    0.00    2.94    0.00    0.00    0.00   82.35
06:37:56 PM    1    3.78    0.00    3.59    0.00    0.00    0.00    0.00    0.00    0.00   92.63
06:37:56 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:37:56 PM    3   93.60    0.00    6.40    0.00    0.00    0.00    0.00    0.00    0.00    0.00

06:37:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:38:01 PM  all   30.77    0.00    5.70    0.00    0.00    4.05    0.00    0.00    0.00   59.48
06:38:01 PM    0   25.05    0.00    1.86    0.00    0.00   15.03    0.00    0.00    0.00   58.07
06:38:01 PM    1    4.13    0.00   14.13    0.00    0.00    0.00    0.00    0.00    0.00   81.74
06:38:01 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:38:01 PM    3   92.20    0.00    7.80    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   28.94    0.00    4.60    0.05    0.00    2.39    0.00    0.00    0.00   64.02
Average:       0   18.13    0.00    2.86    0.20    0.00    9.36    0.00    0.00    0.00   69.46
Average:       1    3.95    0.00    8.63    0.00    0.00    0.00    0.00    0.00    0.00   87.42
Average:       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       3   92.90    0.00    7.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Are my assumptions wrong? Will Node.js keep all the threads (including libuv workers) on the same core as the parent? Is while (true); not a good test? Is there some functionality of partrt that I'm missing? Thanks.

matslil commented 5 years ago

In the RT partition SMP is disabled, meaning that no load balancing will be done. You need to specify which core to run on. The partrt script can help you with this, but it requires splitting your application into separate binaries, each started with a separate partrt invocation.

The other option would be to use a system call from the application itself, requesting that a thread be moved to a certain core. I don't know Node.js or JavaScript well enough to give hints on how to do this, though.

/Mats

cinderblock commented 5 years ago

Thank you very much for the reply. This is very helpful. Armed with this info, I'll do some searching. Looking towards answering these:

Is SMP required to be off for partrt's quasi RT mode to work? Should I be able to turn on SMP just for my multi-threaded "RT" daemon?

Is the "separate binaries" requirement because partrt works on PIDs and a single process with many threads doesn't have multiple PIDs?

If I were to get the system call (sched_setaffinity, right?) to move a thread to a certain core, would that be on a per-thread basis? Looks like this should be pretty easy with https://www.npmjs.com/package/nodeaffinity

cinderblock commented 5 years ago

Looks like nodeaffinity is working. I can start the node application with partrt and it consistently runs all the threads on cpu 3. In node, I call setAffinity(4) and can see that the main node thread has moved to cpu 2 (which was previously idle), but some of the computation is left on cpu 3, which I presume is Node's libuv worker threads. I haven't figured out a good way to verify this beyond seeing that the threads I assume are libuv workers use more CPU load during heavy file/network IO.

matslil commented 5 years ago

Yes, turning off SMP is part of what makes the RT partition real-time, since SMP implies that the kernel will move threads around between cores whenever the load changes. Each such move will suspend the thread until the move has finished, which is bad for real-time characteristics.

The requirement for separate binaries is merely an API thing. There are other tools available that can move your threads after they've started, although this makes it hard for the application to know when it can expect real-time performance. That's why I suggested letting the application move selected threads to other cores itself.

/Mats

matslil commented 5 years ago

Sounds like a good solution! Perhaps the main thread should start in the non-RT partition and then move the RT thread to a core in the RT partition?

/Mats

cinderblock commented 5 years ago

If a thread is the only thread with affinity for a particular core, shouldn't SMP not matter (for that thread/core)?

I was considering trying something like this (starting in non-RT and moving the critical thread to RT). I only have one, maybe two, threads that I'd like to run with as little jitter as possible. The rest should only run as needed and don't need strict timing guarantees.

My naive understanding is that partrt also sets up "cpu sets" that do a sort of extra level of partitioning and would prevent a process from spanning across the sets. I presume this has to do with how the scheduler works. Maybe this understanding is flawed?

I'm also considering replicating partrt's functionality in my codebase. It might be nice not to rely on an extra tool being installed to start properly. It is handy to keep all the logic in a single language, and my initial read-through suggests partrt just interacts with the kernel through sysfs, so it shouldn't be too bad. What am I missing?
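(As a rough sketch of what such a partition setup boils down to at the cpuset level, assuming the cgroup-v1 cpuset controller mounted at /sys/fs/cgroup/cpuset; the rt/nrt set names are illustrative and actually applying the writes requires root, so this only builds the plan:)

```javascript
// Build the list of cpuset writes that carve the CPUs into an RT
// partition (load balancing off) and a non-RT partition.
function partitionPlan(rtCpus, nrtCpus) {
  const base = '/sys/fs/cgroup/cpuset';
  return [
    // cpuset.cpus and cpuset.mems must be set before tasks
    // can be attached to a cpuset.
    { path: `${base}/rt/cpuset.cpus`, value: rtCpus },
    { path: `${base}/rt/cpuset.mems`, value: '0' },
    // Disabling load balancing is what keeps the scheduler from
    // migrating threads between the RT CPUs.
    { path: `${base}/rt/cpuset.sched_load_balance`, value: '0' },
    { path: `${base}/nrt/cpuset.cpus`, value: nrtCpus },
    { path: `${base}/nrt/cpuset.mems`, value: '0' },
  ];
}

// Example: CPUs 2-3 form the RT partition, CPUs 0-1 everything else.
console.log(partitionPlan('2-3', '0-1'));
```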

matslil commented 3 years ago

Sorry for the late reply (private life got in between...), which unfortunately means it was a long time ago that I thought about partrt's design. Here is what I remember (if any of your questions are still relevant): basically you have two CPU sets, one RT and one non-RT. The sets merely say which CPUs belong to which category. With SMP disabled in the RT partition, the scheduler will not change which CPU a process runs on once it has started, while in the non-RT partition the scheduler is free to do load balancing as usual.

Feel free to study the code, see how things are done, and make your own customized version of it; that is what open-source code is for. One note, though: if what you inherit into your code is complex, you still effectively have a dependency, since if something happens to be fixed in this tool, you might want that fix in your application as well. In this specific case I don't think that is an issue, since the code is not very complex and this tool isn't changing rapidly. So see this as more of a general comment.