OleHolmNielsen / Slurm_tools

My tools for the Slurm HPC workload manager
GNU General Public License v3.0

Flag '-F' showing magenta nodes #14

Closed tardigradus closed 2 years ago

tardigradus commented 3 years ago

Hi Ole,

Using the flag -f was giving me too many results, so I tried -F. However, this still shows magenta nodes:

$ pestat -F -u alice
Print only nodes that are flagged by * (RED nodes)
Select a single Slurm user: alice
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
    c103           main*      mix  14  32   14.32    192000    89277* 7683123 alice 7677623(7677209_27) bob 7677209(7677209_28) bob 7678231 carol *
    c104           main*      mix  13  32   13.32    192000    74430* 7683130 alice 7689301(7688525_740) dave 7689290(7688525_729) dave 7678231 carol *

Obviously you can't see the colour above, but everything in the Freemem and Joblist columns is magenta. There is nothing in red.

What am I doing wrong?

OleHolmNielsen commented 3 years ago

The Memsize and Freemem are read from the sinfo command output. What does sinfo tell you about the magenta nodes, for example:

$ sinfo -N -n c103,c104 -O "NodeList:30,CPUsState:30,CPUsLoad:30,Memory:30,FreeMem:30"

The coloring (red or magenta) is applied only if this test in pestat is true: freemem < memory*memory_thres1 (or memory_thres2). pestat defines memory_thres1=0.1 and memory_thres2=0.2, so Freemem must be <20% of Memsize to get magenta, and <10% to get red. Therefore I cannot make sense of your output from pestat. Are you using the latest, unmodified version of pestat from GitHub?
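
The threshold test described above can be sketched in shell with awk. This is only an illustration under stated assumptions, not the actual pestat code: the function name classify_freemem and the labels red, magenta and none are made up here, and only the thresholds 0.1 and 0.2 and the comparison itself come from the comment above. The sample values are the Memsize and Freemem reported for c103 and c104 in the first pestat output.

```shell
#!/bin/sh
# Thresholds as quoted in this thread (fractions of Memsize).
memory_thres1=0.1   # Freemem below 10% of Memsize: red
memory_thres2=0.2   # Freemem below 20% of Memsize: magenta

# classify_freemem MEMSIZE_MB FREEMEM_MB
# Hypothetical helper: prints "red", "magenta" or "none".
classify_freemem() {
    awk -v mem="$1" -v free="$2" \
        -v t1="$memory_thres1" -v t2="$memory_thres2" 'BEGIN {
        if (free < mem * t1)      print "red"
        else if (free < mem * t2) print "magenta"
        else                      print "none"
    }'
}

classify_freemem 192000 89277   # c103
classify_freemem 192000 74430   # c104
```

Both nodes have Freemem well above 20% of Memsize, so under this test neither would be coloured for low memory.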

tardigradus commented 3 years ago

I am using the latest version from git without any local settings in /etc/pestat.conf or ~/.pestat.conf. If I look at the corresponding output of sinfo, I get

$ pestat -F -n c100
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=c100
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
    c100           main*      mix  19  32   19.04     95300    89098  7767715 alice 7777681 bob 7777694 bob *

$ sinfo -N -n c100 -O "NodeList:30,CPUsState:30,CPUsLoad:30,Memory:30,FreeMem:30"
NODELIST                      CPUS(A/I/O/T)                 CPU_LOAD                      MEMORY                        FREE_MEM
c100                          19/13/0/32                    19.04                         95300                         89098

$ echo '100*89098/95300' | bc -l
93.49213011542497376705

So it looks like the entry is being correctly identified as having Freemem < 10%, but is not being displayed in red.

However I have just run the command twice more:

$ pestat -F -n c100
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=c100
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
    c100           main*      mix  10  32   13.31*    95300    89105  7767715 malischewski 7777694 samolesnik *

In the above the CPUload is shown in red and the Joblist in magenta.

$ pestat -F -n c100
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=c100
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
    c100           main*      mix  10  32   11.33*    95300    89105  7767715 malischewski 7777694 samolesnik *

In the above the CPUload is shown in magenta and the Joblist also in magenta.

That seems rather strange.

OleHolmNielsen commented 3 years ago

In your examples, the Freemem is >90% (89098 out of 95300), so that should neither be red nor magenta! Do you have some other examples with Freemem being a small value (<10% of Memsize)?

tardigradus commented 3 years ago

Oops! You are right, of course. No, I don't currently have any nodes with Freemem <10% of Memsize, but even for the ones with <20%, Freemem is never magenta although Joblist is. Perhaps there is something odd about my terminal.

tardigradus commented 2 years ago

This issue seems to be fixed in the current version so I'm closing it.