NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Sonar/MLX: ML4 and ML7 are not reporting #688

Open lars-t-hansen opened 3 days ago

lars-t-hansen commented 3 days ago

I know ML7 has been reinstalled, and it lost its sonar install.

ML4 is a little more curious: it hasn't been reporting since Nov 11, although it has been up for longer than that. Sonar appears to run. It tries to send me mail about something, but the mail never arrives, so mail is misconfigured too (too many things about email don't work for us to rely on it at all, see #585).

Looks like ML4 could be affected by the problem with parsing rocm-smi output:

[larstha@ml4 sonar-main]$ git checkout release_0_12 
branch 'release_0_12' set up to track 'origin/release_0_12'.
Switched to a new branch 'release_0_12'
[larstha@ml4 sonar-main]$ cargo build
   Compiling sonar v0.12.2 (/itf-fi-ml/home/larstha/p/sonar-main)
    Finished dev [unoptimized + debuginfo] target(s) in 1.87s
[larstha@ml4 sonar-main]$ target/debug/sonar ps
thread 'main' panicked at src/amd.rs:223:22:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and then

[larstha@ml4 sonar-main]$ git checkout main 
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
[larstha@ml4 sonar-main]$ cargo build
   Compiling sonar v0.13.0-devel (/itf-fi-ml/home/larstha/p/sonar-main)
    Finished dev [unoptimized + debuginfo] target(s) in 1.75s
[larstha@ml4 sonar-main]$ target/debug/sonar ps
v=0.13.0-devel,time=2024-11-19T09:04:28+01:00,host=ml4.hpc.uio.no,user=root,cmd=xfsalloc,pid=1376,ppid=2
v=0.13.0-devel,time=2024-11-19T09:04:28+01:00,host=ml4.hpc.uio.no,user=root,cmd=ksoftirqd/60,pid=383,ppid=2,cputime_sec=3
v=0.13.0-devel,time=2024-11-19T09:04:28+01:00,host=ml4.hpc.uio.no,user=root,cmd=ksoftirqd/8,pid=65,ppid=2,cputime_sec=1
...
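
For checking by hand whether a host produces usable records, the lines printed above can be split into fields. This is a minimal sketch that assumes each record is a comma-separated list of `key=value` pairs with no quoting, as in the three lines shown; a value containing a comma would need proper CSV handling. `parse_record` is a hypothetical helper, not part of Sonar.

```rust
use std::collections::HashMap;

/// Split one `sonar ps` record into a key -> value map.
/// Assumption: fields are separated by ',' and never quoted.
/// Fields without an '=' are skipped rather than treated as errors.
pub fn parse_record(line: &str) -> HashMap<&str, &str> {
    line.split(',')
        .filter_map(|field| field.split_once('='))
        .collect()
}
```

Looking up `host` and `time` in the resulting map is enough to confirm which node produced a record and when.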

So the AMD parsing change on main should (probably) be backported to v0.12, or we should move to v0.13-devel on ML4.
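
The panic on release_0_12 comes from an `Option::unwrap()` on a `None` value, which is what happens when a parser assumes a field is always present. The sketch below illustrates the general alternative of propagating `None` with `?` so that an unexpected line is skipped instead of aborting the whole run. It does not reproduce the code in `src/amd.rs`: the `index,busy_percent` input format, `GpuSample`, and both function names are made up for illustration.

```rust
/// Hypothetical record for one GPU; the field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct GpuSample {
    pub index: u32,
    pub busy_percent: f64,
}

/// Parse one line of a made-up `index,busy_percent` format.
/// Returns None instead of panicking when a field is missing or malformed.
pub fn parse_gpu_line(line: &str) -> Option<GpuSample> {
    let mut fields = line.split(',');
    let index = fields.next()?.trim().parse::<u32>().ok()?;
    let busy_percent = fields.next()?.trim().parse::<f64>().ok()?;
    Some(GpuSample { index, busy_percent })
}

/// Keep the lines that parse and skip the rest, so one unexpected line
/// loses some GPU data rather than the whole sample.
pub fn parse_gpu_lines(text: &str) -> Vec<GpuSample> {
    text.lines().filter_map(parse_gpu_line).collect()
}
```

With this shape, a change in the tool's output format would leave the GPU fields empty while the rest of the process data is still reported.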

lars-t-hansen commented 3 days ago

Looks like the upgrade to the AMD drivers broke the AMD info gathering in Sonar: https://github.com/NordicHPC/sonar/issues/209.