Closed kilicomu closed 1 year ago
Hey, thanks again for the report. I'm sorry, I'm super short on time today; I'll perhaps continue tonight. I've had such cases in the past, so here is the usual debug process that might shed some light:
/home/runner/work/SlurmCommander/SlurmCommander/internal/model/tabs/jobhisttab/jobhisttabtable.go:76 +0x2df
This is the line that breaks. Check what it does, then execute the sacct call from the debug log:
SC: 09:33:40.003240 jobhisttabcommands.go:50: EXEC: "/usr/bin/sacct" ["-n" "--json" "-S" "now-7days" "-A" "chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018"]
Find the variables from the offending line in the JSON. It might be that one of the variables referenced in that line comes in empty in the JSON (leaving its pointer nil).
Hope that helps to kickstart your debugging, I'll join you later today/tomorrow.
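To illustrate the failure mode described above, here is a minimal sketch of how Go's encoding/json leaves pointer fields nil when a key is absent from the input. The `job` struct, its JSON tags, and `decodeJob` are hypothetical stand-ins, not scom's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job is a hypothetical, trimmed-down stand-in for the struct that sacct
// output is decoded into; the field names and JSON tags are assumptions.
type job struct {
	JobId   *int    `json:"job_id"`
	Name    *string `json:"name"`
	Account *string `json:"account"`
}

// decodeJob unmarshals a single JSON object into a job.
func decodeJob(raw string) (job, error) {
	var j job
	err := json.Unmarshal([]byte(raw), &j)
	return j, err
}

func main() {
	// Only "account" is present, similar to the sparse entry in this thread.
	j, err := decodeJob(`{"account": "its-devel-2018"}`)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// Keys missing from the JSON leave the pointer at its zero value, nil,
	// so dereferencing j.JobId or j.Name here would panic.
	fmt.Println("account set:", j.Account != nil)
	fmt.Println("job_id set:", j.JobId != nil)
	fmt.Println("name set:", j.Name != nil)
}
```

Running the sacct command from the debug log and looking for entries where the expected keys are missing or null should point at the record that triggers the panic.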
@pja237 Thanks for the pointers. I added some of my own debugging statements to the code and found that it doesn't like this accounting database entry:
{
  "account": "its-devel-2018",
  "comment": {
    "administrator": null
  },
  "allocation_nodes": 3,
  "array": {
    "job_id": 0,
    "limits": {
      "max": {
        "running": {
          "tasks": 0
        }
      }
    },
    "task": null,
    "task_id": null
  },
  "association": null
}
Which looks like a suspicious database entry to me! I'll take a look and see how this has got into the database.
Given that this kind of database entry is possible, maybe it's worth wrapping the string conversion here:
line := strings.Join([]string{
	strconv.Itoa(*v.JobId),
	*v.Name,
	*v.Qos,
	*v.Account,
	*v.User,
	*v.State.Current,
}, ".")
to mask the error caused by this problem and still display job information to the user. The problem jobs can be logged to stderr / the debug log.
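One way the wrapping suggested above could look is a pair of nil-safe helpers that substitute a placeholder instead of dereferencing a nil pointer. This is only a sketch: the `job` and `jobState` types, the helper names `strOr` and `intOr`, the `filterLine` function, and the "-" placeholder are all assumptions made for illustration, not scom's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// jobState and job are hypothetical stand-ins that mirror only the fields
// used in the snippet from this thread.
type jobState struct {
	Current *string
}

type job struct {
	JobId   *int
	Name    *string
	Qos     *string
	Account *string
	User    *string
	State   jobState
}

// missing is the placeholder shown for fields that were not in the JSON.
const missing = "-"

// strOr returns the pointed-to string, or the placeholder if p is nil.
func strOr(p *string) string {
	if p == nil {
		return missing
	}
	return *p
}

// intOr returns the pointed-to int as a string, or the placeholder if p is nil.
func intOr(p *int) string {
	if p == nil {
		return missing
	}
	return strconv.Itoa(*p)
}

// filterLine builds the same dot-joined line without ever dereferencing nil.
func filterLine(v job) string {
	return strings.Join([]string{
		intOr(v.JobId),
		strOr(v.Name),
		strOr(v.Qos),
		strOr(v.Account),
		strOr(v.User),
		strOr(v.State.Current),
	}, ".")
}

func main() {
	account := "its-devel-2018"
	// Only Account is set, as in the problem entry.
	fmt.Println(filterLine(job{Account: &account})) // prints "-.-.-.its-devel-2018.-.-"
}
```

The trade-off is that a broken entry still appears in the table with placeholder values, so logging it separately remains useful.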
Somewhat related, I've just noticed that sacct (at least with my Slurm version, 22.05.4) behaves fundamentally differently when --json is passed as an option.
Without --json, sacct only reports on my jobs. With --json, sacct reports on all user jobs, as if --allusers had been passed. Do you see the same behaviour? If so, is it intentional to see all user jobs for the selected accounts in the job history tab?
Great catch,
is this the whole json entry for the job?
{ "account": "its-devel-2018", "comment": { "administrator": null }, "allocation_nodes": 3, "array": { "job_id": 0, "limits": { "max": { "running": { "tasks": 0 } } }, "task": null, "task_id": null }, "association": null }
Error handling in this situation, like you've described, is a must. I will start working on that.
As far as the --json behavior goes, yes, it is the same in Slurm 21 as well: --json returns all jobs. The only limiting I've found to work is the -A acct_list switch.
That is the reason you'll get jobs from all users of those accounts in the jobhist tab.
Now, we could filter that down to just the calling user, but I let it be like that: filtering is easy, and sometimes group members want to inspect their colleagues' jobs or get some run/wait time stats for the whole group.
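If narrowing the list to the calling user were ever wanted, a post-processing filter over the decoded jobs could be as small as the sketch below. The `job` type and `filterByUser` are hypothetical names for illustration; entries with no user set are skipped rather than dereferenced.

```go
package main

import "fmt"

// job is a hypothetical stand-in carrying only the fields needed here.
type job struct {
	JobId *int
	User  *string
}

// filterByUser keeps only the jobs whose User field is set and equals user.
func filterByUser(jobs []job, user string) []job {
	var out []job
	for _, j := range jobs {
		if j.User != nil && *j.User == user {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	alice, bob := "alice", "bob"
	jobs := []job{{User: &alice}, {User: &bob}, {}}
	fmt.Println(len(filterByUser(jobs, "alice"))) // prints 1
}
```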
Yes, that's the whole json entry for the job.
With you on the filtering, nicer to get a bigger picture first then filter down.
That is one weird entry. It's then not just this that'll break: https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L75-L82
it'll also break later here: https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L85
and anywhere later in the code if we try to dereference any of those pointers, because they all get set to nil if they're not received via JSON.
Need to think how to handle this extreme... easiest would be to discard the whole entry if it doesn't have jobid set?
Because this doesn't look like a meaningful job entry.
How did this even get into the db? is this the first job entry in db? do you have any more like this?
What do you think about discarding this altogether?
I currently have no idea how it got into the db!
I think checking the job id for something that looks valid and discarding if not makes sense - it's clearly a garbage entry. Worth flagging in the log that an entry was discarded when it happened (and why it was discarded), but the db entry is meaningless.
-- Killian Murphy Research Software Engineer
Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 3634
e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
Can you download and try this build to see if we got it cleared: https://github.com/CLIP-HPC/SlurmCommander/suites/10011747303/artifacts/486355715
That's looking good!
I'll have more of a play with scom now and let you know if anything else comes up.
Great news, I'll merge and do a release. Let me know whenever you've got any new ideas/bugs.
Hi,
With v1.0.1, I see:
shortly after starting the program.
The debug log doesn't seem to provide any useful info on this one:
I'm not familiar with Go, but I'll download the scom source and have a look at debugging the issue.