CLIP-HPC / SlurmCommander

Slurm TUI
MIT License
60 stars 6 forks source link

v1.0.1 - runtime error: invalid memory address or nil pointer dereference #9

Closed kilicomu closed 1 year ago

kilicomu commented 1 year ago

Hi,

With v1.0.1, I see:

Welcome to Slurm Commander!

2022/12/21 09:29:38 Trying conf file: /etc/scom/scom.conf
2022/12/21 09:29:38 Conf: FAILED reading /etc/scom/scom.conf
2022/12/21 09:29:38 Trying conf file: /users/klcm500/scom/scom.conf
2022/12/21 09:29:38 INFO: Configuration: &config.ConfigContainer{Prefix:"/usr/bin", Binpaths:map[string]string{"sacct":"/usr/bin/sacct", "sacctmgr":"/usr/bin/sacctmgr", "sbatch":"/usr/bin/sbatch", "scancel":"/usr/bin/scancel", "scontrol":"/usr/bin/scontrol", "sinfo":"/usr/bin/sinfo", "squeue":"/usr/bin/squeue", "sstat":"/usr/bin/sstat"}, Tick:0xa, TemplateDirs:[]string{"/users/klcm500/scom/templates"}}

Caught panic:

runtime error: invalid memory address or nil pointer dereference

Restoring terminal...

goroutine 1 [running]:
runtime/debug.Stack()
        /opt/hostedtoolcache/go/1.18.9/x64/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
        /opt/hostedtoolcache/go/1.18.9/x64/src/runtime/debug/stack.go:16 +0x19
github.com/charmbracelet/bubbletea.(*Program).Run.func1()
        /home/runner/go/pkg/mod/github.com/charmbracelet/bubbletea@v0.23.0/tea.go:385 +0x95
panic({0x5b4600, 0x765a50})
        /opt/hostedtoolcache/go/1.18.9/x64/src/runtime/panic.go:838 +0x207
github.com/CLIP-HPC/SlurmCommander/internal/model/tabs/jobhisttab.(*SacctJSON).FilterSacctTable(0xc000202e08, {0xc000a0a0b0, 0x0}, 0xc000202ce8?)
        /home/runner/work/SlurmCommander/SlurmCommander/internal/model/tabs/jobhisttab/jobhisttabtable.go:76 +0x2df
github.com/CLIP-HPC/SlurmCommander/internal/model.Model.Update({{0x0, 0x1, 0x1, {0x0, 0x0}, {0x0, 0x0}, 0x16a, 0x5f, 0xc0000c8230, ...}, ...}, ...)
        /home/runner/work/SlurmCommander/SlurmCommander/internal/model/update.go:439 +0x1291
github.com/charmbracelet/bubbletea.(*Program).eventLoop(0xc000194580, {0x6534f8?, 0xc0001d4000?}, 0x652c58?)
        /home/runner/go/pkg/mod/github.com/charmbracelet/bubbletea@v0.23.0/tea.go:324 +0x602
github.com/charmbracelet/bubbletea.(*Program).Run(0xc000194580)
        /home/runner/go/pkg/mod/github.com/charmbracelet/bubbletea@v0.23.0/tea.go:448 +0x8e7
main.main()
        /home/runner/work/SlurmCommander/SlurmCommander/cmd/scom/scom.go:142 +0x16ce
Goodbye!

shortly after starting the program.

The debug log doesn't seem to provide any useful info on this one:

SC: 09:33:39.786522 logger.go:36: Log file: scdebug.log
SC: 09:33:39.787134 jobtabcommands.go:56: QuickGetSqueue() start
SC: 09:33:39.787177 clustertabcommands.go:56: QuickGetSinfo() start
SC: 09:33:39.787869 view.go:98: Got NO error, insert newline
SC: 09:33:39.787890 view.go:105: CALL JobTab.View()
SC: 09:33:39.787912 jobtabview.go:169: IN JobTab.View()
SC: 09:33:39.789199 command.go:38: Fetching UserName
SC: 09:33:39.789342 jobfromtemplate.go:70: GetTemplateList reading dir: /users/klcm500/scom/templates
SC: 09:33:39.789642 update.go:342: Update: got WindowSizeMsg: 362 95
SC: 09:33:39.789814 jobtab.go:53: FixTableHeight(95) from 20
SC: 09:33:39.789852 jobtab.go:59: FixTableHeight to 80
SC: 09:33:39.789864 jobhisttab.go:50: FixTableHeight(95) from 20
SC: 09:33:39.789881 jobhisttab.go:56: FixTableHeight to 80
SC: 09:33:39.789893 clustertab.go:43: FixTableHeight(95) from 20
SC: 09:33:39.789905 clustertab.go:49: FixTableHeight to 70
SC: 09:33:39.789919 update.go:372: CTB Width = 0 
SC: 09:33:39.789965 update.go:374: CTB Width = 228 
SC: 09:33:39.790682 view.go:98: Got NO error, insert newline
SC: 09:33:39.792008 view.go:105: CALL JobTab.View()
SC: 09:33:39.792036 jobtabview.go:169: IN JobTab.View()
SC: 09:33:39.792454 command.go:49: Return UserName: klcm500
SC: 09:33:39.796592 update.go:318: Update: Got TemplatesListRows msg: jobfromtemplate.TemplatesListRows(nil)
SC: 09:33:39.797192 view.go:98: Got NO error, insert newline
SC: 09:33:39.797209 view.go:105: CALL JobTab.View()
SC: 09:33:39.797223 jobtabview.go:169: IN JobTab.View()
SC: 09:33:39.801291 update.go:291: Got UserNAme msg, save "klcm500" to Globals.
SC: 09:33:39.801760 command.go:72: GetUserAssoc about to run: /usr/bin/sacctmgr [list Association format=account -P -n user=klcm500]
SC: 09:33:39.801811 view.go:98: Got NO error, insert newline
SC: 09:33:39.801827 view.go:105: CALL JobTab.View()
SC: 09:33:39.801839 jobtabview.go:169: IN JobTab.View()
SC: 09:33:39.891619 update.go:405: U(): got SinfoJSON
SC: 09:33:39.891669 clustertabtable.go:75: FilterSinfoTable: rows 178 
SC: 09:33:39.903476 clustertab.go:63: GetStatsFiltered JobClusterTab start
SC: 09:33:39.904927 clustertab.go:111: GetStatsFiltered end 
SC: 09:33:39.905420 view.go:98: Got NO error, insert newline
SC: 09:33:39.905438 view.go:105: CALL JobTab.View()
SC: 09:33:39.905452 jobtabview.go:169: IN JobTab.View()
SC: 09:33:40.002325 command.go:94: Got UserAssoc klcm500 -> chem-acm-2018
SC: 09:33:40.002359 command.go:94: Got UserAssoc klcm500 -> its-training-2019
SC: 09:33:40.002373 command.go:94: Got UserAssoc klcm500 -> its-devel-2018
SC: 09:33:40.002385 command.go:94: Got UserAssoc klcm500 -> its-system-2018
SC: 09:33:40.002832 update.go:280: Got UserAssoc msg, value: command.UserAssoc{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
SC: 09:33:40.002884 update.go:283: Appended UserAssoc msg go Globals, value now: []string{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
SC: 09:33:40.002912 update.go:286: Appended UserAssoc msg go Globals, calling GetSacctHist()
SC: 09:33:40.003182 jobhisttabcommands.go:38: GetSacctHist("chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018") start: days 7, timeout: 30
SC: 09:33:40.003240 jobhisttabcommands.go:50: EXEC: "/usr/bin/sacct" ["-n" "--json" "-S" "now-7days" "-A" "chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018"]
SC: 09:33:40.003395 view.go:98: Got NO error, insert newline
SC: 09:33:40.003409 view.go:105: CALL JobTab.View()
SC: 09:33:40.003420 jobtabview.go:169: IN JobTab.View()
SC: 09:33:41.255036 jobhisttabcommands.go:59: EXEC returned: 7483796 bytes
SC: 09:33:41.433352 jobhisttabcommands.go:74: Unmarshalled 563 jobs from hist
SC: 09:33:41.433393 update.go:435: Got SacctJobHist len: 563 
SC: 09:33:41.433415 jobhisttabtable.go:60: FilterSacctTable: rows 563

I'm not familiar with Go, but I'll download the scom source and have a look at debugging the issue.

pja237 commented 1 year ago

Hey, thanks again for the report, i'm sorry, i'm super-short on time today, tonight perhaps i will continue, had such cases in the past, i'll give you the usual debug process that might shed some light:

        /home/runner/work/SlurmCommander/SlurmCommander/internal/model/tabs/jobhisttab/jobhisttabtable.go:76 +0x2df

this line breaks, check what it does and then execute the sacct call from the debug log:

SC: 09:33:40.003240 jobhisttabcommands.go:50: EXEC: "/usr/bin/sacct" ["-n" "--json" "-S" "now-7days" "-A" "chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018"]

Find the var from the offending line in json, might be that one of the var. references in that line comes in as empty in json (making the pointes go nil).

Hope that helps to kickstart your debugging, i'll join you later today/tomorrow.

kilicomu commented 1 year ago

@pja237 Thanks for the pointers. I added some of my own debugging statements to the code and found that it doesn't like this accounting database entry:

     {
       "account": "its-devel-2018",
       "comment": {
         "administrator": null
       },     
       "allocation_nodes": 3,
       "array": {
         "job_id": 0,
         "limits": {
           "max": {
             "running": {
               "tasks": 0
             }      
           }      
         },     
         "task": null,
         "task_id": null
       },     
       "association": null
     }

Which looks like a suspicious database entry to me! I'll take a look and see how this has got into the database.

Given that this kind of database entry is possible, maybe it's worth wrapping the string conversion here:

line := strings.Join([]string{
           strconv.Itoa(*v.JobId),
           *v.Name,
           *v.Qos,
           *v.Account,
           *v.User,
           *v.State.Current,
}, ".")

to mask the error caused by this problem and still display job information to the user. The problem jobs can be logged to stderr / the debug log.

kilicomu commented 1 year ago

Somewhat related, I've just noticed that sacct (at least with my Slurm version, 22.05.4) behaves fundamentally differently when --json is passed as an option.

Without --json, sacct only reports on my jobs. With --json, sacct reports on all user jobs, as if --allusers had been passed. Do you see the same behaviour? If so, is it intentional to see all user jobs for the selected accounts in the job history tab?

pja237 commented 1 year ago

Great catch,

is this the whole json entry for the job?

     {
       "account": "its-devel-2018",
       "comment": {
         "administrator": null
       },     
       "allocation_nodes": 3,
       "array": {
         "job_id": 0,
         "limits": {
           "max": {
             "running": {
               "tasks": 0
             }      
           }      
         },     
         "task": null,
         "task_id": null
       },     
       "association": null
     }

Error handling in this situation, like you've described is a must, will start working on that.

As far as the --json behavior goes, yes, it is same in slurm 21 as well, --json returns all jobs, only limiting that i've found to be working is the -A acct_list switch. That is the reason you'll get jobs from all users accounts in the jobhist tab. Now, we could filter that down to just the calling user, but then again, i let it be like that, filtering is easy and sometimes group members want to inspect their colleagues jobs, or get some run/wait time stats for the whole group.

kilicomu commented 1 year ago

Yes, that's the whole json entry for the job.

With you on the filtering, nicer to get a bigger picture first then filter down.

pja237 commented 1 year ago

That is one weird entry πŸ˜• It's then not just this that'll break: https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L75-L82

it'll also break later here: https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L85

and anywhere later in the code if we try to dereference any of those pointers, because they all get set to nil value if they're not received via json. Need to think how to handle this extreme... easiest would be to discard the whole entry if it doesn't have jobid set? Because this doesn't look like a meaningful job entry. How did this even get into the db? is this the first job entry in db? do you have any more like this? What do you think about discarding this altogether?

kilicomu commented 1 year ago

I currently have no idea how it got into the db!

I think checking the job id for something that looks valid and discarding if not makes sense - it's clearly a garbage entry. Worth flagging in the log that an entry was discarded when it happened (and why it was discarded), but the db entry is meaningless.

On Thu, 22 Dec 2022 at 12:29, PetarJ @.***> wrote:

That is one weird entry πŸ˜• It's then not just this that'll break:

https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L75-L82

it'll also break later here:

https://github.com/CLIP-HPC/SlurmCommander/blob/707f0307919b57bc26b838b9154005bcb963c134/internal/model/tabs/jobhisttab/jobhisttabtable.go#L85

and anywhere later in the code if we try to dereference any of those pointers, because they all get set to nil value if they're not received via json. Need to think how to handle this extreme... easiest would be to discard the whole entry if it doesn't have jobid set? Because this doesn't look like a meaningful job entry. How did this even get into the db? is this the first job entry in db? do you have any more like this? What do you think about discarding this altogether?

β€” Reply to this email directly, view it on GitHub https://github.com/CLIP-HPC/SlurmCommander/issues/9#issuecomment-1362782624, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG5SE3DLLE6GWIXJJJBMC3WORCSZANCNFSM6AAAAAATFNIW7M . You are receiving this because you authored the thread.Message ID: @.***>

-- Killian Murphy Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 3634

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm

pja237 commented 1 year ago

Can you download and try this build to see if we got it cleared: https://github.com/CLIP-HPC/SlurmCommander/suites/10011747303/artifacts/486355715

kilicomu commented 1 year ago

That's looking good!

I'll have more of a play with scom now and let you know if anything else comes up.

On Thu, 22 Dec 2022 at 13:23, PetarJ @.***> wrote:

Can you download and try this build to see if we got it cleared:

https://github.com/CLIP-HPC/SlurmCommander/suites/10011747303/artifacts/486355715

β€” Reply to this email directly, view it on GitHub https://github.com/CLIP-HPC/SlurmCommander/issues/9#issuecomment-1362836696, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG5SE7KHJFVTEFZ4KTLAGLWORI65ANCNFSM6AAAAAATFNIW7M . You are receiving this because you authored the thread.Message ID: @.***>

-- Killian Murphy Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 3634

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm

pja237 commented 1 year ago

Great news, i'll merge and do a release, let me know whenever you got any new ideas/bugs πŸ‘