CLIP-HPC / SlurmCommander

Slurm TUI
MIT License

Failed JSON parsing in Slurm 22.05.4 #1

Closed: kilicomu closed this issue 1 year ago

kilicomu commented 1 year ago

Hi,

Just following up with what the program tells me to do:

I tried running the precompiled binary from the releases page. The 'Cluster' features seem to work, but the job features do not, e.g.

ERROR: squeue JSON failed to parse

Let me know if there is any useful troubleshooting I can do.

pja237 commented 1 year ago

Hey, thanks for the feedback. Due to the "fragility" of Slurm's JSON interface (the changes they make are quite interesting), this has happened before. Can you try running it with the DEBUG variable turned on, e.g. DEBUG=1 ./scom? You'll then get an additional error line in the TUI, and the scdebug.log file will be created, where you can grep for the exact error line. Usually it's a quick fix, so let's try that.

kilicomu commented 1 year ago

Sure, done:

1 SC: 14:39:14.075837 logger.go:36: Log file: scdebug.log
  2 SC: 14:39:14.076149 jobtabcommands.go:56: QuickGetSqueue() start
  3 SC: 14:39:14.076174 clustertabcommands.go:56: QuickGetSinfo() start
  4 SC: 14:39:14.076555 view.go:98: Got NO error, insert newline
  5 SC: 14:39:14.076569 view.go:105: CALL JobTab.View()
  6 SC: 14:39:14.076583 jobtabview.go:169: IN JobTab.View()
  7 SC: 14:39:14.077379 command.go:38: Fetching UserName
  8 SC: 14:39:14.077533 update.go:342: Update: got WindowSizeMsg: 362 97
  9 SC: 14:39:14.079149 jobtab.go:53: FixTableHeight(97) from 20
 10 SC: 14:39:14.079164 jobtab.go:59: FixTableHeight to 82
 11 SC: 14:39:14.079171 jobhisttab.go:50: FixTableHeight(97) from 20
 12 SC: 14:39:14.079179 jobhisttab.go:56: FixTableHeight to 82
 13 SC: 14:39:14.079186 clustertab.go:43: FixTableHeight(97) from 20
 14 SC: 14:39:14.079193 clustertab.go:49: FixTableHeight to 72
 15 SC: 14:39:14.077387 jobfromtemplate.go:70: GetTemplateList reading dir: /users/klcm500/scom/templates
 16 SC: 14:39:14.079200 update.go:372: CTB Width = 0
 17 SC: 14:39:14.079242 update.go:374: CTB Width = 228
 18 SC: 14:39:14.078960 command.go:49: Return UserName: klcm500
 19 SC: 14:39:14.079680 view.go:98: Got NO error, insert newline
 20 SC: 14:39:14.079690 view.go:105: CALL JobTab.View()
 21 SC: 14:39:14.079699 jobtabview.go:169: IN JobTab.View()
 22 SC: 14:39:14.082400 update.go:291: Got UserNAme msg, save "klcm500" to Globals.
 23 SC: 14:39:14.082745 command.go:72: GetUserAssoc about to run: /usr/bin/sacctmgr [list Association format=account -P -n user=klcm500]
 24 SC: 14:39:14.082757 view.go:98: Got NO error, insert newline
 25 SC: 14:39:14.082850 view.go:105: CALL JobTab.View()
 26 SC: 14:39:14.082858 jobtabview.go:169: IN JobTab.View()
 27 SC: 14:39:14.085442 update.go:318: Update: Got TemplatesListRows msg: jobfromtemplate.TemplatesListRows(nil)
 28 SC: 14:39:14.085767 view.go:98: Got NO error, insert newline
 29 SC: 14:39:14.085777 view.go:105: CALL JobTab.View()
 30 SC: 14:39:14.085785 jobtabview.go:169: IN JobTab.View()
 31 SC: 14:39:14.170649 update.go:405: U(): got SinfoJSON
 32 SC: 14:39:14.170705 clustertabtable.go:75: FilterSinfoTable: rows 178
 33 SC: 14:39:14.181820 clustertab.go:63: GetStatsFiltered JobClusterTab start
 34 SC: 14:39:14.183179 clustertab.go:111: GetStatsFiltered end
 35 SC: 14:39:14.183607 view.go:98: Got NO error, insert newline
 36 SC: 14:39:14.183622 view.go:105: CALL JobTab.View()
 37 SC: 14:39:14.183634 jobtabview.go:169: IN JobTab.View()
 38 SC: 14:39:14.275519 command.go:94: Got UserAssoc klcm500 -> chem-acm-2018
 39 SC: 14:39:14.275559 command.go:94: Got UserAssoc klcm500 -> its-training-2019
 40 SC: 14:39:14.275579 command.go:94: Got UserAssoc klcm500 -> its-devel-2018
 41 SC: 14:39:14.275602 command.go:94: Got UserAssoc klcm500 -> its-system-2018
 42 SC: 14:39:14.275977 update.go:280: Got UserAssoc msg, value: command.UserAssoc{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
 43 SC: 14:39:14.276034 update.go:283: Appended UserAssoc msg go Globals, value now: []string{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
 44 SC: 14:39:14.276068 update.go:286: Appended UserAssoc msg go Globals, calling GetSacctHist()
 45 SC: 14:39:14.276291 jobhisttabcommands.go:38: GetSacctHist("chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018") start: days 7, timeout: 30
 46 SC: 14:39:14.276356 jobhisttabcommands.go:50: EXEC: "/usr/bin/sacct" ["-n" "--json" "-S" "now-7days" "-A" "chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018"]
 47 SC: 14:39:14.276735 view.go:98: Got NO error, insert newline
 48 SC: 14:39:14.276754 view.go:105: CALL JobTab.View()
 49 SC: 14:39:14.276768 jobtabview.go:169: IN JobTab.View()
 50 SC: 14:39:15.839383 jobhisttabcommands.go:59: EXEC returned: 7578660 bytes
 51 SC: 14:39:15.839466 jobhisttabcommands.go:64: Error unmarshall: "invalid character 's' looking for beginning of value"
 52 SC: 14:39:15.839543 update.go:262: ERROR msg, from: GetSacctHist
 53 SC: 14:39:15.839590 update.go:263: ERROR msg, original error: "invalid character 's' looking for beginning of value"
 54 SC: 14:39:15.840095 view.go:95: Got error
 55 SC: 14:39:15.840118 view.go:105: CALL JobTab.View()
 56 SC: 14:39:15.840131 jobtabview.go:169: IN JobTab.View()
 57 SC: 14:39:17.815098 update.go:262: ERROR msg, from: GetSqueue
 58 SC: 14:39:17.815400 update.go:263: ERROR msg, original error: "json: cannot unmarshal array into Go struct field V0039JobResources.Jobs.job_resources.allocated_nodes of type map[string]openapi.V0039NodeAllocation"
 59 SC: 14:39:17.815628 view.go:95: Got error
 60 SC: 14:39:17.815636 view.go:105: CALL JobTab.View()
 61 SC: 14:39:17.815641 jobtabview.go:169: IN JobTab.View()
 62 SC: 14:39:22.517698 view.go:95: Got error
 63 SC: 14:39:22.517735 view.go:105: CALL JobTab.View()
 64 SC: 14:39:22.517748 jobtabview.go:169: IN JobTab.View()
 65 SC: 14:39:22.521394 view.go:95: Got error
 66 SC: 14:39:22.521420 view.go:105: CALL JobTab.View()
 67 SC: 14:39:22.521431 jobtabview.go:169: IN JobTab.View()

kilicomu commented 1 year ago

Running the sacct command outside of scom gives me this error:

sacct: error: _parser_dump: failed on field association: Unable to convert Data type

followed by the JSON output. So I guess scom expects a { at the beginning of the sacct output but instead sees the 's' of "sacct: error" from stderr?
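
For illustration, here is a minimal Go sketch of that hypothesis. It assumes stderr and stdout end up concatenated before parsing; the helper name parseErr and the tiny {"jobs": []} payload are made up for the example and are not taken from scom's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseErr (hypothetical helper) tries to decode raw as a JSON object and
// returns the error text, or "" when decoding succeeds.
func parseErr(raw []byte) string {
	var v map[string]interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Stand-ins for what sacct writes; the JSON payload is invented.
	stderr := []byte("sacct: error: _parser_dump: failed on field association: Unable to convert Data type\n")
	stdout := []byte(`{"jobs": []}`)

	// Assumption: both streams are captured into a single buffer.
	combined := append(append([]byte{}, stderr...), stdout...)

	fmt.Printf("combined:    %q\n", parseErr(combined))
	// combined:    "invalid character 's' looking for beginning of value"
	fmt.Printf("stdout only: %q\n", parseErr(stdout))
	// stdout only: ""
}
```

The first byte the decoder sees is the 's' of "sacct", which matches the message in the debug log above.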

pja237 commented 1 year ago

That is possible. Let's just double-check this line:

58 SC: 14:39:17.815400 update.go:263: ERROR msg, original error: "json: cannot unmarshal array into Go struct field V0039JobResources.Jobs.job_resources.allocated_nodes of type map[string]openapi.V0039NodeAllocation"

Can you please cut this part out of the JSON you're getting and give me a sample? This is how it looks for me:

       "job_resources": {
         "nodes": "clip-c2-38",
         "allocated_cpus": 1,
         "allocated_hosts": 1,
         "allocated_nodes": {
           "0": {
             "sockets": {
               "1": "unassigned"
             },
             "cores": {
               "17": "unassigned"
             },
             "memory": 1024,
             "cpus": 1
           }
         }
       },

In the meantime I'll check the openapi.json for changes; that may well have happened too.
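
One way to tolerate both shapes of allocated_nodes is a custom unmarshaller, sketched below in Go. This is not the fix that went into the project. The types NodeAllocation, AllocatedNodes and JobResources are trimmed-down, hypothetical stand-ins for the generated OpenAPI types, and the array payload is a guess at what the affected Slurm versions emit, since the thread only shows the object form.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// NodeAllocation is a hypothetical, reduced stand-in for the generated type.
type NodeAllocation struct {
	Memory int `json:"memory"`
	Cpus   int `json:"cpus"`
}

// AllocatedNodes accepts either a JSON object keyed by node index
// or a JSON array of node allocations.
type AllocatedNodes map[string]NodeAllocation

func (a *AllocatedNodes) UnmarshalJSON(data []byte) error {
	data = bytes.TrimSpace(data)
	out := map[string]NodeAllocation{}
	if len(data) > 0 && data[0] == '[' {
		// Array form: use the position in the array as the map key.
		var list []NodeAllocation
		if err := json.Unmarshal(data, &list); err != nil {
			return err
		}
		for i, n := range list {
			out[fmt.Sprint(i)] = n
		}
	} else if err := json.Unmarshal(data, &out); err != nil {
		// Object form: decoded into a plain map, so no recursion occurs.
		return err
	}
	*a = out
	return nil
}

// JobResources is a hypothetical, reduced stand-in for the generated type.
type JobResources struct {
	Nodes          string         `json:"nodes"`
	AllocatedNodes AllocatedNodes `json:"allocated_nodes"`
}

// Object form as in the sample above; array form is an assumed variant.
const objectShape = `{"nodes": "clip-c2-38", "allocated_nodes": {"0": {"memory": 1024, "cpus": 1}}}`
const arrayShape = `{"nodes": "clip-c2-38", "allocated_nodes": [{"memory": 1024, "cpus": 1}]}`

// decode parses one job_resources fragment.
func decode(raw string) (JobResources, error) {
	var r JobResources
	err := json.Unmarshal([]byte(raw), &r)
	return r, err
}

func main() {
	for _, raw := range []string{objectShape, arrayShape} {
		r, err := decode(raw)
		fmt.Println(r.AllocatedNodes["0"].Cpus, err)
	}
}
```

With a plain map[string]NodeAllocation field, the array form would fail with a "cannot unmarshal array" error like the one in log line 58.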

pja237 commented 1 year ago

Can you download the build from the PR and give it a try: https://github.com/CLIP-HPC/SlurmCommander/suites/9973575104/artifacts/483581784

kcgthb commented 1 year ago

I can confirm the same findings with Slurm 22.05.6 and that #2 resolves the issue.

I also get the same warning:

jobhisttabcommands.go:64: Error unmarshall: "invalid character 's' looking for beginning of value"

which I guess is related to https://bugs.schedmd.com/show_bug.cgi?id=15334. sacct --json may report non-fatal errors like sacct: error: ... which will generate JSON parsing errors. Those messages are sent to stderr though, so filtering out stderr and only parsing stdout may be sufficient to solve the issue?
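
A rough Go sketch of that suggestion follows. It is an illustration rather than the code from the PR: runJSON is a hypothetical helper, and a small sh command stands in for sacct so the example runs without Slurm installed.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// runJSON (hypothetical helper) runs a command and decodes only its stdout
// as JSON. Stderr is collected separately so it can still be logged.
func runJSON(name string, args ...string) (map[string]interface{}, string, error) {
	cmd := exec.Command(name, args...)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr // keep diagnostics out of the JSON stream

	out, err := cmd.Output() // Output returns stdout only
	if err != nil {
		return nil, stderr.String(), err
	}
	var v map[string]interface{}
	if err := json.Unmarshal(out, &v); err != nil {
		return nil, stderr.String(), err
	}
	return v, stderr.String(), nil
}

func main() {
	// Stand-in for `sacct --json`: a non-fatal message on stderr, JSON on stdout.
	script := `echo 'sacct: error: simulated non-fatal error' >&2; echo '{"jobs": []}'`

	v, warnings, err := runJSON("sh", "-c", script)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("parsed keys: %d, stderr: %q\n", len(v), warnings)
}
```

By contrast, exec.Cmd.CombinedOutput returns both streams in one buffer, which would put the "sacct: error: ..." text in front of the JSON and reproduce the parsing error.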

pja237 commented 1 year ago

Great findings and pointing the way :+1:

I've updated the PR to capture only stdout from both the sacct and squeue calls: https://github.com/CLIP-HPC/SlurmCommander/suites/9977768211/artifacts/483890657

Does this work for both of you now (22.05.4 and 22.05.6)?

kcgthb commented 1 year ago

Yes, the new changes in PR #2 seem to fix the issue for me; no more parsing errors are reported. Thanks!

pja237 commented 1 year ago

Most excellent, thank you for the troubleshooting effort. I'll merge this and do a release, then tackle the other RFEs.

PS @kcgthb @kilicomu would you have anything against becoming "contributing" people on the about tab?

kcgthb commented 1 year ago

> PS @kcgthb @kilicomu would you have anything against becoming "contributing" people on the about tab?

No problem for me, happy to contribute!

kilicomu commented 1 year ago

@pja237 No problem for me either.

Your patch seems to resolve the original issue, thanks, but I now see a different error. I'll open a separate issue for that!