Closed kilicomu closed 1 year ago
Hey,
thanks for the feedback.
Because Slurm's JSON interface is "fragile" (the changes they make are quite interesting), this has happened before.
Can you try running it with the DEBUG variable turned on, e.g. DEBUG=1 ./scom
Then you'll get an additional error line in the TUI, and the scdebug.log file will also be created; there you can grep for the exact error line. Usually it's a quick fix, so let's try that.
Sure, done:
1 SC: 14:39:14.075837 logger.go:36: Log file: scdebug.log
2 SC: 14:39:14.076149 jobtabcommands.go:56: QuickGetSqueue() start
3 SC: 14:39:14.076174 clustertabcommands.go:56: QuickGetSinfo() start
4 SC: 14:39:14.076555 view.go:98: Got NO error, insert newline
5 SC: 14:39:14.076569 view.go:105: CALL JobTab.View()
6 SC: 14:39:14.076583 jobtabview.go:169: IN JobTab.View()
7 SC: 14:39:14.077379 command.go:38: Fetching UserName
8 SC: 14:39:14.077533 update.go:342: Update: got WindowSizeMsg: 362 97
9 SC: 14:39:14.079149 jobtab.go:53: FixTableHeight(97) from 20
10 SC: 14:39:14.079164 jobtab.go:59: FixTableHeight to 82
11 SC: 14:39:14.079171 jobhisttab.go:50: FixTableHeight(97) from 20
12 SC: 14:39:14.079179 jobhisttab.go:56: FixTableHeight to 82
13 SC: 14:39:14.079186 clustertab.go:43: FixTableHeight(97) from 20
14 SC: 14:39:14.079193 clustertab.go:49: FixTableHeight to 72
15 SC: 14:39:14.077387 jobfromtemplate.go:70: GetTemplateList reading dir: /users/klcm500/scom/templates
16 SC: 14:39:14.079200 update.go:372: CTB Width = 0
17 SC: 14:39:14.079242 update.go:374: CTB Width = 228
18 SC: 14:39:14.078960 command.go:49: Return UserName: klcm500
19 SC: 14:39:14.079680 view.go:98: Got NO error, insert newline
20 SC: 14:39:14.079690 view.go:105: CALL JobTab.View()
21 SC: 14:39:14.079699 jobtabview.go:169: IN JobTab.View()
22 SC: 14:39:14.082400 update.go:291: Got UserNAme msg, save "klcm500" to Globals.
23 SC: 14:39:14.082745 command.go:72: GetUserAssoc about to run: /usr/bin/sacctmgr [list Association format=account -P -n user=klcm500]
24 SC: 14:39:14.082757 view.go:98: Got NO error, insert newline
25 SC: 14:39:14.082850 view.go:105: CALL JobTab.View()
26 SC: 14:39:14.082858 jobtabview.go:169: IN JobTab.View()
27 SC: 14:39:14.085442 update.go:318: Update: Got TemplatesListRows msg: jobfromtemplate.TemplatesListRows(nil)
28 SC: 14:39:14.085767 view.go:98: Got NO error, insert newline
29 SC: 14:39:14.085777 view.go:105: CALL JobTab.View()
30 SC: 14:39:14.085785 jobtabview.go:169: IN JobTab.View()
31 SC: 14:39:14.170649 update.go:405: U(): got SinfoJSON
32 SC: 14:39:14.170705 clustertabtable.go:75: FilterSinfoTable: rows 178
33 SC: 14:39:14.181820 clustertab.go:63: GetStatsFiltered JobClusterTab start
34 SC: 14:39:14.183179 clustertab.go:111: GetStatsFiltered end
35 SC: 14:39:14.183607 view.go:98: Got NO error, insert newline
36 SC: 14:39:14.183622 view.go:105: CALL JobTab.View()
37 SC: 14:39:14.183634 jobtabview.go:169: IN JobTab.View()
38 SC: 14:39:14.275519 command.go:94: Got UserAssoc klcm500 -> chem-acm-2018
39 SC: 14:39:14.275559 command.go:94: Got UserAssoc klcm500 -> its-training-2019
40 SC: 14:39:14.275579 command.go:94: Got UserAssoc klcm500 -> its-devel-2018
41 SC: 14:39:14.275602 command.go:94: Got UserAssoc klcm500 -> its-system-2018
42 SC: 14:39:14.275977 update.go:280: Got UserAssoc msg, value: command.UserAssoc{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
43 SC: 14:39:14.276034 update.go:283: Appended UserAssoc msg go Globals, value now: []string{"chem-acm-2018", "its-training-2019", "its-devel-2018", "its-system-2018"}
44 SC: 14:39:14.276068 update.go:286: Appended UserAssoc msg go Globals, calling GetSacctHist()
45 SC: 14:39:14.276291 jobhisttabcommands.go:38: GetSacctHist("chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018") start: days 7, timeout: 30
46 SC: 14:39:14.276356 jobhisttabcommands.go:50: EXEC: "/usr/bin/sacct" ["-n" "--json" "-S" "now-7days" "-A" "chem-acm-2018,its-training-2019,its-devel-2018,its-system-2018"]
47 SC: 14:39:14.276735 view.go:98: Got NO error, insert newline
48 SC: 14:39:14.276754 view.go:105: CALL JobTab.View()
49 SC: 14:39:14.276768 jobtabview.go:169: IN JobTab.View()
50 SC: 14:39:15.839383 jobhisttabcommands.go:59: EXEC returned: 7578660 bytes
51 SC: 14:39:15.839466 jobhisttabcommands.go:64: Error unmarshall: "invalid character 's' looking for beginning of value"
52 SC: 14:39:15.839543 update.go:262: ERROR msg, from: GetSacctHist
53 SC: 14:39:15.839590 update.go:263: ERROR msg, original error: "invalid character 's' looking for beginning of value"
54 SC: 14:39:15.840095 view.go:95: Got error
55 SC: 14:39:15.840118 view.go:105: CALL JobTab.View()
56 SC: 14:39:15.840131 jobtabview.go:169: IN JobTab.View()
57 SC: 14:39:17.815098 update.go:262: ERROR msg, from: GetSqueue
58 SC: 14:39:17.815400 update.go:263: ERROR msg, original error: "json: cannot unmarshal array into Go struct field V0039JobResources.Jobs.job_resources.allocated_nodes of type map[string]openapi.V0039NodeAllocation"
59 SC: 14:39:17.815628 view.go:95: Got error
60 SC: 14:39:17.815636 view.go:105: CALL JobTab.View()
61 SC: 14:39:17.815641 jobtabview.go:169: IN JobTab.View()
62 SC: 14:39:22.517698 view.go:95: Got error
63 SC: 14:39:22.517735 view.go:105: CALL JobTab.View()
64 SC: 14:39:22.517748 jobtabview.go:169: IN JobTab.View()
65 SC: 14:39:22.521394 view.go:95: Got error
66 SC: 14:39:22.521420 view.go:105: CALL JobTab.View()
67 SC: 14:39:22.521431 jobtabview.go:169: IN JobTab.View()
Running the sacct command outside of scom gives me this error:
sacct: error: _parser_dump: failed on field association: Unable to convert Data type
and then the JSON output. So I guess scom wants to see a { at the beginning of the sacct output but is seeing the 's' from the stderr?
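A minimal Go sketch of that guess, for reference. It assumes the sacct error line ends up in front of the JSON in the bytes handed to encoding/json; the `{"jobs": []}` payload and the names decodeErr, stdoutOnly and merged are made up for illustration and are not taken from scom.

```go
package main

import "encoding/json"

// Hypothetical inputs: a stand-in JSON payload, and the same payload
// preceded by the stderr line sacct printed in this thread.
var (
	stdoutOnly = []byte(`{"jobs": []}`)
	merged     = []byte("sacct: error: _parser_dump: failed on field association: Unable to convert Data type\n" + `{"jobs": []}`)
)

// decodeErr returns the error text json.Unmarshal produces for raw,
// or an empty string when raw decodes cleanly.
func decodeErr(raw []byte) string {
	var v map[string]interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return err.Error()
	}
	return ""
}
```

Here decodeErr(merged) yields the same "invalid character 's' looking for beginning of value" message as log line 51, while decodeErr(stdoutOnly) is empty, which is consistent with the 's' coming from "sacct: error:".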
That is possible, let's just double-check this line:
58 SC: 14:39:17.815400 update.go:263: ERROR msg, original error: "json: cannot unmarshal array into Go struct field V0039JobResources.Jobs.job_resources.allocated_nodes of type map[string]openapi.V0039NodeAllocation"
Can you please cut this part out of the JSON you're getting and give me a sample? This is how it looks for me:
"job_resources": {
  "nodes": "clip-c2-38",
  "allocated_cpus": 1,
  "allocated_hosts": 1,
  "allocated_nodes": {
    "0": {
      "sockets": {
        "1": "unassigned"
      },
      "cores": {
        "17": "unassigned"
      },
      "memory": 1024,
      "cpus": 1
    }
  }
},
In the meantime I'll check the openapi.json for changes; it's quite possible that something changed there as well.
Can you download the build from the PR and give it a try: https://github.com/CLIP-HPC/SlurmCommander/suites/9973575104/artifacts/483581784
I can confirm the same findings with Slurm 22.05.6 and that #2 resolves the issue.
I also get the same warning:
jobhisttabcommands.go:64: Error unmarshall: "invalid character 's' looking for beginning of value"
which I guess is related to https://bugs.schedmd.com/show_bug.cgi?id=15334. sacct --json may report non-fatal errors like sacct: error: ... which will generate JSON parsing errors. Those messages are sent to stderr though, so filtering out stderr and only parsing stdout may be sufficient to solve the issue?
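A rough Go sketch of that idea using os/exec, keeping the two streams in separate buffers and decoding only stdout. The helper name runJSON is hypothetical and this is not necessarily how scom does it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// runJSON runs the command, decodes only its stdout as JSON into out,
// and returns whatever the command wrote to stderr so the caller can
// log it instead of feeding it to the JSON decoder.
func runJSON(out interface{}, name string, args ...string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout // JSON payload only
	cmd.Stderr = &stderr // "sacct: error: ..." style messages land here
	if err := cmd.Run(); err != nil {
		return stderr.String(), fmt.Errorf("%s failed: %w", name, err)
	}
	if err := json.Unmarshal(stdout.Bytes(), out); err != nil {
		return stderr.String(), fmt.Errorf("decoding %s output: %w", name, err)
	}
	return stderr.String(), nil
}
```

With the arguments from log line 46, a call would look like runJSON(&hist, "/usr/bin/sacct", "-n", "--json", "-S", "now-7days", "-A", accounts), where hist and accounts are placeholders. Returning the stderr text keeps the non-fatal sacct messages available for the debug log.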
Great findings and pointing the way :+1:
I've updated the PR to capture only stdout from both the sacct and squeue calls: https://github.com/CLIP-HPC/SlurmCommander/suites/9977768211/artifacts/483890657
Does this work for both of you now (22.05.4 and 22.05.6)?
Yes, the new changes in PR #2 seem to fix the issue for me, no more parsing errors reported. Thanks!
Most excellent, thank you for the troubleshooting effort. I'll merge this and do a release, then tackle the other RFEs.
PS @kcgthb @kilicomu would you have anything against becoming "contributing" people on the about tab?
No problem for me, happy to contribute!
@pja237 No problem for me also.
Your patch seems to resolve the original issue, thanks, but I now see a different error. I'll open a separate issue for that!
Hi,
Just following up with what the program tells me to do:
Tried running the precompiled binary from the releases page. The 'Cluster' features seem to work but the job features don't seem to, e.g.
Let me know if there is any useful troubleshooting I can do.