ccfos / nightingale

An all-in-one observability solution which aims to combine the advantages of Prometheus and Grafana. It manages alert rules and visualizes metrics, logs, traces in a beautiful web UI.
https://flashcat.cloud/docs/
Apache License 2.0

Running kill from the task page causes categraf to panic and exit #1987

Closed rayn316 closed 2 weeks ago

rayn316 commented 3 months ago

Your config.toml

1

Relevant logs

Jun 11 18:07:55 categraf[1960524]: 2024/06/11 18:07:55 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:07:56 categraf[1960524]: 2024/06/11 18:07:56 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:07:57 categraf[1960524]: 2024/06/11 18:07:57 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:07:58 categraf[1960524]: 2024/06/11 18:07:58 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:07:59 categraf[1960524]: 2024/06/11 18:07:59 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:08:00 categraf[1960524]: 2024/06/11 18:08:00 heartbeat.go:64: I! assigned tasks: [14095]
Jun 11 18:08:00 categraf[1960524]: 2024/06/11 18:08:00 task.go:343: D! begin kill process of task[14095]
Jun 11 18:08:00 categraf[1960524]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 11 18:08:00 categraf[1960524]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0xbedb3d]
Jun 11 18:08:00 categraf[1960524]: goroutine 6733953 [running]:
Jun 11 18:08:00 categraf[1960524]: flashcat.cloud/categraf/ibex.CmdKill(...)
Jun 11 18:08:00 categraf[1960524]:         /home/runner/work/categraf/categraf/ibex/cmd_nix.go:16
Jun 11 18:08:00 categraf[1960524]: flashcat.cloud/categraf/ibex.killProcess(0xc00250c0d0)
Jun 11 18:08:00 categraf[1960524]:         /home/runner/work/categraf/categraf/ibex/task.go:345 +0x11d
Jun 11 18:08:00 categraf[1960524]: created by flashcat.cloud/categraf/ibex.(*Task).kill in goroutine 401
Jun 11 18:08:00 categraf[1960524]:         /home/runner/work/categraf/categraf/ibex/task.go:299 +0x4f
Jun 11 18:08:00 systemd[1]: categraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

System info

categraf v0.3.69

Steps to reproduce

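  1. The script did not appear to run on the servers (no script process was present on them) and the task then timed out, so I first paused or skipped the servers whose execution had timed out or failed.
  2. One last server had still not executed, so I clicked kill for all hosts. This crashed categraf on the servers that had been paused or skipped above.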
  3. ...

Expected behavior

The task runs normally, and categraf keeps running after kill is executed.

Actual behavior

categraf crashes after kill is executed.

Additional info

No response

rayn316 commented 3 months ago

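I don't know what happened, but after this error the affected categraf instances would not start again. I then found that the categraf directory /usr/local/categraf was gone; it looks like it was deleted. I'm not sure whether the kill error caused this.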

kongfei605 commented 3 months ago

What does the script execute? Does it delete /usr/local/categraf? What is the task content?

rayn316 commented 3 months ago

The script does not delete /usr/local/categraf; something else may have done it.

rayn316 commented 3 months ago

Sometimes categraf reports the error below, so uploading the task result keeps failing, and the task in Nightingale stays in the running state.

Jun 13 12:00:06 categraf[1280617]: 2024/06/13 12:00:06 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
Jun 13 12:00:08 categraf[1280617]: 2024/06/13 12:00:08 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
Jun 13 12:00:09 categraf[1280617]: 2024/06/13 12:00:09 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
Jun 13 12:00:11 categraf[1280617]: 2024/06/13 12:00:11 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
Jun 13 12:00:13 categraf[1280617]: 2024/06/13 12:00:13 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
Jun 13 12:00:14 categraf[1280617]: 2024/06/13 12:00:14 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\x82\xE9\x94\x99\xE8\xAF...' for column 'stdout' at row 1
rayn316 commented 3 months ago

It only recovers after I manually reset the task's output file and restart categraf.

UlricQin commented 3 months ago

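This error looks like the table charset in the database is wrong. Check the one hundred tables that store the results; if they are latin1 there will be problems, and you can change them to utf8mb4.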

rayn316 commented 3 months ago

The query shows every table is utf8mb4_0900_ai_ci:

mysql> SELECT table_name, table_collation
    -> FROM information_schema.TABLES
    -> WHERE table_schema = 'n9e_v6';
+-----------------------+--------------------+
| TABLE_NAME            | TABLE_COLLATION    |
+-----------------------+--------------------+
| alert_aggr_view       | utf8mb4_0900_ai_ci |
| alert_cur_event       | utf8mb4_0900_ai_ci |
| alert_his_event       | utf8mb4_0900_ai_ci |
| alert_mute            | utf8mb4_0900_ai_ci |
| alert_rule            | utf8mb4_0900_ai_ci |
| alert_subscribe       | utf8mb4_0900_ai_ci |
| alerting_engines      | utf8mb4_0900_ai_ci |
| board                 | utf8mb4_0900_ai_ci |
| board_busigroup       | utf8mb4_0900_ai_ci |
| board_payload         | utf8mb4_0900_ai_ci |
| builtin_cate          | utf8mb4_0900_ai_ci |
| builtin_components    | utf8mb4_0900_ai_ci |
| builtin_metrics       | utf8mb4_0900_ai_ci |
| builtin_payloads      | utf8mb4_0900_ai_ci |
| busi_group            | utf8mb4_0900_ai_ci |
| busi_group_member     | utf8mb4_0900_ai_ci |
| chart                 | utf8mb4_0900_ai_ci |
| chart_group           | utf8mb4_0900_ai_ci |
| chart_share           | utf8mb4_0900_ai_ci |
| configs               | utf8mb4_0900_ai_ci |
rayn316 commented 3 months ago

It still happens. Sometimes a task times out, and when I check categraf on the server it keeps reporting this error:

Jun 18 10:49:20 categraf[1232721]: 2024/06/18 10:49:20 heartbeat.go:48: E! error from server: Error 1366: Incorrect string value: '\xB6\xE8\xBF\x9F: ...' for column 'stdout' at row 1

UlricQin commented 3 months ago

Then I have no other ideas. As far as I know this error is a character-set problem, and GPT gave a similar answer:

image

Or perhaps your script's output is not valid UTF-8.

image
rayn316 commented 3 months ago

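The output text looks the same as the logs from other nodes that were written successfully; I can't see any special characters.

Could categraf skip or ignore strings that cannot be written, for example by replacing them with a placeholder such as []?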

rayn316 commented 3 months ago

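Otherwise, every time special characters show up we have to track down a character-set problem and restart categraf manually, which is not workable.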

UlricQin commented 3 months ago

This is server-side logic; the server is responsible for writing to the database. We can add this fault tolerance in ibex later.

rayn316 commented 3 months ago

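After downloading the output and inserting it into the database manually, I got an error saying the size was exceeded. The log is probably larger than the text type allows; setting the column to longtext might stop the write from failing.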

mysql> SHOW COLUMNS FROM task_host_0;
+--------+-----------------+------+-----+---------+----------------+
| Field  | Type            | Null | Key | Default | Extra          |
+--------+-----------------+------+-----+---------+----------------+
| ii     | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| id     | bigint unsigned | NO   | MUL | NULL    |                |
| host   | varchar(128)    | NO   |     | NULL    |                |
| status | varchar(32)     | NO   |     | NULL    |                |
| stdout | text            | YES  |     | NULL    |                |
| stderr | text            | YES  |     | NULL    |                |
+--------+-----------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)
rayn316 commented 2 months ago

I submitted the change in https://github.com/ccfos/nightingale/pull/2027, but I couldn't find where the default n9e.sql file is. Please update it when you have time.

UlricQin commented 2 weeks ago

https://github.com/flashcatcloud/categraf/commit/e84edc7a1fa925b52ed77b9cc0950c1a7ed999e1