AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License
5.3k stars 170 forks source link

[BUG] ApiServerCommunicationError #220

Closed mariusberget92 closed 2 years ago

mariusberget92 commented 2 years ago

Describe the bug Followed the manual installation documentation and did everything as per the docs. I do see the web app from the browser, so that works fine and is setup to run at start as a service.

When trying to run the collector though; I get this error:

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl --scan -j         type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdh  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sda  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdb  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdc  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdd  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sde  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdf  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdg  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Sending detected devices to API, for filtering & validation  type=metrics
ERRO[0000] An error occurred while retrieving filtered devices  type=metrics
2022/03/29 19:34:48 ERROR: ApiServerCommunicationError: "An error occurred while retrieving filtered devices"

I double checked that the port was open (6363 as I configured it to use). I can do an empty POST request using curl and that gives me back the website source code as a response. Don't know if I did something wrong or something. Thanks in advance.

Log Files Debug file attached. collector.log

SolomonSklash commented 2 years ago

I'm having this exact issue a well.

mariusberget92 commented 2 years ago

I'm having this exact issue a well.

Yea, I even tried the default port and it didn't work either.

AnalogJ commented 2 years ago

Looks like I may need to add some more debug level logging to the collector when it pushes data to the api server. The logs don't really specify if the error was due to communication issue or a failed response from the api server.

Do you have any logs from the api server? can you determine if the api server received a post request from the collector around the same time?

mariusberget92 commented 2 years ago

Looks like I may need to add some more debug level logging to the collector when it pushes data to the api server. The logs don't really specify if the error was due to communication issue or a failed response from the api server.

Do you have any logs from the api server? can you determine if the api server received a post request from the collector around the same time?

I have no idea. I could test that if I got some guidance. 😂 But I know the website runs on the specified port since I can access it. It's just empty without any devices since there is an API communication error.

sillmnvg commented 2 years ago

Got the same issue when I tried to spin it up on my server (also used my own port instead of the default, switched back to 8080 in an attempt to troubleshoot)

AnalogJ commented 2 years ago

can you paste your config file/environmental variables/docker-compose file? @sillmnvg @mariusberget92

mariusberget92 commented 2 years ago
scrutiny:
    image: ghcr.io/analogj/scrutiny:master-omnibus
    container_name: scrutiny
    cap_add:
      - SYS_RAWIO
    volumes:
      - /run/udev:/run/udev:ro
      - /home/anoneemo/docker/scrutiny/config:/opt/scrutiny/config
      - /home/anoneemo/docker/scrutiny/influxdb:/opt/scrutiny/influxdb
    ports:
      - '8080:8080'
      - '8086:8086'
    devices:
      - '/dev/sda'
      - '/dev/sdb'
      - '/dev/sdc'
      - '/dev/sdd'
      - '/dev/sde'
      - '/dev/sdf'
      - '/dev/sdg'
      - '/dev/sdh'
    restart: always
AnalogJ commented 2 years ago

do you have a config file in /home/anoneemo/docker/scrutiny/config ?

mariusberget92 commented 2 years ago

do you have a config file in /home/anoneemo/docker/scrutiny/config ?

No config. only a database file.

AnalogJ commented 2 years ago

Can you enable DEBUG mode and log files using one of the two methods below

docker run command:


-e DEBUG=true \
-e COLLECTOR_LOG_FILE= /opt/scrutiny/config/collector.log \
-e SCRUTINY_LOG_FILE=/opt/scrutiny/config/web.log \

docker-compose:

environment:
    DEBUG: true
    COLLECTOR_LOG_FILE: /opt/scrutiny/config/collector.log
    SCRUTINY_LOG_FILE: /opt/scrutiny/config/web.log

the log files will be written to the mount path of your config directory on your host. Just attach them to this issue.

mariusberget92 commented 2 years ago

Oh snap, sorry, been on autopilot when replying. I switched to docker-compose, and that works (the master-omnibus tag atleast).

Can you enable DEBUG mode and log files using one of the two methods below

@sillmnvg will need to help out there since it works on my machine. 😆

AnalogJ commented 2 years ago

hm. ok in that case I'll close this issue.

@sillmnvg if you're still running into this problem, please comment/reopen this issue :)

KF5JWC commented 2 years ago

I've just run into this!

Restarting scrutiny (the [scrutiny+influxdb] pod in my case) fixed this issue


For posterity I am leaving my logs.

TL;DR: database is locked (5) (SQLITE_BUSY)

Client/Collector:

time="2022-08-01T18:55:02-05:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2022-08-01T18:55:07-05:00" level=error msg="An error occurred while retrieving filtered devices" type=metrics
2022/08/01 18:55:07 ERROR: ApiServerCommunicationError: "An error occurred while retrieving filtered devices"

Server:

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[1.007ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.13","2022-08-01 18:55:07.13",NULL,"","nvme1n1","","","","","","","","","",0,,"",false,"NVMe","nvme","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.509ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.131","2022-08-01 18:55:07.131",NULL,"","nvme0","","","","","","","","","",0,,"",false,"NVMe","nvme","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.174ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.132","2022-08-01 18:55:07.132",NULL,"","nvme1","","","","","","","","","",0,,"",false,"NVMe","nvme","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.225ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.132","2022-08-01 18:55:07.132",NULL,"","sata1","","","","","","","6.0 Gb/s","","",7200,,"3.5 inches",false,"ATA","sat","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.213ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.133","2022-08-01 18:55:07.133",NULL,"","sata2","","","","","","","6.0 Gb/s","","",7200,,"3.5 inches",false,"ATA","sat","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.189ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.133","2022-08-01 18:55:07.133",NULL,"","sata3","","","","","","","6.0 Gb/s","","",5425,,"3.5 inches",false,"ATA","sat","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.219ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.134","2022-08-01 18:55:07.134",NULL,"","sata4","","","","","","","6.0 Gb/s","","",5425,,"3.5 inches",false,"ATA","sat","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`

2022/08/01 18:55:07 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:23 database is locked (5) (SQLITE_BUSY)
[0.215ms] [rows:0] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`device_name`,`device_uuid`,`device_serial_id`,`device_label`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`,`label`,`host_id`,`device_status`) VALUES ("2022-08-01 18:55:07.134","2022-08-01 18:55:07.134",NULL,"","nvme0n1","","","","","","","","","",0,,"",false,"NVMe","nvme","","synology.local","0") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`,`device_uuid`=`excluded`.`device_uuid`,`device_serial_id`=`excluded`.`device_serial_id`,`device_label`=`excluded`.`device_label`
time="2022-08-01T18:55:07-05:00" level=error msg="An error occurred while registering devices [database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY) database is locked (5) (SQLITE_BUSY)]"
time="2022-08-01T18:55:07-05:00" level=error msg="10.2.2.1 - scrutiny-5894d5497b-xpj94 [01/Aug/2022:18:55:07 -0500] \"POST /api/devices/register\" 500 17 \"\" \"Go-http-client/2.0\" (6ms)" clientIP=10.2.2.1 hostname=scrutiny-5894d5497b-xpj94 latency=6 method=POST path=/api/devices/register referer= respLength=17 statusCode=500 userAgent=Go-http-client/2.0
KF5JWC commented 2 years ago

I can sorta reproduce this. It waffled between 2 database errors (lock, nested transaction).

I have 3 clients/collectors. They all share a NTP server, and they all run at the top of the minute (for now).

It looks like clients are not "safely" served concurrently:

SQL logic error: cannot start a transaction within a transaction (1)

Including this because it may be relevant for the environment I am running in:

2022/08/01 18:19:27 /go/src/github.com/analogj/scrutiny/webapp/backend/pkg/database/scrutiny_repository_device.go:33 SLOW SQL >= 200ms
[1300.125ms] [rows:11] SELECT * FROM `devices`
KF5JWC commented 2 years ago

Spacing each of them out by 2 minutes works fine.

time="2022-08-01T22:00:01-05:00" level=info msg="10.2.2.1 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:00:01 -0500] \"POST /api/devices/register\" 200 1655 \"\" \"Go-http-client/2.0\" (16ms)" clientIP=10.2.2.1 hostname=scrutiny-5894d5497b-mz2ml latency=16 method=POST ...
time="2022-08-01T22:00:02-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:00:02 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (148ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=148 method=POST ...
time="2022-08-01T22:00:02-05:00" level=info msg="10.2.2.1 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:00:02 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (98ms)" clientIP=10.2.2.1 hostname=scrutiny-5894d5497b-mz2ml latency=98 method=POST ...
time="2022-08-01T22:00:02-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:00:02 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (87ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=87 method=POST ...
time="2022-08-01T22:02:01-05:00" level=info msg="10.2.2.1 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:02:01 -0500] \"POST /api/devices/register\" 200 1141 \"\" \"Go-http-client/2.0\" (3ms)" clientIP=10.2.2.1 hostname=scrutiny-5894d5497b-mz2ml latency=3 method=POST ...
time="2022-08-01T22:02:04-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:02:04 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1927ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=1927 method=POST ...
time="2022-08-01T22:02:06-05:00" level=info msg="10.2.2.1 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:02:06 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1743ms)" clientIP=10.2.2.1 hostname=scrutiny-5894d5497b-mz2ml latency=1743 method=POST ...
time="2022-08-01T22:04:02-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:02 -0500] \"POST /api/devices/register\" 200 4384 \"\" \"Go-http-client/2.0\" (141ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=141 method=POST ...
time="2022-08-01T22:04:04-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:04 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1634ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=1634 method=POST ...
time="2022-08-01T22:04:06-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:06 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1904ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=1904 method=POST ...
time="2022-08-01T22:04:06-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:06 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (56ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=56 method=POST ...
time="2022-08-01T22:04:07-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:07 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (203ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=203 method=POST ...
time="2022-08-01T22:04:07-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:07 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (108ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=108 method=POST ...
time="2022-08-01T22:04:07-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:07 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (69ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=69 method=POST ...
time="2022-08-01T22:04:09-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:09 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1558ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=1558 method=POST ...
time="2022-08-01T22:04:11-05:00" level=info msg="1.2.3.4 - scrutiny-5894d5497b-mz2ml [01/Aug/2022:22:04:11 -0500] \"POST /apidevice/abcdef/smart\" 200 16 \"\" \"Go-http-client/2.0\" (1596ms)" clientIP=1.2.3.4 hostname=scrutiny-5894d5497b-mz2ml latency=1596 method=POST ...

I think it might be notable that there are multiple requests which have a latency >1500ms, which may be leading to the SQLITE_BUSY error/timeout.