AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] Remote collectors no longer updating #653

Closed PleaseStopAsking closed 1 week ago

PleaseStopAsking commented 3 weeks ago

Describe the bug

I have been using scrutiny for about a year with no issues, but on May 15th I noticed that my remote collector was no longer updating metrics in my dashboard. The collector's logs showed that every update attempt was failing with an i/o timeout. I tried to kick the collector off manually with docker exec scrutiny-collector /opt/scrutiny/bin/scrutiny-collector-metrics run --debug, but the same failure occurs.

I then sent the same request manually via curl and had no issues, so something within the collector image appears to be causing this behavior, though that is only an assumption at this time.

In addition to curl, I can connect via netcat with no issues: nc -v 100.97.12.45 443

working curl

curl -d '{"data":[{"wwn":"ei82n045010803d5j","device_name":"nvme0","device_uuid":"","device_serial_id":"","device_label":"","manufacturer":"","model_name":"bar","interface_type":"","interface_speed":"","serial_number":"EI82N045010803D5J","firmware":"80000E00","rotational_speed":0,"capacity":256060514304,"form_factor":"","smart_support":false,"device_protocol":"NVMe","device_type":"nvme","label":"","host_id":"foo"}]}' -H "Content-Type: application/json" -X POST https://scrutiny.int.example.dev/api/devices/register

Attempts at fixing this:

Expected behavior

Remote collectors successfully send device metrics to the hub container.

Log Files

Collector Debug Logs

2024/06/06 13:14:19 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.8.1

time="2024-06-06T13:14:19Z" level=debug msg="{\n\t\"api\": {\n\t\t\"endpoint\": \"https://scrutiny.int.example.dev/\"\n\t},\n\t\"commands\": {\n\t\t\"metrics_info_args\": \"--info --json\",\n\t\t\"metrics_scan_args\": \"--scan --json\",\n\t\t\"metrics_smart_args\": \"--xall --json\",\n\t\t\"metrics_smartctl_bin\": \"smartctl\"\n\t},\n\t\"devices\": [],\n\t\"host\": {\n\t\t\"id\": \"Data\"\n\t},\n\t\"log\": {\n\t\t\"file\": \"\",\n\t\t\"level\": \"DEBUG\"\n\t}\n}<nil>" type=metrics
time="2024-06-06T13:14:19Z" level=info msg="Verifying required tools" type=metrics
time="2024-06-06T13:14:19Z" level=info msg="Executing command: smartctl --scan --json" type=metrics
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-5.15.0-107-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "--json"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/sda",
      "info_name": "/dev/sda",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sdb",
      "info_name": "/dev/sdb",
      "type": "scsi",
      "protocol": "SCSI"
    }
  ]
}
time="2024-06-06T13:14:19Z" level=info msg="Executing command: smartctl --info --json /dev/sdb" type=metrics
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-5.15.0-107-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--info",
      "--json",
      "/dev/sdb"
    ],
    "drive_database_version": {
      "string": "7.3/5319"
    },
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1717679659,
    "asctime": "Thu Jun  6 13:14:19 2024 UTC"
  },
  "device": {
    "name": "/dev/sdb",
    "info_name": "/dev/sdb [SAT]",
    "type": "sat",
    "protocol": "ATA"
  },
  "model_name": "Vaseky V850/128G",
  "serial_number": "AA000000000000002173",
  "firmware_version": "U0524A0",
  "user_capacity": {
    "blocks": 250069680,
    "bytes": 128035676160
  },
  "logical_block_size": 512,
  "physical_block_size": 512,
  "rotation_rate": 0,
  "form_factor": {
    "ata_value": 6,
    "name": "mSATA"
  },
  "trim": {
    "supported": true,
    "deterministic": false,
    "zeroed": false
  },
  "in_smartctl_database": false,
  "ata_version": {
    "string": "ACS-3 T13/2161-D revision 4",
    "major_value": 2040,
    "minor_value": 283
  },
  "sata_version": {
    "string": "SATA 3.2",
    "value": 255
  },
  "interface_speed": {
    "max": {
      "sata_value": 14,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    },
    "current": {
      "sata_value": 3,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    }
  },
  "smart_support": {
    "available": true,
    "enabled": true
  }
}
time="2024-06-06T13:14:19Z" level=info msg="Using WWN Fallback" type=metrics
time="2024-06-06T13:14:19Z" level=debug msg="WWN is empty, falling back to serial number: AA000000000000002173" type=metrics
time="2024-06-06T13:14:19Z" level=info msg="Executing command: smartctl --info --json /dev/sda" type=metrics
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-5.15.0-107-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--info",
      "--json",
      "/dev/sda"
    ],
    "drive_database_version": {
      "string": "7.3/5319"
    },
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1717679659,
    "asctime": "Thu Jun  6 13:14:19 2024 UTC"
  },
  "device": {
    "name": "/dev/sda",
    "info_name": "/dev/sda [SAT]",
    "type": "sat",
    "protocol": "ATA"
  },
  "model_family": "Seagate Barracuda 2.5 5400",
  "model_name": "ST2000LM015-2E8174",
  "serial_number": "ZDZLSRZY",
  "wwn": {
    "naa": 5,
    "oui": 3152,
    "id": 3831876119
  },
  "firmware_version": "0001",
  "user_capacity": {
    "blocks": 3907029168,
    "bytes": 2000398934016
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "rotation_rate": 5400,
  "form_factor": {
    "ata_value": 3,
    "name": "2.5 inches"
  },
  "trim": {
    "supported": true,
    "deterministic": false,
    "zeroed": false
  },
  "in_smartctl_database": true,
  "ata_version": {
    "string": "ACS-3 T13/2161-D revision 3b",
    "major_value": 2016,
    "minor_value": 31
  },
  "sata_version": {
    "string": "SATA 3.1",
    "value": 127
  },
  "interface_speed": {
    "max": {
      "sata_value": 14,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    },
    "current": {
      "sata_value": 3,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    }
  },
  "smart_support": {
    "available": true,
    "enabled": true
  }
}
time="2024-06-06T13:14:19Z" level=info msg="Generating WWN" type=metrics
time="2024-06-06T13:14:19Z" level=debug msg="NAA: 5 OUI: 3152 Id: 3831876119 => WWN: 0x5000c500e465ca17" type=metrics
time="2024-06-06T13:14:19Z" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2024-06-06T13:14:19Z" level=debug msg="Detected devices: [{\"wwn\":\"aa000000000000002173\",\"device_name\":\"sdb\",\"device_uuid\":\"\",\"device_serial_id\":\"ata-Vaseky_V850/128G_AA000000000000002173\",\"device_label\":\"\",\"manufacturer\":\"\",\"model_name\":\"Vaseky V850/128G\",\"interface_type\":\"\",\"interface_speed\":\"6.0 Gb/s\",\"serial_number\":\"AA000000000000002173\",\"firmware\":\"U0524A0\",\"rotational_speed\":0,\"capacity\":128035676160,\"form_factor\":\"mSATA\",\"smart_support\":false,\"device_protocol\":\"ATA\",\"device_type\":\"sat\",\"label\":\"\",\"host_id\":\"Data\"},{\"wwn\":\"0x5000c500e465ca17\",\"device_name\":\"sda\",\"device_uuid\":\"52d3618d-9eb9-49b3-907b-d1bf50ead856\",\"device_serial_id\":\"ata-ST2000LM015-2E8174_ZDZLSRZY\",\"device_label\":\"store\",\"manufacturer\":\"\",\"model_name\":\"ST2000LM015-2E8174\",\"interface_type\":\"\",\"interface_speed\":\"6.0 Gb/s\",\"serial_number\":\"ZDZLSRZY\",\"firmware\":\"0001\",\"rotational_speed\":5400,\"capacity\":2000398934016,\"form_factor\":\"2.5 inches\",\"smart_support\":false,\"device_protocol\":\"ATA\",\"device_type\":\"sat\",\"label\":\"\",\"host_id\":\"Data\"}]" type=metrics
2024/06/06 13:14:49 ERROR: Post "https://scrutiny.int.example.dev/api/devices/register": dial tcp 100.97.12.45:443: i/o timeout
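
For readers following the "Generating WWN" debug line in the log above: the reported value is consistent with the standard NAA 5 layout, in which a 4-bit NAA, a 24-bit OUI and a 36-bit vendor-assigned id are packed into one 64-bit number. The Python sketch below reproduces that arithmetic with the values from the log. The names build_wwn and fallback_wwn are hypothetical and are not taken from the scrutiny codebase, and the lowercase serial fallback is only inferred from the "Detected devices" line.

```python
def build_wwn(naa: int, oui: int, unique_id: int) -> str:
    """Pack NAA (4 bits), OUI (24 bits) and id (36 bits) into a 64-bit WWN string."""
    if not (0 <= naa < 1 << 4 and 0 <= oui < 1 << 24 and 0 <= unique_id < 1 << 36):
        raise ValueError("field out of range for an NAA 5 identifier")
    value = (naa << 60) | (oui << 36) | unique_id
    return f"0x{value:016x}"


def fallback_wwn(serial_number: str) -> str:
    """Assumed fallback when smartctl reports no WWN: the lowercased serial number."""
    return serial_number.lower()


# Values reported by smartctl for /dev/sda in the log above.
print(build_wwn(5, 3152, 3831876119))  # prints 0x5000c500e465ca17
```

The result matches the WWN in the debug line, so device identification looks fine on the collector side and the failure is confined to the later network step.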

Compose

scrutiny-collector:
    image: ghcr.io/analogj/scrutiny:v0.8.1-collector
    hostname: scrutiny-collector
    container_name: scrutiny-collector
    restart: unless-stopped
    cap_add:
      - SYS_RAWIO
    devices:
      - /dev/sda
      - /dev/sdb
    volumes:
      - /run/udev:/run/udev:ro
    environment:
      COLLECTOR_API_ENDPOINT: ${collector_endpoint}
      COLLECTOR_HOST_ID: Data
      COLLECTOR_CRON_SCHEDULE: "30 12 * * *"

Docker Info

Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-107-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.681GiB
 Name: data
 ID: NRMH:GG63:T7FU:UDBT:5BRN:PVDS:K4CF:VUJC:64GN:DB3W:SNQL:M73L
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

PleaseStopAsking commented 1 week ago

Update: this issue is still occurring. I have now simplified the setup as much as possible by removing DNS records, Tailscale, and TLS from the equation, unfortunately with no change.

I manually installed curl and dnsutils inside the collector container. The output of nslookup is correct, but all attempts at connecting via netcat fail.
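
Since name resolution works while the TCP connection does not, it can help to test the two stages separately from the same network namespace as the collector. The following is a minimal Python sketch of such a check, not part of scrutiny; the function name probe and the shape of its return value are assumptions made for illustration.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve host, then attempt a TCP connect, reporting which stage failed."""
    result = {"resolved": None, "connected": False, "error": None}
    try:
        # Stage 1: DNS lookup (what nslookup verifies).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = infos[0][4][0]
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result
    try:
        # Stage 2: TCP handshake (what nc verifies); a timeout is reported here.
        with socket.create_connection((result["resolved"], port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"tcp: {exc}"
    return result


# Example call, using the endpoint from this issue:
# print(probe("scrutiny.int.example.dev", 443))
```

A result with an address in "resolved" but an error beginning with "tcp:" would match the symptom described here: DNS is fine and the connection itself is being dropped or blocked.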

As a final test, I installed the collector manually on the remote host and have zero issues sending metrics to scrutiny-web running in Docker. This at least isolates the issue to the remote collector container.

PleaseStopAsking commented 1 week ago

Closing this out, as it unfortunately appears to be entirely related to Tailscale. I thought I had ruled it out, but clearly not. https://github.com/tailscale/tailscale/issues/12070#issuecomment-2102571116