cea-hpc / robinhood

Robinhood Policy Engine : a versatile tool to monitor filesystem contents and schedule actions on filesystem entries.
http://robinhood.sf.net
Other
177 stars 60 forks source link

[policies] Error retrieving list of candidates from database #126

Closed garadar closed 1 year ago

garadar commented 1 year ago

Hi all,

I don't understand why my policies are canceled. The first scan was OK and policies were canceled during the second.

Restarting the service (robinhood@scratch) does not change anything. Destroying the DB and rebuilding from scratch works (but again policies are canceled)

(cluster)-[root@xxxx ~]$ rpm -qa |grep robinhood
robinhood-posix-3.1.7-1.el7.x86_64
robinhood-webgui-3.1.7-1.x86_64
robinhood-adm-3.1.7-1.x86_64

/var/log/robinhood/scratch/robinhood.log

2022/09/25 19:39:32 [10653/1] STATS | ==================== Dumping stats at 2022/09/25 19:39:32 =====================
2022/09/25 19:39:32 [10653/1] STATS | ======== General statistics =========
2022/09/25 19:39:32 [10653/1] STATS | Daemon start time: 2022/09/20 14:24:27
2022/09/25 19:39:32 [10653/1] STATS | Started modules: scan,policy_run(alert)
2022/09/25 19:39:32 [10653/1] STATS | ======== FS scan statistics =========
2022/09/25 19:39:32 [10653/1] STATS | last scan  = 2022/09/24 10:25:09
2022/09/25 19:39:32 [10653/1] STATS | duration    = 3d 20h 42s (331242 s)
2022/09/25 19:39:32 [10653/1] STATS | status      = complete
2022/09/25 19:39:32 [10653/1] STATS | current scan interval = 1.0d
2022/09/25 19:39:32 [10653/1] STATS | scan is running:
2022/09/25 19:39:32 [10653/1] STATS |      started at : 2022/09/25 10:25:45 (9.2h ago)
2022/09/25 19:39:32 [10653/1] STATS |      last action: 2022/09/25 19:39:32 (00s ago)
2022/09/25 19:39:32 [10653/1] STATS |      progress   : 37581642 entries scanned (0 errors)
2022/09/25 19:39:32 [10653/1] STATS |      inst. speed (potential):   1166.18 entries/sec (1.72 ms/entry/thread)
2022/09/25 19:39:32 [10653/1] STATS |      avg. speed  (effective):   1131.06 entries/sec (1.74 ms/entry/thread)
2022/09/25 19:39:32 [10653/1] STATS | ==== EntryProcessor Pipeline Stats ===
2022/09/25 19:39:32 [10653/1] STATS | Idle threads: 12
2022/09/25 19:39:32 [10653/1] STATS | Id constraints count: 40 (hash min=0/max=1/avg=0.0)
2022/09/25 19:39:32 [10653/1] STATS | Name constraints count: 40 (hash min=0/max=1/avg=0.0)
2022/09/25 19:39:32 [10653/1] STATS | Stage              | Wait | Curr | Done |     Total | ms/op |
2022/09/25 19:39:32 [10653/1] STATS |  0: GET_FID        |    0 |    0 |    0 |         0 |  0.00 |
2022/09/25 19:39:32 [10653/1] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 |   1223152 |  0.46 |
2022/09/25 19:39:32 [10653/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 |   1223154 |  0.03 |
2022/09/25 19:39:32 [10653/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 |   1223154 |  0.00 |
2022/09/25 19:39:32 [10653/1] STATS |  4: DB_APPLY       |   23 |   17 |    0 |   1223211 |  0.73 | 99.01% batched (avg batch size: 31.3)
2022/09/25 19:39:32 [10653/1] STATS |  5: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |
2022/09/25 19:39:32 [10653/1] STATS | DB ops: get=36317212/ins=261907201/upd=37237011/rm=0
2022/09/25 19:39:32 [10653/1] STATS | --- Pipeline stage details ---
2022/09/25 19:39:32 [10653/1] STATS | DB_APPLY      : first: D8E7E870/12701573683717011039, status=processing
2022/09/25 19:39:32 [10653/1] STATS | DB_APPLY      : last: D8E7E870/3000079987675090840, status=waiting
2022/09/25 19:39:32 [10653/1] STATS | ======= alert policy: trigger stats ======
2022/09/25 19:39:32 [10653/1] STATS | Trigger #0 (scheduled)        : running.
2022/09/25 19:39:32 [10653/1] STATS |     last run: none
2022/09/25 19:39:32 [10653/1] STATS |     total   : none
2022/09/25 19:39:32 [10653/1] STATS | Trigger #1 (scheduled)        : OK (last check: 2022/09/21 00:21:15).
2022/09/25 19:39:32 [10653/1] STATS |     last run: 2951693 entries, total volume 11.32 TB (24319392078 blocks)
2022/09/25 19:39:32 [10653/1] STATS |     total   : 2951693 entries, total volume 11.32 TB (24319392078 blocks)
2022/09/25 19:39:32 [10653/1] STATS | ======= alert policy: action stats ======
2022/09/25 19:39:32 [10653/1] STATS | idle threads       = 9
2022/09/25 19:39:32 [10653/1] STATS | queued entries     = 0
2022/09/25 19:39:32 [10653/1] STATS | action status:
2022/09/25 19:39:32 [10653/1] STATS |     action successful              = 21659078
2022/09/25 19:39:32 [10653/1] STATS |     moved or deleted since last update = 9443
2022/09/25 19:39:32 [10653/1] STATS |     whitelisted/ignored            = 16
2022/09/25 19:39:32 [10653/1] STATS |     incomplete metadata            = 46
2022/09/25 19:39:32 [10653/1] STATS |     action error                   = 19130
2022/09/25 19:39:32 [10653/1] STATS | 21659078 actions successful/21687713, 63.95 TB (137351906236 blocks)
2022/09/25 19:39:32 [10653/1] STATS | last action queued     0 s ago
2022/09/25 19:39:32 [10653/1] STATS | last action started    0 s ago
2022/09/25 19:39:32 [10653/1] STATS | last action completed  0 s ago
2022/09/25 19:47:13 [10653/4] alert | Policy is running (started 4d 18h 25min 57s ago): 18771561 actions succeeded (45.57/sec); volume: 52.64 TB (133.99 MB/sec); skipped: 9868; errors: 18565

2022/09/25 19:54:32 [10653/1] STATS | ==================== Dumping stats at 2022/09/25 19:54:32 =====================
2022/09/25 19:54:32 [10653/1] STATS | ======== General statistics =========
2022/09/25 19:54:32 [10653/1] STATS | Daemon start time: 2022/09/20 14:24:27
2022/09/25 19:54:32 [10653/1] STATS | Started modules: scan,policy_run(alert)
2022/09/25 19:54:32 [10653/1] STATS | ======== FS scan statistics =========
2022/09/25 19:54:32 [10653/1] STATS | last scan  = 2022/09/24 10:25:09
2022/09/25 19:54:32 [10653/1] STATS | duration    = 3d 20h 42s (331242 s)
2022/09/25 19:54:32 [10653/1] STATS | status      = complete
2022/09/25 19:54:32 [10653/1] STATS | current scan interval = 1.0d
2022/09/25 19:54:32 [10653/1] STATS | scan is running:
2022/09/25 19:54:32 [10653/1] STATS |      started at : 2022/09/25 10:25:45 (9.5h ago)
2022/09/25 19:54:32 [10653/1] STATS |      last action: 2022/09/25 19:54:31 (01s ago)
2022/09/25 19:54:32 [10653/1] STATS |      progress   : 38221348 entries scanned (0 errors)
2022/09/25 19:54:32 [10653/1] STATS |      inst. speed (potential):     45.20 entries/sec (44.25 ms/entry/thread)
2022/09/25 19:54:32 [10653/1] STATS |      avg. speed  (effective):   1119.97 entries/sec (1.76 ms/entry/thread)
2022/09/25 19:54:32 [10653/1] STATS | ==== EntryProcessor Pipeline Stats ===
2022/09/25 19:54:32 [10653/1] STATS | Idle threads: 16
2022/09/25 19:54:32 [10653/1] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2022/09/25 19:54:32 [10653/1] STATS | Name constraints count: 0 (hash min=0/max=0/avg=0.0)
2022/09/25 19:54:32 [10653/1] STATS | Stage              | Wait | Curr | Done |     Total | ms/op |
2022/09/25 19:54:32 [10653/1] STATS |  0: GET_FID        |    0 |    0 |    0 |         0 |  0.00 |
2022/09/25 19:54:32 [10653/1] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 |    639611 |  0.77 |
2022/09/25 19:54:32 [10653/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 |    639611 |  0.08 |
2022/09/25 19:54:32 [10653/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 |    639611 |  0.00 |
2022/09/25 19:54:32 [10653/1] STATS |  4: DB_APPLY       |    0 |    0 |    0 |    639651 |  1.01 | 95.94% batched (avg batch size: 18.7)
2022/09/25 19:54:32 [10653/1] STATS |  5: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |
2022/09/25 19:54:32 [10653/1] STATS | DB ops: get=36950100/ins=261907201/upd=37876662/rm=0
2022/09/25 19:54:32 [10653/1] STATS | ======= alert policy: trigger stats ======
2022/09/25 19:54:32 [10653/1] STATS | Trigger #0 (scheduled)        : running.
2022/09/25 19:54:32 [10653/1] STATS |     last run: none
2022/09/25 19:54:32 [10653/1] STATS |     total   : none
2022/09/25 19:54:32 [10653/1] STATS | Trigger #1 (scheduled)        : OK (last check: 2022/09/21 00:21:15).
2022/09/25 19:54:32 [10653/1] STATS |     last run: 2951693 entries, total volume 11.32 TB (24319392078 blocks)
2022/09/25 19:54:32 [10653/1] STATS |     total   : 2951693 entries, total volume 11.32 TB (24319392078 blocks)
2022/09/25 19:54:32 [10653/1] STATS | ======= alert policy: action stats ======
2022/09/25 19:54:32 [10653/1] STATS | idle threads       = 16
2022/09/25 19:54:32 [10653/1] STATS | queued entries     = 0
2022/09/25 19:54:32 [10653/1] STATS | action status:
2022/09/25 19:54:32 [10653/1] STATS |     action successful              = 21723260
2022/09/25 19:54:32 [10653/1] STATS |     moved or deleted since last update = 9901
2022/09/25 19:54:32 [10653/1] STATS |     whitelisted/ignored            = 16
2022/09/25 19:54:32 [10653/1] STATS |     incomplete metadata            = 65
2022/09/25 19:54:32 [10653/1] STATS |     action error                   = 19130
2022/09/25 19:54:32 [10653/1] STATS | 21723260 actions successful/21752372, 63.96 TB (137371843101 blocks)
2022/09/25 19:54:32 [10653/1] STATS | last action queued    439 s ago
2022/09/25 19:54:32 [10653/1] STATS | last action started   439 s ago
2022/09/25 19:54:32 [10653/1] STATS | last action completed 439 s ago
2022/09/25 20:03:10 [10653/4] alert | Policy is running (started 4d 18h 41min 54s ago): 18771567 actions succeeded (45.46/sec); volume: 52.64 TB (133.68 MB/sec); skipped: 13338; errors: 18671
2022/09/25 20:03:11 [10653/4] alert | Policy run summary: time=4d 18h 41min 55s; target=all; 18771567 successful actions (45.46/sec); volume: 52.64 TB (133.68 MB/sec); 13341 entries skipped; 18671 errors.
2022/09/25 20:03:11 [10653/4] alert | Checking policy rules for all
2022/09/25 20:03:11 [10653/4] alert | Starting policy run
2022/09/25 20:08:11 [10653/4] alert | Error retrieving list of candidates from database. Policy run cancelled.
2022/09/25 20:08:11 [10653/4] alert | Error running policy on all. 0 successful actions; volume: 0; 0 entries skipped; 0 errors.
(cluster)-[root@xxxx ~]$ rbh-report -f scratch -a
Using config file '/etc/robinhood.d/scratch.conf'.

Filesystem scan activity:

    Previous filesystem scan:
            start:           2022/09/20 14:24:27
            duration:        3d 20h 42s

    Last filesystem scan:
            status:          running
            start:           2022/09/25 10:25:45 (1d 2h 58min 25s ago)
            last action:     2022/09/26 13:09:32 (14min 38s ago)

         Statistics:
            entries scanned: 136487301
            # threads:       2
            average speed:   1428.57 entries/sec
        >>> current speed:   2666.67 entries/sec

Storage usage has never been checked

Policy 'alert':
    No current run

Has anyone ever had this happen?

garadar commented 1 year ago

Error resolved:

In /var/log/mysqld.log I saw this message

2022-10-25T12:46:56.193113Z 39 [Note] Aborted connection 39 to db: 'robinhood_scratch' user: 'robinhood' host: 'localhost' Error writing file '/tmp/MYr9jLF0' (Errcode: 28 - No space left )

Solution:

Change the variable tmpdir in /etc/my.cnf to location with enough space .

[mysqld]
<...>
tmpdir = /need/space/tmp

and then my policies come back to life !

2022/10/25 16:25:42 [2677/1] STATS | 5094 actions successful/5471, 13.28 GB (27848960 blocks)
2022/10/25 16:25:42 [2677/1] STATS | last action queued     2 s ago
2022/10/25 16:25:42 [2677/1] STATS | last action started    2 s ago
2022/10/25 16:25:42 [2677/1] STATS | last action completed  2 s ago