cea-hpc / robinhood

Robinhood Policy Engine : a versatile tool to monitor filesystem contents and schedule actions on filesystem entries.
http://robinhood.sf.net

lhsm_archive - number of errors are accumulating until it reaches max #106

Open geraldhofer opened 5 years ago

geraldhofer commented 5 years ago

It looks like the Lustre changelog is currently leaking UNLINK records that never report the actual deletion (the UNLINK_LAST flag is not set) when a file is still open at the time it is deleted.

Apparently a possible reproducer is to unlink a file while a process still holds it open, so that the UNLINK changelog record is emitted without the UNLINK_LAST flag (see the sketch below).
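A minimal sketch of that scenario in Python; the test path is a hypothetical location on the Lustre mount, not taken from this thread:

    import os

    # hypothetical test file on the Lustre filesystem
    path = "/lustre/scratch/open_unlink_test"

    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.write(fd, b"some data")

    # unlink while the descriptor is still open: Lustre emits an UNLINK
    # changelog record, but (with the bug described above) without UNLINK_LAST
    os.unlink(path)

    # the actual deletion only happens when the last open handle goes away,
    # and no further changelog record reports it
    os.close(fd)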

We end up with orphans in the database that have all their attributes except the path.

Apparently we need to fix the underlying issue in Lustre and I am working on that.

But some user application apparently triggers that issue now to an extent that it starts to impact our ability to migrate files in a reasonable time. We already set suspend_error_min = 10000000 quite high, anticipating that we would never hit it, but we still eventually got to a stage where we end up with too many errors and a very long runtime on the archive:

Policy 'lhsm_archive':
    Current run started on 2019/07/04 17:13:51: trigger: scheduled (daemon), target: all
    Last complete run: trigger: scheduled (daemon), target: all
        - Started on 2019/07/02 19:53:22
        - Finished on 2019/07/04 16:34:14 (duration: 1d 20h 40min 52s)
        - Summary: 48 successful actions, volume: 96.75 GB; 0 entries skipped; 10000002 errors

These are the errors we see in the log files:

2019/07/04 03:19:35 [32655/16] lhsm_archive | Warning: cannot determine if entry  is whitelisted: skipping it.
2019/07/04 03:19:35 [32655/16] Policy | [0x200128cc4:0x1709:0x0]: attribute is missing for checking fileset 'scratch'
2019/07/04 03:19:35 [32655/20] Policy | Missing attribute 'fullpath' for evaluating boolean expression on [0x200128cc4:0x1715:0x0]
2019/07/04 03:19:35 [32655/20] Policy | [0x200128cc4:0x1715:0x0]: attribute is missing for checking ignore_fileclass rule

The reason these orphans affect us is that we use a fileclass that requires the path information to determine whether we want to migrate a file or not:

FileClass scratch {
        definition {
            tree == "/lustre/scratch"
        }
}

So at the time of that error, the entry has already been deleted from the Lustre file system. Every subsequent archive run has to go through all the old errors again, so the runtime increases, and when suspend_error_min is reached before all entries have been archived, we miss files to migrate.

It looks like only a scan can remove these entries from the database. A scan takes more than a day on this system, maxes out the database and increases the load on the MDS, so we don't want to run it that frequently.
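For reference, a typical way such a full scan is launched; the configuration file path is a placeholder, not the one used on this system:

    robinhood -f /etc/robinhood.d/lustre.conf --scan --once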

In the end this is a database corruption issue: we have some entries that are corrupted/inconsistent (in this case because of the Lustre bug) and that cause errors during the migration. I think it makes sense to try to rectify these database errors by reading the entries again from the file system at the time of the archive run. That would avoid the need for a scan in a more general way, since database inconsistencies would get rectified as they are discovered by the migration. I am in the lucky position that we have upgraded the hardware and I was able to optimise the scan down to about a day (from a week); I would not be able to deal with this at all if my scan times were in the range of a week. But it would help to avoid a scan in a more general way if such errors triggered a rescan of that FID.

dtcray commented 5 years ago

A possible option would be to set the "invalid" flag in the DB for these types of entries: most policies ignore entries with the invalid flag set, and a subsequent scan would either purge them from the DB or set all the required attributes on those entries.

tl-cea commented 5 years ago

Gerald,

Your analysis is absolutely correct. This issue with open-deleted files is something we noticed too. As you mention, the best way to solve this is to make Lustre raise an UNLINK record with the UNLINK_LAST flag when the entry is actually deleted.

If the error limitation annoys you, you can set "suspend_error_min = 0" (which is the default) to disable this limit.

As mentioned by dtcray, the policy run should set the entry as invalid if it no longer exists. The check is done before or after the rate limiting, depending on the policy parameters (see https://github.com/cea-hpc/robinhood/wiki/robinhood_v3_admin_doc#policy-parameters). If that is not the case, try: pre_sched_match = auto_update;
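For reference, a minimal sketch of where these parameters could sit in the configuration; the lhsm_archive_parameters block name and the comments are assumptions based on the usual robinhood v3 layout, not something quoted from this thread:

    lhsm_archive_parameters {
        # re-match entries against the filesystem right before scheduling,
        # so entries that no longer exist are flagged as invalid
        pre_sched_match = auto_update;

        # 0 (the default) disables the "too many errors" suspension
        suspend_error_min = 0;
    }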

Regards, Thomas

geraldhofer commented 5 years ago

It looks like the option pre_sched_match = auto_update has successfully worked around this issue.

Setting suspend_error_min = 0 does not really help, as the runtime gets far too long when too many errors accumulate. In my example it was already running for more than a day; usually it runs in a few minutes (not counting the database query).

tl-cea commented 5 years ago

Good. @geraldhofer Do you have an open LU about the UNLINK_LAST flag for open-deleted files?