Closed ovidiustanila closed 11 months ago
Found that a bunch of missing attribs were caused by a failure to move attrib files across filesystems (we split the pool across two disks due to a disk size limitation). When rename failed, it didn't fall back to a regular file copy, which left temporary files in BackupPC/pool and resulted in missing attrib files. bpc_poolWrite_addToPool: replacing empty pool file
These changes worked for us and got rid of this scenario.
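For readers unfamiliar with the failure mode: rename(2) fails with EXDEV when the source and destination are on different filesystems, so a caller has to fall back to copy-and-unlink. The following is only a minimal sketch of that general pattern, not the code from this PR; move_file and copy_file are hypothetical names that do not exist in rsync_bpc.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst, returns 0 on success, -1 on error. */
int copy_file(const char *src, const char *dst)
{
    char buf[65536];
    ssize_t n;
    struct stat st;
    int in, out, saved;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    if (fstat(in, &st) < 0) {
        saved = errno;
        close(in);
        errno = saved;
        return -1;
    }
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    if (out < 0) {
        saved = errno;
        close(in);
        errno = saved;
        return -1;
    }

    while ((n = read(in, buf, sizeof buf)) > 0) {
        char *p = buf;
        while (n > 0) {                 /* handle short writes */
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            p += w;
            n -= w;
        }
    }
    if (n < 0)                          /* read error */
        goto fail;

    close(in);
    return close(out);                  /* close can report deferred write errors */

fail:
    saved = errno;
    close(in);
    close(out);
    unlink(dst);                        /* don't leave a partial file behind */
    errno = saved;
    return -1;
}

/* Hypothetical helper: rename, falling back to copy + unlink across filesystems. */
int move_file(const char *src, const char *dst)
{
    if (rename(src, dst) == 0)
        return 0;
    if (errno != EXDEV)                 /* only a cross-device error gets the fallback */
        return -1;
    if (copy_file(src, dst) < 0)
        return -1;
    return unlink(src);
}
```

Note that the fallback is not atomic the way rename is, so a crash mid-copy can still leave a partial destination file; the sketch only removes it when the copy itself reports an error.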
Thanks for the PR!
We had some problems with a prior backup (out of disk space) and then we started hitting a lot of errors on the next trigger, like:
G bpc_attribCache_dirWrite: failed to write attributes for dir f%2f/[path]/f239589/fthumb/attrib
G bpc_attrib_dirWrite: rename from /data/BackupPC/pool/46/e8/46e80cf0e3b683b5f82d951a37ae7037 to /data/BackupPC/pc/[host]/368/f%2f/[path]/f239589/attrib_46e80cf0e3b683b5f82d951a37ae7037 failed
G bpc_attribCache_dirWrite: failed to write attributes for dir f%2f/[path]/f239589/attrib
G bpc_attrib_dirWrite: can't open/create raw /data/BackupPC/pool/f4/cc/f5cdcfe2b35f182d10d3b371335c9880 for writing
G bpc_attribCache_dirWrite: failed to write attributes for dir f%2f/[path]/f239689/attrib
After some digging around, we found that those were caused by our open file limits:
$ prlimit -p28051 -n
RESOURCE DESCRIPTION              SOFT HARD UNITS
NOFILE   max number of open files 1024 4096
rsync_bpc had a lot of open files, and that caused all kinds of different errors when pooling:
$ lsof -p 28051 | grep -Po '/.*needFsck[0-9]' | sort | uniq -c
    333 /data/BackupPC/pc/[host]/367/refCnt/needFsck1
    940 /data/BackupPC/pc/[host]/368/refCnt/needFsck1
We've increased those limits to get things going and patched rsync_bpc to close the file to avoid this in the future. We'll do a complete fsck to clean up whatever errors were introduced during this problem and then re-trigger a full backup for all hosts. Hopefully that will get rid of most of the errors.
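As an aside, the same NOFILE limit that prlimit reports can be read and raised from inside a process. This is a minimal sketch assuming Linux and the standard getrlimit/setrlimit calls; raise_nofile_limit is a hypothetical name, and nothing like it is claimed to exist in rsync_bpc. It only lifts the soft limit up to the existing hard limit, so it works around a low soft limit but does not fix a descriptor leak.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft open-file limit to the hard limit.
 * Returns 0 on success, -1 on error (errno is set by getrlimit/setrlimit). */
int raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == rl.rlim_max)     /* already at the hard limit */
        return 0;

    rl.rlim_cur = rl.rlim_max;          /* unprivileged processes may go up to the hard limit */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

With the values shown above, this would move the soft limit from 1024 to 4096; raising the hard limit itself needs privileges or a change in the service or login configuration.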
Cheers, Ovidiu