gerasiov opened this issue 2 years ago
/usr/sbin/bees is the binary in my example (I package bees a little differently in my environment).
The test case is not very representative. Most filesystems do not consist entirely of pairs of duplicate files that are otherwise unique, where each file contains fewer extents than there are cores in the CPU. Even filesystems that are structured that way still have a >95% dedupe hit rate: I ran this test 100 times and failed to dedupe 10 out of 800 total extents (8 extents per run), for a success rate of 98.75%. Admittedly, in this corner case, the reasons for not getting all the way to 100% are bad, but any success rate over 95% is still OK for the current bees design.
A tiny data set like this one will hit several known issues in the current design:
- file1 and file2 are written together and scanned in the same pass (each file has fewer extents than the CPU has cores, so every extent of both files can be in flight at once), and neither file's hashes are in the hash table yet when the other file's extents are scanned. Since each extent ref is scanned only once, the duplicated data is not detected until file3 comes along, or beescrawl.dat is reset. In those cases bees would repeat the scan and find the hashes inserted by the previous run. In a more typical filesystem this event is extremely rare (see the sketch below).
- The ExtentWalker class will throw an exception if a reverse extent search fails due to an inconsistent view of metadata. If dedupe changes a neighbouring extent while the search is running, ExtentWalker will throw an exception, and processing of the extent will be abandoned. This happened in one of my test runs but not yours.
- LOGICAL_INO and dedupe interfere with each other when they operate on the same parts of the metadata tree (it results in a large number of loops in tree mod log). The solution in both cases is to provide separation in time between identification of identical extents and the dedupe commands to remove them. Other dedupers achieve this easily by having completely separate scan and dedupe phases, at the cost of much more RAM usage.

Although these problems are individually fixable, the current crawler design has many more problems not listed above, and significant rework is required to gain even tiny improvements over the current implementation. Instead, the existing crawler code will be replaced by a new design which avoids the above issues from the start.
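To make the first point concrete, here is a rough sketch of how a scan-once deduper can miss duplicates that are written together. This is plain Python, not the actual bees code; the function name, the per-pass publishing of hashes, and the use of sha256 are illustrative assumptions, not the real implementation:

```python
import hashlib

# Hypothetical model, not the bees implementation: each pass scans every
# extent ref exactly once, and hashes produced during a pass only become
# visible in the shared hash table after the pass finishes (standing in
# for parallel workers that have not yet published their results).

def scan_pass(extents, hash_table):
    """Return (extent, earlier_extent) pairs that could be deduped."""
    matches = []
    pending = {}
    for name, data in extents:
        digest = hashlib.sha256(data).hexdigest()
        if digest in hash_table:
            matches.append((name, hash_table[digest]))  # duplicate of an earlier pass
        else:
            pending.setdefault(digest, name)            # published only after this pass
    hash_table.update(pending)
    return matches

table = {}
payload = b"x" * 4096  # stand-in for the identical file contents

# Pass 1: file1 and file2 are written together and scanned in the same pass,
# so neither sees the other's hashes and nothing is deduped.
print(scan_pass([("file1", payload), ("file2", payload)], table))  # []

# Pass 2: file3 appears (or the files are rescanned); the hashes inserted
# by the previous pass are now visible and the duplicate is found.
print(scan_pass([("file3", payload)], table))  # [('file3', 'file1')]
```

In this toy model, resetting beescrawl.dat amounts to running a second pass over file1 and file2, which then match the hashes published by the first pass.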
Very interesting!
Is there an issue I can subscribe to, or some other way to get notified when you start work on the new design, so I can try it out?
bees does not deduplicate all data and this is reproducible.
How to reproduce:
5. copy the file once again
6. run bees (and wait for deduplication to complete)

128MB of each file is not deduplicated.

7. remove beeshash.dat and restart bees

Suddenly all data is deduplicated.
detailed log