g2p / bedup

Btrfs deduplication
http://pypi.python.org/pypi/bedup
GNU General Public License v2.0

SQL is not scalable with the number of files #42

terje2001 closed this issue 10 years ago

terje2001 commented 10 years ago

I have a btrfs instance with more than 2 million files. Running dedup is incredibly slow and uses 3+ GB of memory. I ran with --verbose-sql, and it is evident that one of the main SQL statements is very inefficient: it seems to produce duplicate output rows whose count grows exponentially with the number of files being processed.

I looked through the code, and the following change made a world of difference:

[root@*** bedup]# git diff
diff --git a/bedup/tracking.py b/bedup/tracking.py
index d73e1bb..49e4d71 100644
--- a/bedup/tracking.py
+++ b/bedup/tracking.py
@@ -329,7 +329,7 @@ class WindowedQuery(object):
             window_start = li[0].size
             window_end = li[-1].size
             # If we wanted to be subtle we'd use limits here as well
-            inodes = self.sess.query(Inode).select_from(self.filtered_s).join(
+            inodes = self.sess.query(Inode).join(
                 window_select, window_select.c.size == Inode.size
             ).order_by(-Inode.size, Inode.ino)
             inodes_by_size = groupby(inodes, lambda inode: inode.size)

Sorry for pasting the patch inline; I am not proficient with Git. Hope this is OK.

Burnfaker commented 10 years ago

Thanks, this helped me! Same issue as https://github.com/g2p/bedup/issues/34

krk242 commented 10 years ago

This change seems to have solved issue #34 for me, thanks.

g2p commented 10 years ago

Thanks for that diff, terje2001. I can't apply it as-is because it would cause bugs (it would include files that are outside the targeted filesystem), but it explains where the performance issue is. I'm not sure I can make the SQL more efficient; maybe a DISTINCT or an ORDER BY will help, or maybe we'll have to switch to a simpler datastore for the Inode table.
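
To make that concrete, here is a minimal, self-contained sketch (a toy Inode model with a hypothetical fs_id column and an in-memory SQLite database, not bedup's real schema) of why the entity needs to stay mapped onto the filtered selectable: dropping the mapping, as the diff above does, also brings back inodes that live outside the target filesystem.

# A minimal sketch, not bedup's schema: fs_id stands in for "which
# filesystem this inode belongs to", filtered_s for bedup's filtered selectable.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, aliased, declarative_base

Base = declarative_base()

class Inode(Base):
    __tablename__ = 'inode'
    ino = Column(Integer, primary_key=True)
    fs_id = Column(Integer)  # stand-in for the volume the inode belongs to
    size = Column(Integer)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
sess = Session(engine)
sess.add_all([
    Inode(ino=1, fs_id=1, size=4096),
    Inode(ino=2, fs_id=1, size=4096),
    Inode(ino=3, fs_id=2, size=4096),  # lives on a different filesystem
])
sess.commit()

# Subquery restricted to the filesystem being deduplicated (the role played
# by self.filtered_s in tracking.py).
filtered_s = sess.query(Inode).filter(Inode.fs_id == 1).subquery()

# Keeping the Inode entity mapped onto the filtered selectable preserves the
# restriction: only inodes 1 and 2 come back.  This is what select_from()
# did on SQLAlchemy 0.8, and what aliased() (shown here) keeps doing.
FilteredInode = aliased(Inode, filtered_s)
print([i.ino for i in sess.query(FilteredInode).order_by(FilteredInode.ino)])  # [1, 2]

# Dropping the mapping, as the diff above does, queries the bare inode table
# and also returns inode 3, which is outside the target filesystem.
print([i.ino for i in sess.query(Inode).order_by(Inode.ino)])  # [1, 2, 3]

The select_entity_from() approach that comes up later in this thread preserves exactly this mapping; on newer SQLAlchemy releases, where that method is deprecated, the aliased() form above is the documented replacement.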

cuviper commented 10 years ago

I'm not well versed in databases, but I notice that SQLAlchemy 0.9 has a note about the behavior of select_from:

http://docs.sqlalchemy.org/en/rel_0_9/changelog/migration_09.html#query-select-from-no-longer-applies-the-clause-to-corresponding-entities

Might the way pre-0.9 wrapped things in an anonymous alias ("anon") explain the memory usage? I have only 0.8.5 on Fedora 20. @g2p, are you perhaps running SQLAlchemy 0.9+, so you only see the new behavior?
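
As a rough illustration of the behavior change that changelog entry describes (again a toy model on a current SQLAlchemy, not bedup's schema): the "anon" in question is the anonymous alias, anon_1, that pre-0.9 mapped the entity onto. On 0.9+, select_from() only edits the FROM list, so the inode table and the subquery can both end up in it as a cross join, and the number of result rows multiplies with the table size, which would account for the memory use.

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, aliased, declarative_base

Base = declarative_base()

class Inode(Base):
    __tablename__ = 'inode'
    ino = Column(Integer, primary_key=True)
    size = Column(Integer)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
sess = Session(engine)

filtered_s = sess.query(Inode).filter(Inode.size > 0).subquery()

# 0.9+ behavior (and current releases): the entity is not remapped, so the
# rendered SQL selects the inode columns FROM both the inode table and the
# anon_1 subquery, i.e. a cross join.
print(sess.query(Inode).select_from(filtered_s))

# Pre-0.9, select_from() aliased Inode onto the selectable, so only anon_1
# appeared in FROM; select_entity_from() (and aliased(), shown here) keep
# that behavior on newer versions.
print(sess.query(aliased(Inode, filtered_s)))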

cuviper commented 10 years ago

Hmm, I actually have that backwards. I didn't realize that I had a local sqlalchemy 0.9 which was overriding the distro version. As soon as I remove that and fall back to Fedora's sqlalchemy 0.8.5, it seems to be fine on memory.

As the sqlalchemy release notes suggest for the "old" behavior, I tried switching bedup to select_entity_from, and so far this appears to be working with either Fedora's sqlalchemy 0.8.5 or a local 0.9.4. So I believe the proper patch is this:

diff --git a/bedup/tracking.py b/bedup/tracking.py
index 68a123b..01aa797 100644
--- a/bedup/tracking.py
+++ b/bedup/tracking.py
@@ -330,7 +330,7 @@ class WindowedQuery(object):
             window_start = li[0].size
             window_end = li[-1].size
             # If we wanted to be subtle we'd use limits here as well
-            inodes = self.sess.query(Inode).select_from(self.filtered_s).join(
+            inodes = self.sess.query(Inode).select_entity_from(self.filtered_s).join(
                 window_select, window_select.c.size == Inode.size
             ).order_by(-Inode.size, Inode.ino)
             inodes_by_size = groupby(inodes, lambda inode: inode.size)

g2p commented 10 years ago

@cuviper, thank you for the well thought-out patch and the relevant pointers.