terje2001 closed this issue 10 years ago.
Thanks, this helped me! Same as https://github.com/g2p/bedup/issues/34
This change seems to have solved issue #34 for me, thanks.
Thanks for that diff, terje2001. I can't apply it as-is because it would cause bugs (by including files that are outside the targeted filesystem), but it points to where the performance issue is. I'm not sure I can make the SQL more efficient. Maybe a DISTINCT or an ORDER BY will help, or maybe we'll have to switch to a simpler datastore for the Inode table.
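For illustration only, the DISTINCT idea would look roughly like this on the windowed query that appears in the patch later in this thread (same names); this is an untested fragment of WindowedQuery, not a standalone or vetted fix:

# Hypothetical, untested tweak to the windowed query from bedup/tracking.py:
# DISTINCT would collapse any row duplication the join against window_select
# introduces before the ORM materializes Inode objects.
inodes = self.sess.query(Inode).select_from(self.filtered_s).join(
    window_select, window_select.c.size == Inode.size
).order_by(-Inode.size, Inode.ino).distinct()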
I'm not well versed in databases, but I notice that the sqlalchemy 0.9 release notes describe a behavior change in select_from.
Might the way pre-0.9 used an anonymous alias explain the memory usage? I only have 0.8.5 on Fedora 20. @g2p, are you perhaps running sqlalchemy 0.9+, so you only see the new behavior?
Hmm, I actually have that backwards. I didn't realize that I had a local sqlalchemy 0.9 which was overriding the distro version. As soon as I remove that and fall back to Fedora's sqlalchemy 0.8.5, memory usage seems fine.
As the sqlalchemy release notes suggest for the "old" behavior, I tried switching bedup to select_entity_from, and so far this appears to be working with either Fedora's sqlalchemy 0.8.5 or a local 0.9.4. So I believe the proper patch is this:
diff --git a/bedup/tracking.py b/bedup/tracking.py
index 68a123b..01aa797 100644
--- a/bedup/tracking.py
+++ b/bedup/tracking.py
@@ -330,7 +330,7 @@ class WindowedQuery(object):
             window_start = li[0].size
             window_end = li[-1].size
             # If we wanted to be subtle we'd use limits here as well
-            inodes = self.sess.query(Inode).select_from(self.filtered_s).join(
+            inodes = self.sess.query(Inode).select_entity_from(self.filtered_s).join(
                 window_select, window_select.c.size == Inode.size
             ).order_by(-Inode.size, Inode.ino)
             inodes_by_size = groupby(inodes, lambda inode: inode.size)
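To make the behavior difference concrete, here is a minimal, self-contained sketch (not taken from bedup; the toy Inode model, the in-memory sqlite engine, and the filtered_s name are stand-ins I made up), written against the 0.8/0.9-era sqlalchemy API discussed above:

from sqlalchemy import Column, Integer, create_engine, select
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Inode(Base):
    # Toy stand-in for bedup's Inode model.
    __tablename__ = 'inode'
    ino = Column(Integer, primary_key=True)
    size = Column(Integer)


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
sess = sessionmaker(bind=engine)()

# A filtered subset of the inode table, standing in for bedup's filtered_s.
filtered_s = select([Inode.__table__]).where(Inode.size > 0).alias('filtered')

# select_entity_from() maps the Inode entity onto filtered_s, so rows are
# read from the filtered selectable only -- the pre-0.9 select_from()
# behavior that the patch above requests explicitly.
q_entity = sess.query(Inode).select_entity_from(filtered_s)

# On sqlalchemy 0.9+, select_from() only sets the FROM clause; Inode is still
# selected from its own table, so both appear in FROM and the result can
# multiply into duplicate rows.
q_from = sess.query(Inode).select_from(filtered_s)

# Printing the queries shows the difference in the generated FROM clauses.
print(q_entity)
print(q_from)

As I understand it, that extra FROM entry in the second query is where the duplicate rows and the memory blow-up reported here come from.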
@cuviper, thank you for the well thought-out patch and the relevant pointers.
I have a btrfs instance with more than 2 million files. Running dedup is incredibly slow and uses 3+ GB of memory. I ran with --verbose-sql, and it is evident that one of the main SQL statements is very inefficient: it seems to produce duplicate output rows, and the duplication grows exponentially with the number of files being processed.
I looked through the code, and the following change made a world of difference:
Sorry for pasting the patch inline; I am not proficient with Git. Hope this is OK.