FlorianHeigl / hardlinkpy

Automatically exported from code.google.com/p/hardlinkpy
GNU General Public License v2.0

On-disk cache would be nice, to avoid excessive memory use #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
- What steps will reproduce the problem?
1. Run against file tree containing millions of files
2. Watch memory use eventually grow to several hundred MB

- What is the expected output? What do you see instead?
I would like hardlink.py's memory use to remain moderate.  Instead it will
eventually use all the RAM of the small virtual machines I use for backing
things up.

- What version of the product are you using? On what operating system?
hardlink.py: 0.05 - 2010-01-07 (07-Jan-2010), Debian Lenny

- Please provide any additional information below.
hardlink.py tries to keep its cache of file data in memory, but on
directory trees containing 10s of millions of files (such as those I have
from backing up other machines) this is difficult to fit in moderate RAM.

It would be nice to optionally be able to use an on-disk cache file of some
sort so that it doesn't need to keep it all in RAM.  This should be
optional and probably non-default because it will surely be slower.

Cheers,
Andy

Original issue reported on code.google.com by bill...@bitfolk.com on 31 Jan 2010 at 4:06

GoogleCodeExporter commented 9 years ago
Currently having a look at this. I initially did a small patch to convert the
file_hashes dictionary into a shelve "dictionary"-like object, but it choked on
long integers as keys [0]. I'm unsure whether to make my patch [1] bigger, or
file a bug against shelve to make it deal with long keys properly.

[0] http://pastebin.com/m5f13fa62
[1] http://pastebin.com/f5b529dfb
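
For anyone hitting the same thing: the shelve documentation says keys are ordinary strings, so integer keys have to be converted before use. Below is a minimal sketch of that workaround, not the patch in [1]. It is written for Python 3, and the helper names store_names and load_names are made up for illustration.

```python
import os
import shelve
import tempfile


def store_names(shelf, file_hash, filenames):
    """Store a list of filenames under an integer hash (hypothetical helper)."""
    # shelve keys must be strings, so convert the integer explicitly.
    shelf[str(file_hash)] = filenames


def load_names(shelf, file_hash):
    """Return the filenames stored for file_hash, or an empty list."""
    return shelf.get(str(file_hash), [])


# Usage: a throwaway shelf in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with shelve.open(os.path.join(tmp, "file_hashes")) as shelf:
        store_names(shelf, 2**70, ["a.txt", "b.txt"])
        print(load_names(shelf, 2**70))
```

The conversion has to happen on every access, which is probably why wrapping the lookups in one place is less error-prone than patching each use of file_hashes.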

Original comment by jshholl...@googlemail.com on 4 Feb 2010 at 10:15

GoogleCodeExporter commented 9 years ago
I have now written a new patch that wraps the anydbm module (working at a lower
level than shelve). Limited testing appears to show that it works, but it has
yet to be used on a large tree. It should be self-contained.
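
To give an idea of the approach, here is a rough sketch of a dict-like wrapper over a dbm file. It is not the attached patch: it assumes Python 3, where dbm replaces anydbm, and the class name DiskDict, the integer-key encoding and the use of pickle for values are all assumptions made for illustration.

```python
import dbm
import os
import pickle
import tempfile
from collections.abc import MutableMapping


class DiskDict(MutableMapping):
    """Hypothetical dict-like cache keyed by integers, stored in a dbm file."""

    def __init__(self, name):
        self.name = name
        # 'n' always creates a new, empty database, so each run starts fresh.
        self.db = dbm.open(self.name, "n")

    @staticmethod
    def _encode(key):
        # dbm stores bytes, so integer keys go in as ASCII digits.
        return str(key).encode("ascii")

    def __getitem__(self, key):
        # A missing key raises KeyError, just like a normal dict.
        return pickle.loads(self.db[self._encode(key)])

    def __setitem__(self, key, value):
        self.db[self._encode(key)] = pickle.dumps(value)

    def __delitem__(self, key):
        del self.db[self._encode(key)]

    def __iter__(self):
        for raw in self.db.keys():
            yield int(raw.decode("ascii"))

    def __len__(self):
        return len(self.db.keys())

    def close(self):
        self.db.close()


# Usage: keep the cache file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    cache = DiskDict(os.path.join(tmp, "file_hashes"))
    cache[2**70] = ["a.txt", "b.txt"]
    print(cache[2**70])
    cache.close()
```

Because values are pickled on every write, code that mutates a stored list in place would need to assign it back to the key for the change to reach the disk.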

Original comment by jshholl...@googlemail.com on 16 May 2010 at 12:20

GoogleCodeExporter commented 9 years ago
jshholland, I applied your patch and it appears to be working well on first try.

I had to modify it slightly to get it working.

I changed

self.db = anydbm.open(name, 'n')

to

self.db = anydbm.open(self.name, 'n')

Original comment by abnormal...@gmail.com on 10 Nov 2010 at 12:05