Berimor66 / duplicati

Automatically exported from code.google.com/p/duplicati

Limit how many files are kept open #758

Open · GoogleCodeExporter opened this issue 9 years ago

GoogleCodeExporter commented 9 years ago
In the current setup, the RsyncDir instance extracts all filenames from the zip archives before continuing. This opens all the archives, which causes a "Too many files open" error when there are more than about 1000 volumes. The archives should be closed again and then reopened later, when they are actually needed.
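
For illustration, a minimal sketch (hypothetical names, not Duplicati's actual wrapper API) of that close-and-reopen idea: read the entry list once, close the archive immediately, and only reopen it when a member is actually needed.

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

class LazyVolume
{
    private readonly string m_path;
    public IReadOnlyList<string> FileNames { get; private set; }

    public LazyVolume(string path)
    {
        m_path = path;
        using (var zip = ZipFile.OpenRead(path))        // open briefly...
        {
            var names = new List<string>();
            foreach (var e in zip.Entries)
                names.Add(e.FullName);
            FileNames = names;
        }                                               // ...and close right away
    }

    // Re-open the archive only when the content is actually read.
    public byte[] ReadEntry(string name)
    {
        using (var zip = ZipFile.OpenRead(m_path))
        using (var s = zip.GetEntry(name).Open())       // GetEntry returns null if missing
        using (var ms = new MemoryStream())
        {
            s.CopyTo(ms);
            return ms.ToArray();
        }
    }
}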

Original issue reported on code.google.com by kenneth@hexad.dk on 13 Dec 2012 at 9:18

GoogleCodeExporter commented 9 years ago
I've finally had some time to do some research on this and somewhat understand 
what is going on.  Heh, I told ya it might take me a bit.  I'm somewhat 
comfortable with MonoDevelop now though.  I also had a little down time over 
the weekend to follow what the code is doing.

There is one question I'm wondering about, though; I didn't follow the code far enough to see what is going on with this part. Are the patch files that get stored in /tmp (the Linux location) just signature files you have renamed to patch.zip files? That is what it looks like to me. I wouldn't think you would recompress them, since they appear to be the same content with just a different file name. I also found a variable in that same part of the RsyncDir code (m_patches) that isn't being used but takes up extra memory. Maybe you had some idea for it that you abandoned later; I know how that goes. :)

Here is what I think is going on: when you run a backup, most likely a full one or one with a lot of data, it gets split into a lot of volumes for that date. Are the signatures only for the data located in that same zip, or is the signature file replicated between the splits? Say the data for a file you are backing up is more than 10 MB; does each zip only store the signature for its own part of that file?

Does that question make sense?

For efficiency, I would imagine just the part of the file that is in that zip. For ease, it would either be replicated or not placed in that zip at all. I'm trying to figure out which approach you took.

This affects the best approach for fixing the too-many-volumes-open issue.

I have a couple of ideas:

1) One idea would be to have a single signature zip file per backup date. That would cut down on the number of zip parts, but it would take a good bit of rewriting.

2) Decompress the signatures into a temp folder or something similar. This could take up a good bit of space, though of course a lot less than the whole files would. The code could be easy or complicated to write, depending on how you are packing the signatures into the zip files; that is why I asked the question above about how you are handling the signatures.

3) A database file that stores all the signatures, then diff against the database as you do subsequent incremental backups. Or just store the INSERT queries in a file and rebuild the database from that. (A rough sketch of what this might look like is below.)
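
For illustration only, here is a rough sketch of what idea 3 might look like with an SQLite table keyed by path. The table and column names are made up, and System.Data.SQLite is just one possible binding:

using System.Data.SQLite;   // assumption: the System.Data.SQLite binding is available

static byte[] LookupSignature(string dbPath, string relativePath)
{
    using (var conn = new SQLiteConnection("Data Source=" + dbPath))
    {
        conn.Open();

        // Create the cache table on first use (schema is invented for the sketch).
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = @"CREATE TABLE IF NOT EXISTS Signature (
                                    Path       TEXT PRIMARY KEY,
                                    BackupTime INTEGER NOT NULL,
                                    Data       BLOB NOT NULL)";
            cmd.ExecuteNonQuery();
        }

        // Incremental run: fetch the previous signature without opening any volume.
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT Data FROM Signature WHERE Path = @p";
            cmd.Parameters.AddWithValue("@p", relativePath);
            return (byte[])cmd.ExecuteScalar();   // null if the file is new
        }
    }
}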

Those are my ideas so far.  A couple of answers would be helpful though.

Thanks,
Chris  

Original comment by cwhitehe...@gmail.com on 25 Feb 2013 at 1:47

GoogleCodeExporter commented 9 years ago
More questions...

Ok, I got the signature thing.

What is the reason for the hashes over your zip files in the Manifest file? Is it strictly for data integrity, i.e. the integrity of each zip? There are definitely some speed improvements you could get by using something like the MurmurHash3 hash, since you don't need the cryptographic properties of SHA256.
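
For reference, a minimal sketch of hashing a finished volume with the standard SHA256 class (the method and path here are hypothetical); swapping in a non-cryptographic hash would only change the hasher, not the surrounding flow:

using System;
using System.IO;
using System.Security.Cryptography;

// Hash a compressed volume for the manifest; the cost is one sequential read of the file.
static string HashVolume(string volumePath)
{
    using (var sha = SHA256.Create())
    using (var fs = File.OpenRead(volumePath))
        return Convert.ToBase64String(sha.ComputeHash(fs));
}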

Do you run rsync over all files to determine whether they have changed? I'm thinking you do. Wouldn't it be faster to keep a hash signature of each whole file to be backed up, and only run rsync when the hash differs? Or are you concerned about collisions?

Original comment by cwhitehe...@gmail.com on 26 Feb 2013 at 10:03

GoogleCodeExporter commented 9 years ago
Yes, the patch-xxx.zip files in /tmp are the signature files renamed.
There is an issue somewhere requesting that these are read directly from the cache folder and not copied into temp, as that is a waste of time, but it requires some rewriting to get working.

The presence of a signature file is considered "proof" that the matching 
content/diff data is in place.
This means that the signature for a file is not uploaded until the file is 
fully backed up (i.e. after all split parts are stored).
This simplifies the incremental backup stage, as it can simply assume that once 
it discovers a signature file, the data is there, and there is no need to 
examine the content volumes.
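
In other words, the signature volume acts as a commit marker. A sketch of that ordering, with invented names standing in for the real backend calls:

// Hypothetical sketch of the ordering described above; the interface and method
// are stand-ins, not Duplicati's real backend API.
interface IVolumeUploader { void Put(string localPath); }

static void CommitFileBackup(IVolumeUploader backend, System.Collections.Generic.IEnumerable<string> contentVolumes, string signatureVolume)
{
    foreach (var volume in contentVolumes)
        backend.Put(volume);          // content/diff data goes first

    backend.Put(signatureVolume);     // the signature is uploaded last: if the backup
                                      // stops before this point, no signature exists,
                                      // so the next run simply redoes the file
}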

1) That is most likely not possible. The backends may have per-file size limits, so simply collecting all the signatures could create a signature volume that is too large. It also defeats the idea that you can stop the backup at any time (e.g. the cable is cut) and still continue next time.

2) I think this would cause a huge overhead. Each signature file should only be 
used once.

3) Yes, that was an idea I had a while back. Instead of the signature cache 
storing files, simply build a database of the data. That would speed up many 
things and allow local queries.

For the hash, yes, it is purely for integrity reasons. SHA256 additionally provides tamper protection, and the time spent hashing the compressed volume is not currently a limiting factor.

Normally, the timestamp is checked to decide whether a full rsync should be done. If the timestamp has not changed, the file is not checked further. When the file is checked, a full signature is calculated; if the signature differs from the previous version, the file is read again to produce the diff. This two-pass approach is a bit annoying and needs some special handling for cases where the file is opened by another process. In theory it could be handled by generating the diff at the same time the signature is calculated, but that requires some tricks to avoid writing the content data if the file turns out not to have changed. Not too hard though, as the diff will only be a few bytes, but still some work.
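
A rough outline of that decision flow (the signature helper is passed in as a stand-in; the real logic lives in RSyncDir and is more involved):

using System;
using System.IO;
using System.Linq;

// Not the real RSyncDir code: just the two-pass behaviour described above.
static bool NeedsBackup(
    FileInfo file,
    DateTime lastBackupTimeUtc,
    byte[] oldSignature,
    Func<string, byte[]> computeSignature)        // stand-in for the rsync signature code
{
    // Pass 0: cheap timestamp check; untouched files are skipped entirely.
    if (file.LastWriteTimeUtc <= lastBackupTimeUtc)
        return false;

    // Pass 1: read the file once to compute its signature.
    byte[] newSignature = computeSignature(file.FullName);
    if (oldSignature != null && oldSignature.SequenceEqual(newSignature))
        return false;                             // touched, but content unchanged

    // Pass 2 (done by the caller): read the file again to produce the diff.
    // Merging the passes would need buffering so unchanged files write no content.
    return true;
}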

My idea for solving this would be a two-stage solution:
S1) Simply modify the compression wrapper to open/close the files as needed.
This should be reasonably simple to do, but will have a negative impact on 
performance.

S2) After verifying that (S1) works, implement an "open file MRU cache", so there is a limit on how many signature files are kept open. With a little luck, this will remove almost all of the open/close overhead. It should be possible to hide this detail in the compression wrapper as well, so the interfacing code does not need to change. (A rough sketch of the idea follows below.)
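
A minimal sketch of the (S2) idea, assuming a hypothetical wrapper that can re-open itself; the point is only that no more than a fixed number of archives stay open at once:

using System.Collections.Generic;

// IReopenableArchive is an assumed interface, not Duplicati's real wrapper.
interface IReopenableArchive
{
    void Open();    // no-op if already open
    void Close();
}

class OpenArchiveCache
{
    private readonly int m_limit;
    private readonly LinkedList<IReopenableArchive> m_mru = new LinkedList<IReopenableArchive>();

    public OpenArchiveCache(int limit) { m_limit = limit; }

    // Call this right before reading from an archive.
    public void Touch(IReopenableArchive archive)
    {
        m_mru.Remove(archive);          // move to the front if already tracked
        m_mru.AddFirst(archive);
        archive.Open();

        while (m_mru.Count > m_limit)   // evict the least recently used archives
        {
            IReopenableArchive oldest = m_mru.Last.Value;
            m_mru.RemoveLast();
            oldest.Close();
        }
    }
}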

I have, however, started on a more radical solution, which is a block-based format. You can see the code in sandboxes/Kenneth/ForestHash. This new format uses a local database to keep track of which blocks are where, and it can operate in an entirely file-by-file manner, so it is not prone to the too-many-files-open problem. It still requires quite a bit of tuning and cleanup, so it is not ready to be used in production yet. At this point, I am not sure whether the current full/incremental approach should be deprecated once the block-based format is done.
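
Conceptually (this is not the actual ForestHash code), block-based storage boils down to splitting files into fixed-size blocks and storing only blocks whose hash has not been seen before:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Conceptual sketch only: split a file into blocks and deduplicate by block hash.
static void StoreBlocks(
    string path,
    HashSet<string> knownBlockHashes,
    Action<byte[], int, string> uploadBlock,   // stand-in for the real storage call
    int blockSize = 100 * 1024)
{
    using (var sha = SHA256.Create())
    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[blockSize];
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            string key = Convert.ToBase64String(sha.ComputeHash(buffer, 0, read));
            if (knownBlockHashes.Add(key))       // true only for unseen blocks
                uploadBlock(buffer, read, key);
            // either way, the local database records that this file uses this block
        }
    }
}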

Original comment by kenneth@hexad.dk on 26 Feb 2013 at 1:30

GoogleCodeExporter commented 9 years ago
I was looking at it more this morning.  

It looks like there is a possible simple fix for this issue. There might need to be a third constructor added to RsyncDir, though, or a rewrite of the one that deals with incrementals.

I'm just going to have to examine that patches variable and make sure of everything it is being used for. I also thought about what you were saying about the compression wrapper: having a way to release the zips after their info is loaded into the object. That could be a pain, though; I'm not sure yet.

Alright Kenneth, have a good day man...

Original comment by cwhitehe...@gmail.com on 26 Feb 2013 at 2:51

GoogleCodeExporter commented 9 years ago
Alright, found a very simple fix for this issue!

It looks like a solid fix, even though it is not necessarily the best one. There are a lot of things that look like they could be made more efficient, or at least I would have done a couple of things differently. That doesn't make them right or wrong, and I'm sure that as you were coding, things came up that you had to deal with one way or another.

Force a Dispose() on patches.Value at the end of:

public RSyncDir(string[] sourcefolder, CommunicationStatistics stat, Backupmetadata metadata, Utility.FilenameFilter filter, List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)

So basically at the end of that constructor loop, after the last use of the z variable.

I rewrote the loop slightly, changing the foreach over to a for loop.

CODE:

public RSyncDir(string[] sourcefolder, CommunicationStatistics stat, 
Backupmetadata metadata, Utility.FilenameFilter filter, 
List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)
            : this(sourcefolder, stat, metadata, filter)
        {
            string[] prefixes = new string[] {
                Utility.Utility.AppendDirSeparator(COMBINED_SIGNATURE_ROOT),
                Utility.Utility.AppendDirSeparator(CONTENT_SIGNATURE_ROOT),
                Utility.Utility.AppendDirSeparator(DELTA_SIGNATURE_ROOT)
            };
            /*  NOT USED
            m_patches = new List<CompressionWrapper>();
            foreach (KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches)
                m_patches.Add(patch.Value);
            */

            for(int tI = 0; tI < patches.Count;tI++)            
            {
                CompressionWrapper z = patches[tI].Value;

                if (z.FileExists(DELETED_FILES))
                    foreach (string s in z.ReadPathLines(DELETED_FILES))
                    {
                        m_oldSignatures.Remove(s);
                        m_lastVerificationTime.Remove(s);
                        m_oldSymlinks.Remove(s);
                    }

                foreach (string prefix in prefixes)
                {
                    ArchiveWrapper aw = new ArchiveWrapper(z, patches[tI].Key.Time.ToUniversalTime(), prefix);
                    foreach (string f in z.ListFiles(prefix))
                    {
                        string name = f.Substring(prefix.Length);
                        m_oldSignatures[name] = aw;
                        m_lastVerificationTime.Remove(name);
                    }
                }

                string symlinkprefix = Utility.Utility.AppendDirSeparator(SYMLINK_ROOT);
                foreach(string s in z.ListFiles(symlinkprefix))
                    m_oldSymlinks[s.Substring(symlinkprefix.Length)] = z.ReadPathString(s);

                if (z.FileExists(UNMODIFIED_FILES))
                    foreach (string s in z.ReadPathLines(UNMODIFIED_FILES))
                        m_lastVerificationTime[s] = patches[tI].Key.Time.ToUniversalTime();

                if (z.FileExists(DELETED_FOLDERS))
                    foreach (string s in z.ReadPathLines(DELETED_FOLDERS))
                        m_oldFolders.Remove(s);

                if (z.FileExists(ADDED_FOLDERS))
                {
                    DateTime t = z.GetLastWriteTime(ADDED_FOLDERS).ToUniversalTime();
                    string[] filenames = z.ReadPathLines(ADDED_FOLDERS);

                    if (z.FileExists(ADDED_FOLDERS_TIMESTAMPS))
                    {
                        string[] timestamps = z.ReadAllLines(ADDED_FOLDERS_TIMESTAMPS);
                        long l;
                        for(int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                            if (long.TryParse(timestamps[i], out l))
                                m_oldFolders[filenames[i]] = Library.Utility.Utility.EPOCH.AddSeconds(l);
                            else
                                m_oldFolders[filenames[i]] = t;
                    }
                    else
                    {
                        foreach (string s in filenames)
                            m_oldFolders[s] = t;
                    }
                }

                if (z.FileExists(UPDATED_FOLDERS) && z.FileExists(UPDATED_FOLDERS_TIMESTAMPS))
                {
                    string[] filenames = z.ReadPathLines(UPDATED_FOLDERS);
                    string[] timestamps = z.ReadAllLines(UPDATED_FOLDERS_TIMESTAMPS);
                    long l;

                    for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                        if (long.TryParse(timestamps[i], out l))
                            m_oldFolders[filenames[i]] = Utility.Utility.EPOCH.AddSeconds(l);
                }

                //The latest file is the most valid
                if (z.FileExists(USN_VALUES))
                    using (System.IO.Stream s = z.OpenRead(USN_VALUES))
                        m_lastUSN = new USNRecord(s);

                patches[tI].Value.Dispose();
            }
        }

Original comment by cwhitehe...@gmail.com on 2 Mar 2013 at 2:05

GoogleCodeExporter commented 9 years ago
By the way, you may want to un-comment that m_patches variable.  I changed some 
other parts of the code since that variable is never used.

Original comment by cwhitehe...@gmail.com on 2 Mar 2013 at 2:08

GoogleCodeExporter commented 9 years ago
Here is the code using the original foreach loop:

CODE:

public RSyncDir(string[] sourcefolder, CommunicationStatistics stat, 
Backupmetadata metadata, Utility.FilenameFilter filter, 
List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)
            : this(sourcefolder, stat, metadata, filter)
        {
            string[] prefixes = new string[] {
                Utility.Utility.AppendDirSeparator(COMBINED_SIGNATURE_ROOT),
                Utility.Utility.AppendDirSeparator(CONTENT_SIGNATURE_ROOT),
                Utility.Utility.AppendDirSeparator(DELTA_SIGNATURE_ROOT)
            };
            /*  NOT USED
            m_patches = new List<CompressionWrapper>();
            foreach (KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches)
                m_patches.Add(patch.Value);
            */

            foreach(KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches) 
            {
                CompressionWrapper z = patch.Value;

                if (z.FileExists(DELETED_FILES))
                    foreach (string s in z.ReadPathLines(DELETED_FILES))
                    {
                        m_oldSignatures.Remove(s);
                        m_lastVerificationTime.Remove(s);
                        m_oldSymlinks.Remove(s);
                    }

                foreach (string prefix in prefixes)
                {
                    ArchiveWrapper aw = new ArchiveWrapper(z, patch.Key.Time.ToUniversalTime(), prefix);
                    foreach (string f in z.ListFiles(prefix))
                    {
                        string name = f.Substring(prefix.Length);
                        m_oldSignatures[name] = aw;
                        m_lastVerificationTime.Remove(name);
                    }
                }

                string symlinkprefix = Utility.Utility.AppendDirSeparator(SYMLINK_ROOT);
                foreach(string s in z.ListFiles(symlinkprefix))
                    m_oldSymlinks[s.Substring(symlinkprefix.Length)] = z.ReadPathString(s);

                if (z.FileExists(UNMODIFIED_FILES))
                    foreach (string s in z.ReadPathLines(UNMODIFIED_FILES))
                        m_lastVerificationTime[s] = patch.Key.Time.ToUniversalTime();

                if (z.FileExists(DELETED_FOLDERS))
                    foreach (string s in z.ReadPathLines(DELETED_FOLDERS))
                        m_oldFolders.Remove(s);

                if (z.FileExists(ADDED_FOLDERS))
                {
                    DateTime t = z.GetLastWriteTime(ADDED_FOLDERS).ToUniversalTime();
                    string[] filenames = z.ReadPathLines(ADDED_FOLDERS);

                    if (z.FileExists(ADDED_FOLDERS_TIMESTAMPS))
                    {
                        string[] timestamps = z.ReadAllLines(ADDED_FOLDERS_TIMESTAMPS);
                        long l;
                        for(int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                            if (long.TryParse(timestamps[i], out l))
                                m_oldFolders[filenames[i]] = Library.Utility.Utility.EPOCH.AddSeconds(l);
                            else
                                m_oldFolders[filenames[i]] = t;
                    }
                    else
                    {
                        foreach (string s in filenames)
                            m_oldFolders[s] = t;
                    }
                }

                if (z.FileExists(UPDATED_FOLDERS) && z.FileExists(UPDATED_FOLDERS_TIMESTAMPS))
                {
                    string[] filenames = z.ReadPathLines(UPDATED_FOLDERS);
                    string[] timestamps = z.ReadAllLines(UPDATED_FOLDERS_TIMESTAMPS);
                    long l;

                    for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                        if (long.TryParse(timestamps[i], out l))
                            m_oldFolders[filenames[i]] = Utility.Utility.EPOCH.AddSeconds(l);
                }

                //The latest file is the most valid
                if (z.FileExists(USN_VALUES))
                    using (System.IO.Stream s = z.OpenRead(USN_VALUES))
                        m_lastUSN = new USNRecord(s);

                patch.Value.Dispose();
            }
        }

END CODE

If you don't want to use the Dispose(), then the constructor signature will have to be changed, or an additional constructor added. Otherwise, we will have to figure out a way for CompressionWrapper to monitor the use of that object, which will end up taking more resources. But if you want to keep it inside the CompressionWrapper class, it can be done.
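
If that route is taken, one way to "monitor the use" would be simple reference counting inside the wrapper. This is only a sketch, and the member names are invented; the real CompressionWrapper internals would differ:

using System;

class CountedCompressionWrapper : IDisposable
{
    private int m_refs = 1;       // the creator holds the first reference

    public void AddRef()          // e.g. called when an ArchiveWrapper keeps a reference
    {
        m_refs++;
    }

    public void Dispose()         // each holder disposes when done
    {
        if (--m_refs == 0)
        {
            // only now is the underlying zip archive actually closed
        }
    }
}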

Have a good weekend man,

Chris

Original comment by cwhitehe...@gmail.com on 2 Mar 2013 at 3:52

GoogleCodeExporter commented 9 years ago
Issue 798 has been merged into this issue.

Original comment by rst...@gmail.com on 3 Mar 2013 at 8:49

GoogleCodeExporter commented 9 years ago
This issue was updated by revision ab8b3e5ca168.

Removed the m_patches variable that is not used.

Original comment by kenneth@hexad.dk on 8 Mar 2013 at 10:18

GoogleCodeExporter commented 9 years ago
You cannot just dispose the CompressionWrapper there, as that will also close 
the archive.
That will create an error when accessing m_oldSignatures[relname] later on.

Did you try your patch?

(Nice catch with the other m_patches variable :) )

Original comment by kenneth@hexad.dk on 8 Mar 2013 at 10:25

GoogleCodeExporter commented 9 years ago
Yeah, it works...

I traced the patches object back into Interfaces, and it isn't used again after being used in that part of RsyncDir.

Are you sure?

Original comment by cwhitehe...@gmail.com on 8 Mar 2013 at 11:09

GoogleCodeExporter commented 9 years ago
I'm rewriting that whole part of the code in Interfaces; I've been dabbling with it on and off this week. Instead of passing that patches object over, I was trying to pass over the objects needed to construct a compressor in RsyncDir. entries looks like it holds pretty much everything needed to construct a CompressionWrapper.

Original comment by cwhitehe...@gmail.com on 8 Mar 2013 at 11:13

GoogleCodeExporter commented 9 years ago
Yes, positive:
the variable z is assigned here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#672

And a reference is kept through ArchiveWrapper in m_oldSignatures here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#684

And used here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#1623

If it is disposed, that read will fail. It will not be accessed if you do a 
full backup though.

Original comment by kenneth@hexad.dk on 8 Mar 2013 at 11:14

GoogleCodeExporter commented 9 years ago
I think I know why it didn't fail. Instead of a bunch of files, I had one huge file and had changed the split size to 1 MB, so it split roughly 1.2 GB into 1200 or so archives. I'm guessing that maybe it did not fail because there was only one signature in that file.

I'll do more research when I get home...

Thanks for the pointers. ;)

Original comment by cwhitehe...@gmail.com on 8 Mar 2013 at 11:29

GoogleCodeExporter commented 9 years ago
Yep... You are right!

Damn, I didn't notice that. This is going to be a bit more complex to deal with than expected. I didn't realize ArchiveWrapper was holding on to that part of the patches object. It's going to be tough to make this efficient.

Original comment by cwhitehe...@gmail.com on 9 Mar 2013 at 1:20

GoogleCodeExporter commented 9 years ago
Yes. I think the only way to solve it is to keep a pool of open files.

That will not perform in the best possible way, but it should put a damper on the number of times files are opened and closed, without keeping too many files open.

Original comment by kenneth@hexad.dk on 14 Mar 2013 at 9:19

GoogleCodeExporter commented 9 years ago
I was looking through the code, trying to figure out exactly which signatures are being used by ArchiveWrapper.

I'm not sure at what point the code uses the rsync-generated signatures. Does it only use those if another check fails? For example a SHA256 check, after which it uses the rsync signatures to figure out which part of the file has changed?

I started writing code for a database to cache the signatures. Dealing with multiple OSes looks like a real pain in the butt.

Here is what I was thinking: either save the smaller signatures in memory or in a database. Then, when the larger signatures have to be used, the archive will just have to be re-opened. (A rough sketch of that split is below.)
Original comment by cwhitehe...@gmail.com on 14 Mar 2013 at 2:37

GoogleCodeExporter commented 9 years ago
Issue 799 has been merged into this issue.

Original comment by kenneth@hexad.dk on 21 Apr 2013 at 7:35

GoogleCodeExporter commented 9 years ago
Not sure if you're accepting "me too"s, but Duplicati 1.3.4 on Ubuntu Raring 13.04 is hanging for me at about 30%. This last time it was hanging on a simple, small, old HTML file that I haven't touched in years.

$ lsof -p $(pidof Duplicati) | wc -l
1651

Original comment by pjala...@gigalock.com on 11 Jun 2013 at 3:35

GoogleCodeExporter commented 9 years ago
If they are working on a platform-neutral GUI, fixing this bug in this .NET/Mono version might just be ignored.

Original comment by yfr...@gmail.com on 11 Jun 2013 at 3:49

GoogleCodeExporter commented 9 years ago
I am aware of the problem, but fixing it requires some deeper changes to how the temporary files are handled.
The workaround at the moment is to use a large volume size, as that decreases the number of files that need to be opened.

I have suggested using a "pool" of open files, but a better rewrite would be not to store the signature files individually, and instead store the values in a database. Once that is done, a lot of new options open up.

I am currently building a new storage format that uses a local database to keep 
track of all the files and signatures. The benefits of this new format are 
many, such as de-duplication, moved file detection, no backup chains, fast 
queries, etc.

I am not working on this particular issue, because it will eventually be resolved once 2.0 is out. However, we are heavily delayed in writing the new GUI, so there will probably be a 2.0 CLI preview version released with the new storage format that you can try out while we are building the new UI.

Original comment by kenneth@hexad.dk on 17 Jun 2013 at 9:08