GoogleCodeExporter opened 9 years ago
I've finally had some time to do some research on this and somewhat understand
what is going on. Heh, I told ya it might take me a bit. I'm somewhat
comfortable with MonoDevelop now, though, and I also had a little downtime over
the weekend to follow what the code is doing.
There is one question I'm wondering about, though; I didn't follow the code to
see what was going on with this part. Are the patch files that are getting
stored in /tmp (the Linux location) just signature files you rename to
patch.zip files? That is what it looks like to me. I wouldn't think you would
recompress them, since they look to be the same data with just a different file
name. I did find a variable in the same part of the RSyncDir code that wasn't
being used and was taking up extra memory, though (m_patches). Maybe you had
some idea you planned to use it for but abandoned later; I know how that goes. :)
Here is what I think is going on: when you have a backup, most likely a full
one or files with a lot of data, it splits into a lot of volumes for that date.
Are the signatures just for the data located in that same zip, or is the
signature file replicated between the splits? Say a file you are backing up
holds more than 10 MB of data: does each zip only store the signature for the
part of the file in that zip? Do you see what I mean?
For efficiency, I would imagine each zip holds just the signature for its part
of the file. For ease, the signature would either be replicated or not placed
in that zip at all. I'm trying to figure out which approach you took.
This affects the best approach to fix the too many volumes open issue.
I have a couple of ideas:
1) One would be to have a single signature zip file per date that is backed up.
That would cut down on the amount of parts of the zip. This would take a good
bit of re-writing though.
2) Decompress the signatures into a temp folder or something. This could take
up a good bit of space, but of course a lot less than a whole file would. This
could be easy or complicated to write code for; it depends on how you are
packing the signatures into the zip files. This is why I asked you the question
above about how you were doing the signatures.
3) Database file that stores all the signatures, then do diffs on the database
as you do incremental subsequent backups. Or just store the Insert queries in
a file and build the database back up from that.
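Idea (3) could be sketched roughly like this. This is a hypothetical sketch, not Duplicati code; the table layout and function names are made up for illustration. The point is that an incremental backup just replaces the stored signature row for each changed path, which is the "diff on the database" step:

```python
import sqlite3

# Hypothetical signature store: one row per backed-up path, holding the
# most recent rsync-style signature. An in-memory DB keeps the sketch small.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signatures (path TEXT PRIMARY KEY, backup_time TEXT, sig BLOB)"
)

def store_signature(path, backup_time, sig_bytes):
    # An incremental backup overwrites the previous signature for the path.
    conn.execute(
        "INSERT OR REPLACE INTO signatures VALUES (?, ?, ?)",
        (path, backup_time, sig_bytes),
    )

def load_signature(path):
    row = conn.execute(
        "SELECT sig FROM signatures WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None

store_signature("docs/a.txt", "2013-02-25", b"\x01\x02")
store_signature("docs/a.txt", "2013-02-26", b"\x03\x04")  # incremental update
```

Persisting the SQL statements (or just the database file itself) would give the "rebuild from insert queries" variant of the same idea.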
Those are my ideas so far. A couple of answers would be helpful though.
Thanks,
Chris
Original comment by cwhitehe...@gmail.com
on 25 Feb 2013 at 1:47
More questions...
Ok, I got the signature thing.
What is the reason for the hashes over your zip files in the Manifest file? Is
it strictly for data integrity reasons, i.e. the integrity of each zip? There
are definitely some speed improvements you could get by using something like
the MurmurHash3 hash; you don't need the cryptographic properties of SHA-256.
Do you do rsync over all files to determine if they have changed? I'm thinking
you do. Wouldn't it be faster to have a hash signature of each whole file to be
backed up, and then only do rsync if the hash is different? Or are you
concerned about collisions?
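The hash-gate suggestion above could look roughly like this (a minimal sketch with made-up names, not Duplicati's API): keep one whole-file hash per path, and only fall through to the expensive rsync signature/diff step when the hash differs.

```python
import hashlib

def file_changed(data: bytes, known_hashes: dict, path: str) -> bool:
    """Return True if the rsync pass is needed for this file."""
    h = hashlib.sha256(data).hexdigest()
    if known_hashes.get(path) == h:
        return False          # hash matches: skip the rsync pass entirely
    known_hashes[path] = h    # record the new hash; caller computes the diff
    return True
```

With SHA-256 the collision concern is effectively moot; with a non-cryptographic hash like MurmurHash3 it becomes a speed/robustness trade-off.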
Original comment by cwhitehe...@gmail.com
on 26 Feb 2013 at 10:03
Yes, the patch-xxx.zip files in /tmp are the signature files renamed.
There is an issue somewhere requesting that these are just read from the cache
folder, and not copied into temp, as that is a waste of time, but it requires
some rewriting to get working.
The presence of a signature file is considered "proof" that the matching
content/diff data is in place.
This means that the signature for a file is not uploaded until the file is
fully backed up (i.e. after all split parts are stored).
This simplifies the incremental backup stage, as it can simply assume that once
it discovers a signature file, the data is there, and there is no need to
examine the content volumes.
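The "signature is proof" rule above can be sketched in a few lines (hypothetical names, just to illustrate the resume logic): because a file's signature is only uploaded after all its content volumes, the presence of a signature is enough to skip that file on resume, without examining content volumes.

```python
def files_still_needed(files, uploaded_signatures):
    # A file with a signature on the backend is fully backed up; everything
    # else must be (re)processed after an interrupted run.
    return [f for f in files if f not in uploaded_signatures]

uploaded = {"a.txt", "b.txt"}           # signatures present on the backend
pending = files_still_needed(["a.txt", "b.txt", "c.txt"], uploaded)
```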
1) That is most likely not possible. The backends may have per-file size
limits, so simply collecting signatures could create a too-large signature
volume. It also defeats the idea that you can stop the backup at any time
(e.g. the cable is cut) and still continue next time.
2) I think this would cause a huge overhead. Each signature file should only be
used once.
3) Yes, that was an idea I had a while back. Instead of the signature cache
storing files, simply build a database of the data. That would speed up many
things and allow local queries.
For the hash, yes, it is purely for integrity reasons. SHA-256 provides strong
tamper protection, and the time spent hashing the compressed volume is not
currently a limiting factor.
Normally, the timestamp is checked to see if a full rsync should be done. If
the timestamp has not changed, the file is not checked. When the file is
checked, a full signature is calculated. If the signature differs from the
previous version, the file is read again to produce the diff. This two-pass
approach is a bit annoying and needs some special handling for cases where the
file is opened by another process. It can in theory be handled by generating
the diff at the same time the signature is calculated, but it requires some
tricks to avoid writing the content data if the file turns out not to have
changed. Not too hard though, as the diff will only be a few bytes, but still
some work.
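The flow described above can be condensed into a sketch (illustrative names only; a plain SHA-256 stands in for the real rsync signature): unchanged timestamp means skip, changed timestamp means compute the signature, and only a changed signature triggers the second read that produces the diff.

```python
import hashlib

def scan_file(data, mtime, last_mtime, last_sig):
    """Decide what the backup pass must do with one file."""
    if mtime == last_mtime:
        return "skipped", last_sig           # pass 0: timestamp unchanged
    sig = hashlib.sha256(data).hexdigest()   # pass 1: compute full signature
    if sig == last_sig:
        return "touched-only", sig           # timestamp moved, content did not
    return "diff-needed", sig                # pass 2 would re-read for the diff
```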
My idea for solving this would be a two-stage solution:
S1) Simply modify the compression wrapper to open/close the files as needed.
This should be reasonably simple to do, but will have a negative impact on
performance.
S2) After testing that (S1) works, implement an "open file MRU cache", so there
is a limit on how many signature files are kept open. With a little luck, this
will almost remove the overhead of the open/close stuff. It should be possible
to hide this detail in the CompressionWrapper as well, so the interfacing code
does not need to change.
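The MRU cache in (S2) could be sketched like this (Python sketch of the idea; `OpenFileCache` and the `opener` callable are made-up names standing in for the real CompressionWrapper plumbing): cap how many archives are open at once, and close the least recently used one when the cap is hit.

```python
from collections import OrderedDict

class OpenFileCache:
    """Keep at most `limit` handles open; evict the least recently used."""

    def __init__(self, opener, limit=8):
        self._opener = opener
        self._limit = limit
        self._open = OrderedDict()   # name -> handle, most recent last

    def get(self, name):
        if name in self._open:
            self._open.move_to_end(name)            # mark most recently used
            return self._open[name]
        if len(self._open) >= self._limit:
            _, handle = self._open.popitem(last=False)  # evict the LRU entry
            handle.close()
        handle = self._opener(name)
        self._open[name] = handle
        return handle
```

Hiding this behind the existing open/close calls would let the interfacing code stay unchanged, as suggested.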
I have, however, started on a more radical solution, which is a block-based
format. You can see the code in sandboxes/Kenneth/ForestHash. This new format
actually uses a local database to keep track of which blocks are where, and can
operate in an entirely file-by-file manner, and is thus not prone to the
too-many-files-open problem. It still requires quite a bit of tuning and
cleanup, so it is not ready to be used in production yet. At this point, I am
not sure whether the current full/incremental approach should be deprecated
once the block-based stuff is done.
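The core of a block-based format like the one described above can be sketched in a few lines (a toy illustration, not the ForestHash code; the tiny block size and the dict-as-database are stand-ins): a local index maps each block hash to where the block lives, so a block seen before is never stored twice.

```python
import hashlib

BLOCK_SIZE = 4  # absurdly small, just to make the example visible

def store(data, block_db, volume):
    """Split data into blocks; record only blocks not seen before."""
    new_blocks = 0
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h not in block_db:          # de-duplication happens here
            block_db[h] = (volume, i)  # remember where this block is stored
            new_blocks += 1
    return new_blocks

db = {}
```

Because the index is local, questions like "is this block already backed up?" become cheap lookups instead of remote operations, which is what enables moved-file detection and fast queries.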
Original comment by kenneth@hexad.dk
on 26 Feb 2013 at 1:30
I was looking at it more this morning.
It looks like there is a possible simple fix for this issue. There might need
to be a 3rd constructor added to RSyncDir, though, or a rewrite of the one that
deals with incrementals.
I'm just gonna have to examine that patches variable and make sure what all it
is being used for. I also thought about what you were saying about the
compression wrapper: have a way to release the zips after the info is loaded
into the object. That could be a pain though; not sure yet.
Alright Kenneth, have a good day man...
Original comment by cwhitehe...@gmail.com
on 26 Feb 2013 at 2:51
Alright, found a very simple fix for this issue!
It looks like a solid fix, even though not necessarily the best fix. There are
a lot of things that look like they could be made more efficient, or at least I
would have chosen a different way to do a couple of things. That doesn't mean
right or wrong, and I'm sure as you were coding, things came up that you had to
deal with one way or another.
Force a Dispose() at the end of:
public RSyncDir(string[] sourcefolder, CommunicationStatistics stat,
Backupmetadata metadata, Utility.FilenameFilter filter,
List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)
on patches.Value.
So basically at the end of that constructor loop, after the last use of the z
variable. I rewrote the loop a little differently and changed the foreach over
to a for.
CODE:
public RSyncDir(string[] sourcefolder, CommunicationStatistics stat,
    Backupmetadata metadata, Utility.FilenameFilter filter,
    List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)
    : this(sourcefolder, stat, metadata, filter)
{
    string[] prefixes = new string[] {
        Utility.Utility.AppendDirSeparator(COMBINED_SIGNATURE_ROOT),
        Utility.Utility.AppendDirSeparator(CONTENT_SIGNATURE_ROOT),
        Utility.Utility.AppendDirSeparator(DELTA_SIGNATURE_ROOT)
    };

    /* NOT USED
    m_patches = new List<CompressionWrapper>();
    foreach (KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches)
        m_patches.Add(patch.Value);
    */

    for (int tI = 0; tI < patches.Count; tI++)
    {
        CompressionWrapper z = patches[tI].Value;

        if (z.FileExists(DELETED_FILES))
            foreach (string s in z.ReadPathLines(DELETED_FILES))
            {
                m_oldSignatures.Remove(s);
                m_lastVerificationTime.Remove(s);
                m_oldSymlinks.Remove(s);
            }

        foreach (string prefix in prefixes)
        {
            ArchiveWrapper aw = new ArchiveWrapper(z, patches[tI].Key.Time.ToUniversalTime(), prefix);
            foreach (string f in z.ListFiles(prefix))
            {
                string name = f.Substring(prefix.Length);
                m_oldSignatures[name] = aw;
                m_lastVerificationTime.Remove(name);
            }
        }

        string symlinkprefix = Utility.Utility.AppendDirSeparator(SYMLINK_ROOT);
        foreach (string s in z.ListFiles(symlinkprefix))
            m_oldSymlinks[s.Substring(symlinkprefix.Length)] = z.ReadPathString(s);

        if (z.FileExists(UNMODIFIED_FILES))
            foreach (string s in z.ReadPathLines(UNMODIFIED_FILES))
                m_lastVerificationTime[s] = patches[tI].Key.Time.ToUniversalTime();

        if (z.FileExists(DELETED_FOLDERS))
            foreach (string s in z.ReadPathLines(DELETED_FOLDERS))
                m_oldFolders.Remove(s);

        if (z.FileExists(ADDED_FOLDERS))
        {
            DateTime t = z.GetLastWriteTime(ADDED_FOLDERS).ToUniversalTime();
            string[] filenames = z.ReadPathLines(ADDED_FOLDERS);

            if (z.FileExists(ADDED_FOLDERS_TIMESTAMPS))
            {
                string[] timestamps = z.ReadAllLines(ADDED_FOLDERS_TIMESTAMPS);
                long l;
                for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                    if (long.TryParse(timestamps[i], out l))
                        m_oldFolders[filenames[i]] = Library.Utility.Utility.EPOCH.AddSeconds(l);
                    else
                        m_oldFolders[filenames[i]] = t;
            }
            else
            {
                foreach (string s in filenames)
                    m_oldFolders[s] = t;
            }
        }

        if (z.FileExists(UPDATED_FOLDERS) && z.FileExists(UPDATED_FOLDERS_TIMESTAMPS))
        {
            string[] filenames = z.ReadPathLines(UPDATED_FOLDERS);
            string[] timestamps = z.ReadAllLines(UPDATED_FOLDERS_TIMESTAMPS);
            long l;
            for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                if (long.TryParse(timestamps[i], out l))
                    m_oldFolders[filenames[i]] = Utility.Utility.EPOCH.AddSeconds(l);
        }

        // The latest file is the most valid
        if (z.FileExists(USN_VALUES))
            using (System.IO.Stream s = z.OpenRead(USN_VALUES))
                m_lastUSN = new USNRecord(s);

        patches[tI].Value.Dispose();
    }
}
END CODE
Original comment by cwhitehe...@gmail.com
on 2 Mar 2013 at 2:05
By the way, you may want to un-comment that m_patches variable. I changed some
other parts of the code since that variable is never used.
Original comment by cwhitehe...@gmail.com
on 2 Mar 2013 at 2:08
Here is the code using the original foreach loop:
CODE:
public RSyncDir(string[] sourcefolder, CommunicationStatistics stat,
    Backupmetadata metadata, Utility.FilenameFilter filter,
    List<KeyValuePair<ManifestEntry, CompressionWrapper>> patches)
    : this(sourcefolder, stat, metadata, filter)
{
    string[] prefixes = new string[] {
        Utility.Utility.AppendDirSeparator(COMBINED_SIGNATURE_ROOT),
        Utility.Utility.AppendDirSeparator(CONTENT_SIGNATURE_ROOT),
        Utility.Utility.AppendDirSeparator(DELTA_SIGNATURE_ROOT)
    };

    /* NOT USED
    m_patches = new List<CompressionWrapper>();
    foreach (KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches)
        m_patches.Add(patch.Value);
    */

    foreach (KeyValuePair<ManifestEntry, CompressionWrapper> patch in patches)
    {
        CompressionWrapper z = patch.Value;

        if (z.FileExists(DELETED_FILES))
            foreach (string s in z.ReadPathLines(DELETED_FILES))
            {
                m_oldSignatures.Remove(s);
                m_lastVerificationTime.Remove(s);
                m_oldSymlinks.Remove(s);
            }

        foreach (string prefix in prefixes)
        {
            ArchiveWrapper aw = new ArchiveWrapper(z, patch.Key.Time.ToUniversalTime(), prefix);
            foreach (string f in z.ListFiles(prefix))
            {
                string name = f.Substring(prefix.Length);
                m_oldSignatures[name] = aw;
                m_lastVerificationTime.Remove(name);
            }
        }

        string symlinkprefix = Utility.Utility.AppendDirSeparator(SYMLINK_ROOT);
        foreach (string s in z.ListFiles(symlinkprefix))
            m_oldSymlinks[s.Substring(symlinkprefix.Length)] = z.ReadPathString(s);

        if (z.FileExists(UNMODIFIED_FILES))
            foreach (string s in z.ReadPathLines(UNMODIFIED_FILES))
                m_lastVerificationTime[s] = patch.Key.Time.ToUniversalTime();

        if (z.FileExists(DELETED_FOLDERS))
            foreach (string s in z.ReadPathLines(DELETED_FOLDERS))
                m_oldFolders.Remove(s);

        if (z.FileExists(ADDED_FOLDERS))
        {
            DateTime t = z.GetLastWriteTime(ADDED_FOLDERS).ToUniversalTime();
            string[] filenames = z.ReadPathLines(ADDED_FOLDERS);

            if (z.FileExists(ADDED_FOLDERS_TIMESTAMPS))
            {
                string[] timestamps = z.ReadAllLines(ADDED_FOLDERS_TIMESTAMPS);
                long l;
                for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                    if (long.TryParse(timestamps[i], out l))
                        m_oldFolders[filenames[i]] = Library.Utility.Utility.EPOCH.AddSeconds(l);
                    else
                        m_oldFolders[filenames[i]] = t;
            }
            else
            {
                foreach (string s in filenames)
                    m_oldFolders[s] = t;
            }
        }

        if (z.FileExists(UPDATED_FOLDERS) && z.FileExists(UPDATED_FOLDERS_TIMESTAMPS))
        {
            string[] filenames = z.ReadPathLines(UPDATED_FOLDERS);
            string[] timestamps = z.ReadAllLines(UPDATED_FOLDERS_TIMESTAMPS);
            long l;
            for (int i = 0; i < Math.Min(filenames.Length, timestamps.Length); i++)
                if (long.TryParse(timestamps[i], out l))
                    m_oldFolders[filenames[i]] = Utility.Utility.EPOCH.AddSeconds(l);
        }

        // The latest file is the most valid
        if (z.FileExists(USN_VALUES))
            using (System.IO.Stream s = z.OpenRead(USN_VALUES))
                m_lastUSN = new USNRecord(s);

        patch.Value.Dispose();
    }
}
END CODE
If you don't want to use Dispose(), then the constructor signature will have
to be changed, or an additional constructor added. Otherwise, you will have to
figure out a way in CompressionWrapper to monitor the use of that object. It
will end up taking more resources that way, but if you want to keep it in the
CompressionWrapper class, it can be done.
Have a good weekend man,
Chris
Original comment by cwhitehe...@gmail.com
on 2 Mar 2013 at 3:52
Issue 798 has been merged into this issue.
Original comment by rst...@gmail.com
on 3 Mar 2013 at 8:49
This issue was updated by revision ab8b3e5ca168.
Removed the m_patches variable that is not used.
Original comment by kenneth@hexad.dk
on 8 Mar 2013 at 10:18
You cannot just dispose the CompressionWrapper there, as that will also close
the archive.
That will create an error when accessing m_oldSignatures[relname] later on.
Did you try your patch?
(Nice catch with the other m_patches variable :) )
Original comment by kenneth@hexad.dk
on 8 Mar 2013 at 10:25
Yea, it works...
I traced the patches object back over into Interfaces, and it isn't used after
it is used in that part of RSyncDir.
Are you sure?
Original comment by cwhitehe...@gmail.com
on 8 Mar 2013 at 11:09
I'm re-writing that whole part of the code in Interfaces; I've been dabbling on
and off with it this week. Instead of passing that patches object over, I was
trying to pass over the objects needed to construct a compressor in RSyncDir.
entries looks like it holds pretty much everything needed to construct a
CompressionWrapper.
Original comment by cwhitehe...@gmail.com
on 8 Mar 2013 at 11:13
Yes, positive:
The variable z is assigned here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#672
And a reference is kept through ArchiveWrapper in m_oldSignatures here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#684
And used here:
https://code.google.com/p/duplicati/source/browse/Duplicati/Library/Main/RSyncDir.cs#1623
If it is disposed, that read will fail. It will not be accessed if you do a
full backup, though.
Original comment by kenneth@hexad.dk
on 8 Mar 2013 at 11:14
I think I know why it didn't fail. I had a huge file, and I had changed the
split size to 1 MB instead of a bunch of files, so it would split 1.2 GB into
1200 or so archives. I'm guessing that because there was only one signature in
that file, that is why it did not fail.
I'll do more research when I get home...
Thanks for the pointers. ;)
Original comment by cwhitehe...@gmail.com
on 8 Mar 2013 at 11:29
Yep... you are right!
Damn, I didn't notice that. This is going to be a bit more complex than
expected to deal with. I didn't realize ArchiveWrapper was holding on to that
part of the patches object. It is going to be tough to make this efficient.
Original comment by cwhitehe...@gmail.com
on 9 Mar 2013 at 1:20
Yes. I think the only way to solve it is to keep a pool of open files.
That will not perform the best possible way, but it should put a damper on the
number of times files are opened and closed, without keeping too many files
open.
Original comment by kenneth@hexad.dk
on 14 Mar 2013 at 9:19
I was looking through, trying to figure out exactly which signatures were
being used by ArchiveWrapper.
I'm not sure at what point the code is using the rsync-generated signatures.
Does it only use those if another check fails? Like a SHA-256 check, and then
it uses the rsync signatures to figure out what part of the file has changed?
I started writing code for a database to cache the signatures. Dealing with
multiple OSes looks like a real pain in the butt.
Here is what I was thinking: either save the smaller signatures in memory or a
database. Then, when it has to use the larger signatures, the archive will just
have to be re-opened.
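That tiered idea could be sketched like this (hypothetical names; `reopen_archive` stands in for whatever re-opens the signature zip): small signatures stay resident in memory, and anything over the threshold is fetched by re-opening the archive on demand.

```python
SMALL_LIMIT = 1024  # bytes; signatures at or below this stay in memory

class SignatureCache:
    def __init__(self, reopen_archive):
        self._small = {}               # name -> signature bytes (cheap)
        self._reopen = reopen_archive  # callable: name -> signature bytes

    def put(self, name, sig):
        if len(sig) <= SMALL_LIMIT:
            self._small[name] = sig    # small enough to keep resident

    def get(self, name):
        if name in self._small:
            return self._small[name]
        return self._reopen(name)      # expensive path: re-open the archive
```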
Original comment by cwhitehe...@gmail.com
on 14 Mar 2013 at 2:37
Issue 799 has been merged into this issue.
Original comment by kenneth@hexad.dk
on 21 Apr 2013 at 7:35
Not sure if you're accepting "me too"s, but Duplicati 1.3.4 on Ubuntu Raring
13.04 is hanging for me at about 30%. This last time it was hanging on a
simple, small, old HTML file that I haven't touched in years.
$ lsof -p $(pidof Duplicati) | wc -l
1651
Original comment by pjala...@gigalock.com
on 11 Jun 2013 at 3:35
If they are working on a platform-neutral GUI, fixing this bug in this
.NET/Mono version might just be ignored.
Original comment by yfr...@gmail.com
on 11 Jun 2013 at 3:49
I am aware of the problem, but fixing it requires some deeper changes to how
the temporary files are handled.
The workaround at the moment is to use a large volume size, as that decreases
the number of files that need to be opened.
I have suggested using a "pool" of open files, but a better rewrite would be to
not store the signature files individually but rather store the values in a
database. Once you do this, a lot of new options open up.
I am currently building a new storage format that uses a local database to keep
track of all the files and signatures. The benefits of this new format are
many, such as de-duplication, moved file detection, no backup chains, fast
queries, etc.
I am not working on this particular issue, because it will eventually be
resolved once 2.0 is out. However, we are heavily delayed with writing the new
GUI, so there will probably be a 2.0 CLI preview version released with the new
storage format, so you can try it out while we are building the new UI.
Original comment by kenneth@hexad.dk
on 17 Jun 2013 at 9:08
Original issue reported on code.google.com by
kenneth@hexad.dk
on 13 Dec 2012 at 9:18