Berimor66 / duplicati

Automatically exported from code.google.com/p/duplicati

Interoperability with RSync #270

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Duplicati has been designed with different backends (without dedicated backup
software on the remote end) in mind; Duplicity, for comparison, relies on the
rsync algorithm (librsync). Still, it would be nice if Duplicati allowed better
interoperability with rsync, producing a fixed set of output files so that only
the diff needs to be sent to an rsync backend (similar to running rsync on the
source files in the first place). The requirement is therefore that consecutive
full backups be as similar as possible. The benefit is being able to use, for
instance, inexpensive NAS devices for backups and still keep multiple copies on
the NAS (using rsync's link option, --link-dest, or by running a 'cp -al' script
on the NAS), while at the same time reducing the bandwidth consumed by (full)
backups.

The suggestion is to add settings that would allow:
a) Consecutive full / incremental backups of the same source file set to produce
the same encrypted output, so that a change in one file does not alter the whole
backup.
b) User-defined naming of files instead of the current fixed date/time format.
Suggested tokens could include day/month/year, hour/minute/second and weekday,
and these elements could be mixed and appended to the prefix.

A recurring weekly full backup with daily incremental backups copied over rsync
could thus contain files like 'duplicati-full-content.Mon' and
'duplicati-inc-content.Tue'. Since the file names repeat every week, rsync could
be used for off-site backups.
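
For illustration, a minimal Python sketch of the weekday-based naming scheme suggested above (the helper function and its parameters are hypothetical, not existing Duplicati options):

```python
from datetime import date

# Hypothetical naming helper for the weekday-based scheme suggested above.
# The parameters (prefix, full, when) are illustrative, not Duplicati settings.
def volume_name(prefix: str, full: bool, when: date) -> str:
    backup_type = "full" if full else "inc"
    weekday = when.strftime("%a")          # Mon, Tue, ...
    return f"{prefix}-{backup_type}-content.{weekday}"

print(volume_name("duplicati", full=True,  when=date(2010, 9, 20)))   # duplicati-full-content.Mon
print(volume_name("duplicati", full=False, when=date(2010, 9, 21)))   # duplicati-inc-content.Tue
```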

Original issue reported on code.google.com by Henrik68@gmail.com on 18 Sep 2010 at 11:56

GoogleCodeExporter commented 9 years ago
This is a good request.

It is currently difficult for two reasons:
1) The file order is deliberately random
2) The encryption key is deliberately random.

Fixing (1) could perhaps be done by extending the listing of file content to
also record which volume each file was found in, and then attempting to use
that order instead of the random one.
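
As a rough Python sketch of that idea (not Duplicati code): order the files for a new backup by the volume they occupied in the previous backup, appending files the previous backup did not contain.

```python
# Minimal sketch: `previous_layout` maps path -> volume index from the last
# backup, so unchanged data tends to land in the same place again.
def deterministic_order(paths, previous_layout):
    known = sorted((p for p in paths if p in previous_layout),
                   key=lambda p: (previous_layout[p], p))
    new = sorted(p for p in paths if p not in previous_layout)
    return known + new        # new files are appended at the end

previous_layout = {"a.txt": 0, "b.txt": 0, "c.txt": 1}
print(deterministic_order(["c.txt", "b.txt", "d.txt", "a.txt"], previous_layout))
# ['a.txt', 'b.txt', 'c.txt', 'd.txt']
```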

Fixing (2) is a bit harder, as it would require that the individual file keys
can be recovered in some way without compromising them. It could perhaps be as
simple as storing the file keys in the manifest or a similar complementary
file, which is then encrypted. The current code does not easily allow getting
or setting this key, however. Since the encryption uses chaining, a single byte
change causes a cascading, random-looking effect through the rest of the file.
That is at least the case for AES; I'm not sure whether GPG uses a file key, or
whether such a key is even accessible.
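
To illustrate the cascading effect of chaining, here is a small Python demonstration using the third-party `cryptography` package (AES-CBC); Duplicati itself is .NET, so this is purely illustrative:

```python
# Changing a single plaintext byte alters every ciphertext block from that
# point onward when CBC chaining is used.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(32), os.urandom(16)

def encrypt_cbc(plaintext: bytes) -> bytes:
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return enc.update(plaintext) + enc.finalize()

original = b"A" * 64                     # four 16-byte AES blocks
modified = b"A" * 31 + b"B" + b"A" * 32  # one byte changed in block 1

c1, c2 = encrypt_cbc(original), encrypt_cbc(modified)
for i in range(0, 64, 16):
    same = c1[i:i+16] == c2[i:i+16]
    print(f"block {i // 16}: {'identical' if same else 'different'}")
# block 0: identical
# block 1: different
# block 2: different
# block 3: different
```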

There is also a problem with the compression, which may produce large changes
in the output when the input changes only slightly. I'm not really sure how it
could be handled if a file is resized and potentially spans multiple volumes.

The only solution I can think of is to mimic a filesystem inside the archives,
so there is complete control over which blocks go where, allowing parts of
growing files to be appended and leaving space for files that shrink. But this
does not work well with compression, which tends to produce blocks of differing
sizes, and compressing the resulting file would potentially cause large changes.

I see the point of this request, but I have no clear view of how to implement
it without designing a new system with this particular issue as the focal point.

I'll leave the request as accepted, but I will not actively work on it for the 
time being.

Original comment by kenneth....@gmail.com on 20 Sep 2010 at 1:28

GoogleCodeExporter commented 9 years ago
I do not know the inner workings of Duplicati, so I will take the liberty of
thinking aloud here…

If the encryption key is random, then I would assume the key is stored
somewhere in the backup to allow a restore (most likely encrypted using the
passphrase entered when the backup was defined). Since the files are
recoverable using the key stored in the Duplicati config, I see no harm in
storing the encryption key in the config as well(??) and reusing it for the
next full backup.

For the full backups, couldn't a compromise be to stray a bit from the
principle of fixed-size volumes? AES in CBC mode has behaviour that effectively
breaks the rsync idea; could each individual file be broken down into (user?)
defined blocks (for instance 2K, 8K, 64K or …) before compression / encryption?
Each volume could then contain a (user?) defined number of blocks, and each new
file would start a new volume. Volume names could include a sequence number for
the file and the number of the first block in the volume. If a file is altered
between two full backups, only the affected block(s), and thus the
corresponding volume(s), change. That should also simplify shrink / growth
scenarios. Incremental backups could keep the existing creation methodology
(and, come to think of it, even the existing naming standard).
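
As a rough Python sketch of this block-splitting idea (illustrative only; the block and volume sizes are the hypothetical user-defined values mentioned above, not Duplicati settings):

```python
# Split a file into fixed-size blocks before compression / encryption, so a
# local change only touches the block(s) and volume(s) that contain it.
import hashlib, zlib

BLOCK_SIZE = 8 * 1024          # one of the suggested user-defined block sizes
BLOCKS_PER_VOLUME = 4          # user-defined number of blocks per volume

def blocks(data: bytes):
    for offset in range(0, len(data), BLOCK_SIZE):
        yield data[offset:offset + BLOCK_SIZE]

def changed_volumes(old: bytes, new: bytes):
    """Return the volume indices whose content differs between two backups."""
    volumes = set()
    for index, (a, b) in enumerate(zip(blocks(old), blocks(new))):
        # Per-block compression; the hash stands in for "compressed + encrypted output".
        if hashlib.sha256(zlib.compress(a)).digest() != hashlib.sha256(zlib.compress(b)).digest():
            volumes.add(index // BLOCKS_PER_VOLUME)
    return sorted(volumes)

old = bytes(64 * 1024)                       # 64 KiB -> 8 blocks, 2 volumes
new = old[:40_000] + b"\x01" + old[40_001:]  # flip one byte in block 4
print(changed_volumes(old, new))             # [1] -> only volume 1 changes
```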

Original comment by Henrik68@gmail.com on 20 Sep 2010 at 7:01

GoogleCodeExporter commented 9 years ago
Yes, the file keys could be stored twice, which would not affect security since
they are encrypted with the backup password; good point.

CBC is the chaining method I mentioned. Encrypting individual blocks can be
done, but must be done carefully to avoid introducing an ECB (Electronic Code
Book)-like setup.
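
A quick Python illustration of that ECB concern, again using the `cryptography` package purely for demonstration:

```python
# Encrypting repeated plaintext blocks with the same key and no chaining or IV
# variation yields repeated ciphertext blocks, leaking structure.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)
enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
ciphertext = enc.update(b"same 16 bytes..." * 4) + enc.finalize()

chunks = [ciphertext[i:i+16] for i in range(0, len(ciphertext), 16)]
print(len(set(chunks)))   # 1 -> all four ciphertext blocks are identical
```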

So the plan would be:
1) Define a file system, so that file contents can be written as a continuous
stream, e.g. write file 1 (n1 bytes), then file 2 (n2 bytes), possibly aligned
to a block size. There is a trade-off between size and adaptation to minor
changes.
2) When making a new full backup, reuse the same layout, potentially leaving
holes for shrunken files and creating new blocks (at the end) for files that
have grown (holes can also be reused).
3) Apply custom encryption on a block-by-block basis, reusing the keys from the
previous backup. This block size need not be the same as in (1).
4) Compression?
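
A hedged Python sketch of steps (1) and (2), assuming a simple per-file extent map (the names and data structures are illustrative, not Duplicati internals):

```python
# Keep each file's byte range ("extent") in the archive stream from the
# previous full backup; shrunken files leave holes, grown or new files get
# fresh space appended at the end.
def allocate(previous_extents, sizes):
    """previous_extents: {path: (offset, length)} from the last full backup.
    sizes: {path: current_length}.  Returns the new extent map."""
    extents = {}
    end = max((off + length for off, length in previous_extents.values()), default=0)
    for path, size in sorted(sizes.items()):
        old = previous_extents.get(path)
        if old and size <= old[1]:
            extents[path] = (old[0], size)   # fits its old slot, may leave a hole
        else:
            extents[path] = (end, size)      # grown or new: append at the end
            end += size
    return extents

prev = {"a": (0, 100), "b": (100, 50)}
print(allocate(prev, {"a": 80, "b": 70, "c": 10}))
# {'a': (0, 80), 'b': (150, 70), 'c': (220, 10)}
```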

Compression must be performed before encryption, as encrypted data is not
compressible. If compression were applied to the entire stream, it might
compress differently and cause unwanted cascading changes. If compression were
applied to fixed blocks, the result would be blocks of differing sizes. That
leaves the option of compressing each file individually before it enters step
(1) above. Perhaps a special compression algorithm could be used to reduce the
amount of change within a compressed file.

This may require a large amount of temporary space, as a file can become
fragmented across many volumes and thus require that all of those volumes be
downloaded before a restore can complete. There is also overhead in storing the
archive layout information, and files with many holes are space-inefficient.

The worst part is the compression, which can potentially break everything; I
don't know of any good compression algorithms that handle this well. The gzip
program has the --rsyncable option, but I think the benefit of that particular
method would be destroyed by the encryption, as the compressed output is not
synchronized with the block structure used for the encryption.

The only thing I can think of that would solve this is to interleave the
compression and encryption, so that the compression output blocks are encrypted
independently. If the compression can somehow synchronize its output, as with
the --rsyncable option, this may be achievable. In such a setup the file format
could actually be something as simple as the encrypted blocks appended one
after another. Since the blocks are processed independently, unchanged data
should produce matching sequences in the output, which rsync will detect.
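
A minimal Python sketch of that interleaved approach, assuming fixed plaintext blocks, per-block compression, and a deterministically derived per-block IV so that unchanged blocks produce identical output across backups (the deterministic IVs are a real security trade-off and a real design would need careful review; none of this is Duplicati's actual format):

```python
import hashlib, os, zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

BLOCK_SIZE = 8 * 1024
key = os.urandom(32)            # would be stored (encrypted) in the manifest

def seal_block(index: int, block: bytes) -> bytes:
    # IV derived from key + block index, so the same input seals identically.
    iv = hashlib.sha256(key + index.to_bytes(8, "big")).digest()[:16]
    payload = zlib.compress(block)
    payload += b"\x00" * (-len(payload) % 16)   # pad to the AES block size
    # (a real format would also store the payload length for decryption)
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return enc.update(payload) + enc.finalize()

def seal_stream(data: bytes) -> list:
    return [seal_block(i, data[o:o + BLOCK_SIZE])
            for i, o in enumerate(range(0, len(data), BLOCK_SIZE))]

old = bytes(32 * 1024)
new = bytes(10_000) + b"\x01" + bytes(32 * 1024 - 10_001)   # change in block 1
print([a == b for a, b in zip(seal_stream(old), seal_stream(new))])
# [True, False, True, True] -> only block 1 needs to be re-sent
```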

Since rsync is file based, a volume cannot overflow into another volume. If no
volume size limit is applied this is not a problem, but it may produce
disproportionate volume sizes. It could be handled with overflow volumes that
hold the overflowing data, but that may cause a large increase in the number of
files instead.

It sounds more and more like a new file format :).
Perhaps something like this already exists?

Original comment by kenneth....@gmail.com on 21 Sep 2010 at 11:22

GoogleCodeExporter commented 9 years ago
I agree. Rsync protocol support would be highly appreciated.

Original comment by michael....@gmail.com on 13 Aug 2012 at 8:55