fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

How to backup to small removable disks? #5

Closed SampsonF closed 3 years ago

SampsonF commented 3 years ago

I want to backup a large data set to small disks.

Can zpaqfranz auto create additional parts based on target size?

fcorbelli commented 3 years ago

In fact, no. I could try to develop this, but the primary use case (copying to a disk rather than to USB) has a rather different logic.

Multipart archives are made so that each part contains a new "snapshot" (aka the delta data).
Suppose you have a 500GB fileserver: the first backup will be huge (maybe 400GB), the following ones small (a few GB each).

To create this kind of multipart archive use something like

zpaqfranz a "y:\urketto_???.zpaq" c:\zpaqfranz

which, in this example, archives the folder c:\zpaqfranz.

It is possible to keep a much smaller file locally with the -index switch:

With add, create archive.zpaq as a suffix to append to a remote archive which is assumed to be identical to indexfile except that indexfile contains no compressed file contents (D blocks). Then update indexfile by appending a copy of archive.zpaq without the D blocks. With extract, specify the index to create for archive.zpaq and do not extract any files.

The purpose is to maintain a backup offsite without using much local disk space. The normal usage is to append the suffix at the remote site and delete it locally, keeping only the much smaller index. For example:

    zpaq add part files -index index.zpaq
    cat part.zpaq >> remote.zpaq
    rm part.zpaq
indexfile has no default extension. However, with a .zpaq extension it can be listed to show the contents of the remote archive or compare with local files. It cannot be extracted or updated as a regular archive. Thus, the following should produce identical output:

    zpaq list remote.zpaq
    zpaq list index.zpaq
If archive is multi-part (contains * or ?), then zpaq will substitute a part number equal to 1 plus the number of previous updates. The parts may then be accessed as a multi-part archive without appending or renaming.

With add, it is an error if the archive to be created already exists, or if indexfile is a regular archive. -index cannot be used with -until or a streaming archive -method s.... With extract, it is an error if indexfile exists and -force is not used to overwrite.

It is possible to combine the two (when compressing) with something like

zpaqfranz a "y:\urketto_???.zpaq" c:\zpaqfranz  -index z:\indice.zpaq

In this case you can move the "y:\urketto_???.zpaq" parts somewhere else (e.g. a USB disk).
Only z:\indice.zpaq is required (locally) to make another run.
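
Put together, a minimal sketch of one manual run (same paths as above; e:\ as the USB drive, as in the example further below):

    rem update the multipart archive, keeping only the small index locally
    zpaqfranz a "y:\urketto_???.zpaq" c:\zpaqfranz -index z:\indice.zpaq
    rem move the newly written part(s) to the removable disk
    move y:\urketto_*.zpaq e:\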

In fact it is possible to automate the moving with something like

zpaqfranz n y:\urketto*.zpaq -kill -exec movethefile.bat

(n is the decimation command; -kill makes it a wet run, i.e. it really does the work; -exec something => execute something with the LAST file passed as parameter %1)

And into "movethefile.bat" something like

move %1 e:\

(in this example e:\ is the USB drive). Quite complicated and, personally, I don't recommend it.

For use with removable devices of sufficient capacity, the best choice is a (simple) multipart archive with a local index (if the USB disk is slow); otherwise (if there is enough local space) you can even use rsync --append, to copy to the USB disk only the newly appended portion of the single file. The sum command is specifically designed to let you compare against the original files, just to be sure.

fcorbelli commented 3 years ago

I built a release with two new switches

https://github.com/fcorbelli/zpaqfranz/releases/tag/52.15

in add()

-exec_ok something.bat (launch a script with archivename as parameter)
-copy somewhere (write a second copy of the archive into somewhere)

Add 2nd copy to USB drive (U):       a "z:\f_???.zpaq" c:\nz\ -copy u:\usb
Launch pippo.bat after OK:           a "z:\g_???.zpaq" c:\nz\ -exec_ok u:\pippo.bat

These switches are designed to make a second copy of the archives (typically multipart) on removable media (e.g. USB). By default, a check is performed after copying. This slows things down, but is safer with potentially less reliable media.

In the -exec_ok script you can put a DELETE of the main archive, because zpaqfranz cowardly makes only a copy, not a move.
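
For example, a minimal sketch of such a script (the pippo.bat of the example above; %1 is the archive name passed by -exec_ok, assumed here to be the local archive you want to get rid of after the verified copy):

    rem pippo.bat - runs only after zpaqfranz reports OK
    rem %1 = archive name passed by -exec_ok; delete the local file, the copy on the USB drive remains
    del %1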

If the archive is not multipart (aka a single ZPAQ) it will be copied entirely to the second medium, and this can take a long time. In the future an -append switch will make things much faster (just like rsync --append). For safety reasons it currently requires free space larger than the file to be written. In the future, with -append, it will become... smarter.

It is not exactly what was requested, but an evolution that came to mind while reading your question, so thanks for the comment.

SampsonF commented 3 years ago

I did not make myself clear, sorry!

What I mean by removable disk is a hard disk connected to the computer in a hot-plug way - either USB, eSATA, or even SATA. Due to port limitations, only one drive can be connected at a time.

I remember that when using Zip (or PKZIP) to create a multi-volume archive targeting floppy disks, it would write about 1.4MB to the floppy, pause, allow me to change disks, and continue.

Not sure if zpaqfranz can do something similar.

These two new parameters are helpful.

Thank you very much! I will download 52.15 and try it out.

fcorbelli commented 3 years ago

You explained yourself very well. This behavior, i.e. writing a fixed number of bytes at a time, is common for 7z, rar etc.

This is possible (to oversimplify) because a streaming format is used: essentially the data is compressed and written to the output one byte at a time. It is certainly possible to implement such a behavior in zpaqfranz, but it requires considerable work, as the parallel compression mechanism (across the various threads) does not let you know in advance how much data will be written.

If the backup device has enough space (I would say optimal situation) there is no reason not to use the normal zpaq format, that is a single file, simple and functional.

I made the multivolume feature (in the ZPAQ sense, therefore NOT fixed-size blocks like zip or whatever) for a different problem, that is, the copy of the copy, normally to a NAS or - via rsync over ssh - to a remote server.

Normally in a server you will have a data disk (let's say master, disk 0 or C:) and one or more internal backup hard disks (let's say slaves, disk 1, 2, 3...D:), one or more NAS, one or more WAN copies (for disaster recovery)

Therefore a standard copy procedure is to copy the data from the master folder (of disk 0 / C:) to the slave/backup folder (of disk 1 / D:). If disk 0 fails, you will get the copies from disk 1. Of course the r command can handle more than one slave at a time (sketched below) :)

zpaqfranz a d:\slave\mybackup.zpaq c:\myfileserver c:\myseconddatafolder c:\whatever

This creates a mybackup.zpaq file which is normally very large (hundreds of GB, even TB). Suppose 500GB, made up of 500 versions of 400GB of original data (just an example), one per day for 500 days.

If I now want to make an ADDITIONAL copy (copy-of-the-copy), I can just do something like this

robocopy d:\slave \\mynas\secondbackup /mir
or
zpaqfranz r d:\slave \\mynas\secondbackup -all -kill -verify

or copy, or cp, or rsync, or whatever
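
As a sketch of the "more than one slave at a time" point above, the r command can take several destinations at once (the second NAS path here is just a placeholder):

    zpaqfranz r d:\slave \\mynas\secondbackup \\mynas2\thirdbackup -all -kill -verify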

Well, copying the 500GB zpaq file takes time, even a lot of time over LAN.

And even running rsync, two 500GB (slow) reads are mandatory (one of the master file, one of the copy, then the diff, then sending the diff). If you copy to a slow USB drive, this will take time too.

If you do a multipart zpaq archive instead, for example

zpaqfranz a "d:\slave\mybackup_????.zpaq" c:\myfileserver c:\myseconddatafolder c:\whatever

you will get something like this: a first 400GB file, then 499 files of roughly 200MB each (sometimes 1MB, sometimes 4GB; it doesn't matter)

Now, if you robocopy, rsync or whatever the d:\slave folder daily to anything else, you will copy ONLY 0.2GB (the last run), NOT the entire 500GB .zpaq, because mybackup_0001... 0499 were already copied... yesterday, the day before, ... 499 days ago.

The copy will therefore be essentially immediate, even with rsync over ssh (a cloud backup).
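
A daily job for this workflow is then just two lines (a sketch combining the commands above):

    rem add today's snapshot as a new part of the multipart archive...
    zpaqfranz a "d:\slave\mybackup_????.zpaq" c:\myfileserver c:\myseconddatafolder c:\whatever
    rem ...then mirror the folder: only the new part actually travels
    robocopy d:\slave \\mynas\secondbackup /mir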

You can achieve the same result (if you have rsync, as on Linux, BSD etc.) with a single .zpaq file and the --append switch. After the last run you have a 500GB mybackup.zpaq. The next day it will grow (in our example) to 500GB + 0.2GB = 500.2GB, right? Right.
But the first 500GB are untouched.

So rsync --append will send ONLY the last 0.2GB, WITHOUT checking (re-reading) the 500.2GB source and the 500GB destination, in seconds, even over LAN (minutes over the Internet).
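
A sketch of that copy-of-the-copy step (the USB mount point and the remote host/path are just placeholders):

    # copy only the appended bytes of the single archive to a slow USB disk...
    rsync --append /backup/mybackup.zpaq /mnt/usb/mybackup.zpaq
    # ...or to a remote server over ssh (the "cloud backup" case)
    rsync --append /backup/mybackup.zpaq user@remote:/backup/mybackup.zpaq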

So which is better, multipart or rsync? It depends. On Windows rsync does not exist (in the default configuration), while robocopy does (and so does zpaqfranz).

On the other hand, you can use zpaqfranz itself to operate like robocopy (that's the r command). Short version: on Windows I typically use multipart archives mirrored with robocopy.

On Unix (or a Linux-based NAS): a single giant ZPAQ copied by rsync and CHECKED with zpaqfranz (just to be sure nothing went wrong during the transfer).

Why a single giant file, whenever possible? Because it is WAY simpler to manage, archive, even delete.

Splitting a backup into equal parts no longer makes sense (for me), since floppies, DVDs, ZIPs and JAZs are gone.
If you lose (say) the mycopy_0003.zip you are in serious trouble (f*ed in fact).

I hope you understand the philosophy behind it: I developed it to gradually solve the problems of a mid-enterprise storage manager... me :)

SampsonF commented 3 years ago

Thank you very much for your detailed explanations.

I understand the reasons behind it, and multithreading does have its challenges.

I use Fedora mainly. I discovered zpaq, found out it is not maintained any more, then discovered your fork.

I will try to compile zpaqfranz tomorrow.

Just now, I finished creating my first archive with zpaq, using zpaq a "archive???" -index index.zpaq

    $ zpaq l index.zpaq
    2234954.704464 MB of 2234954.704464 MB (450193 files) shown
      -> 1918050.446929 MB (32656073 refs to 27854941 of 27854941 frags) after dedupe
      -> 1905364.774782 MB compressed.
    Note: 0 of 1904556079735 compressed bytes are in archive
    27.230 seconds (all OK)

Luckily, this just fits on my 1.8TiB drive (with some space reserved by ext4).

How can I get better compression for photos?

fcorbelli commented 3 years ago

In fact... you cannot. You cannot compress already compressed files, like JPG, MP4 and so on. To create the archive (at this step) I suggest NOT using multipart.

So a straight

zpaq a mynewbackup.zpaq /home/myphoto

To increase the compression you can use -mX, with X (please note: NO space):

0 = no compression at all, only deduplication (good for video)
1 = default, fast (I suggest it for everything)
2 = compresses slower than 1, but extracts just as fast
3 = compresses better
4 = compresses much better
5 = placebo level
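
For example, for an already-compressed photo/video collection a deduplication-only run (a sketch using the -m0 level above) would be:

    zpaq a mynewbackup.zpaq /home/myphoto -m0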

Note that anything different from the default, or from -m0, will take a lot of time, a lot of RAM and a lot of CPU, for minimal gain.

the zpaq "power" is into snapshot archiving, it is NOT the faster or the smaller compressor. Using lzturbo you can easily find a faster one.

But you cannot use lzturbo or whatever to retain a history of your files in a single, manageable archive.

PS: to compile zpaqfranz you need g++ or clang installed; it is not hard. After that a single line does everything, or you can use the generic Makefile (good for Linux).
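
For reference, the single line meant here looks roughly like this on Linux (a sketch; the exact flags may differ, so check the project README or the Makefile):

    g++ -O3 -Dunix zpaqfranz.cpp -o zpaqfranz -pthread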

For any problem I am here.

SampsonF commented 3 years ago

I am ready to redo my backup with zpaqfranz now.

What parameters would you recommend? The target disk is an empty 1.8TiB disk and the data set is about 2.03TiB, with about 450k files of jpg, mp4, mov, etc.

After creating the first backup on Disk1, can I keep doing backups with Disk2 until it is full?

fcorbelli commented 3 years ago

Parameters? Nothing is required; the default is just fine. For the second disk, in fact, there is no easy way.
Remember: already compressed files (jpg, mp4, mov etc.) cannot be reduced in size.
My advice: try it and see how big the backup becomes.

SampsonF commented 3 years ago

The backup completed. But I forgot to use the -index parameter.

Is it possible to generate the local index after the backup is done?

fcorbelli commented 3 years ago

Mmmmhhh... No, not without heavy development, sorry

fcorbelli commented 3 years ago

I think I can close