NeoApplications / Neo-Backup

backup manager for android
GNU Affero General Public License v3.0

[Feature Request] Add zstd compression #444

Closed tiziodcaio closed 3 months ago

tiziodcaio commented 2 years ago

Would it be possible to add zstd, or more compression algorithms in general?

machiav3lli commented 2 years ago

This should be possible, but the "why" question is what would interest me.

tiziodcaio commented 2 years ago

Haha, I asked in case it's not too complicated. I'm no expert, but gzip doesn't actually have the best compression ratio, and it could be cool to support others. If you're not interested, no problem, it was only a little idea 😅

opusforlife2 commented 2 years ago

Is compression/decompression speed a bottleneck for the app? If so, will it benefit from using zstd?

tiziodcaio commented 2 years ago

The thing is that it's not really a problem; it was a whimsical feature request.

tiziodcaio commented 2 years ago

If you think it's just a dumb idea, no problem saying so. I'll close the issue without hesitation ;-)

hg42 commented 2 years ago

Most users will prefer fast compression (e.g. gzip level 1) over an optimal but slow compression ratio. Though, this could be interesting for speed, if the phone can compress faster than the reduced bytes need to crawl to slow storage (e.g. when using an rclone mount or a slow sdcard, and also for syncing). However, the improved ratio is usually not that big...

I would say it depends on the available libraries; adding them might not be too complicated if they are similar in usage to the gzip lib.

Leojc commented 2 years ago

Looking forward to this feature! zstd is really fast and has a better compression ratio, so it's totally worth it to replace gzip. Also, lz4 is very fast at decompression, which could be a nice alternative; I think lz4 can be about 7x faster than gzip for decompression.

tiziodcaio commented 1 year ago

I've found this app that implements lzo and zstd compression for backups: https://github.com/XayahSuSuSu/Android-DataBackup

tsiflimagas commented 1 year ago

This comment from Nils is very interesting. It seems that adding support for more compression algorithms could be much more straightforward since it would use a dependency that's already in the app.

hg42 commented 1 year ago

I've found this app that implements lzo and zstd compression for backups: https://github.com/XayahSuSuSu/Android-DataBackup

I didn't look into the source, but DataBackup uses a script from someone else. I guess the compression is done via command line while NB uses a library.

Currently you can still switch between completely API/library-based archiving (including tar; we call it tarapi) and tarcmd, which uses the tar command.

The compression could be moved to the command line.

However, it would not make sense to do tarapi -> compression command -> encryption via API,

so switching between tarapi and tarcmd would be dropped, which also removes compatibility for older backups (or the old routines would still have to use the API versions).

MrEngineerMind commented 1 year ago

Yeah, not being able to restore an older backup because the compression changed would not be a good thing.

opusforlife2 commented 1 year ago

Major releases have often broken compatibility. It wouldn't be anything new. To know if space savings are worth it, someone needs to test zstd.

MrEngineerMind commented 1 year ago

Using "Often" is not very accurate, and even if it was, this type of "compatibility" issue is kind of unique for this type of app.

Imagine if Adobe Photoshop came out with a new version that could no longer load legacy JPEG/BMP image files because they are an old format.

opusforlife2 commented 1 year ago

Apples and oranges. Apps get updated, so your old backups will eventually become irrelevant.

MrEngineerMind commented 1 year ago

"Eventually...."

But what you are proposing is not a gradual thing. As soon as a new version of the NB app switches to a different compression, ALL previous backups become unusable, which is not a good thing.

opusforlife2 commented 1 year ago

Sure, but if maintaining compatibility hinders improvements, then it isn't worth it. The deciding factor here will likely be how beneficial zstd actually is in practice.

MrEngineerMind commented 1 year ago

The only practical solution is to allow the user to specify the compression method when doing a backup, and have NB auto-detect the compression method of a backup when restoring.

opusforlife2 commented 1 year ago

Not that practical when you consider that the entire maintenance burden for the alternative code paths is on the developer(s). Totally up to them if they want to spend the time and effort for that.

MrEngineerMind commented 1 year ago

Being a programmer, I know there would be a one-time effort to add support for a second compression method. But that effort would be a fraction of the effort required by your suggestion of switching entirely to a different compression method.

And, once that work is completed, there will be little need to change it for any future versions of NB.

opusforlife2 commented 1 year ago

Neither am I a developer, nor is this my feature request. I'm merely going by what I understand from https://github.com/NeoApplications/Neo-Backup/issues/444#issuecomment-1412473698.

MrEngineerMind commented 1 year ago

And that #444 comment is saying that it would be a significant effort to modify NB to use that other compression method.

hg42 commented 1 year ago

The only practical solution is to allow the user to specify the compression method when doing a backup, and have NB auto-detect the compression method of a backup when restoring.

yes, that's the natural thing to do

Not that practical when you consider that the entire maintenance burden for the alternative code paths is on the developer(s). Totally up to them if they want to spend the time and effort for that.

right, it needs some restructuring; that's the biggest reason for now. And there are more important things to do. The advantage is too small compared to other things.

The maintenance is not really a problem once the libs have matured (they just work) and the code is modularized. Up to now, compression is integrated with some conditionals, and this is not the way to go if we have multiple compression methods. E.g. the autodetection doesn't exist; it's "if compressed, use .gz and do gzip". This needs to use the stored compression method or the file extension instead.

And what to do if they don't match? A user could recompress a backup with a different algorithm, which is kind of reasonable from my POV. The usual (so-called professional) approach is to forbid users to manipulate the managed data, but I would like to support that (with obvious limitations). Especially for backups, there should be robust strategies.

As a conclusion, I would prefer the file extension. This has to be discussed between developers.
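For illustration, a minimal Kotlin sketch of such autodetection, preferring the file extension and falling back to the stored compression method; CompressionType and detectCompression are hypothetical names, not NB's actual code:

import java.io.File

// Hypothetical sketch: derive the algorithm from the file extension,
// fall back to the type stored in the backup properties, and prefer
// the file itself if the two disagree (e.g. a user recompressed it).
enum class CompressionType(val extension: String) {
    NONE(""), GZIP("gz"), ZSTD("zst");

    companion object {
        fun byExtension(ext: String): CompressionType? =
            values().firstOrNull { it.extension == ext }
    }
}

fun detectCompression(archive: File, storedType: String?): CompressionType {
    val fromName = CompressionType.byExtension(archive.extension)
    val fromProps = storedType?.let { CompressionType.byExtension(it) }
    return fromName ?: fromProps ?: CompressionType.NONE
}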

hg42 commented 1 year ago

Being a programmer

nice, so you could add it? :-)

I know there would be one-time effort to add support for a second compression method

correct, it's basically making it modular

supporting more methods would then be simple (given a library that supports the same interface, or at least one similar enough; in this case it needs to stream the data)

And, once that work is completed, there will be little need to change it for any future versions of NB

right, here the maturity kicks in. If the lib isn't mature and has bugs, like crashing on certain data, it would create maintenance and, even worse, the backup could be unusable despite having been compressed successfully.

This means I would not like to add any new algorithm until I have the feeling that it's really ready for important data. Note that many uses of compression are not mission critical. But backup is (at least from my POV). (Note: I'm not the main developer.)

MrEngineerMind commented 1 year ago

"This means, I would not like to add any new algorithm, until I have a feeling, that it's really ready for important data."

I agree.

hg42 commented 1 year ago

the #444 comment is about a command-line tool solution

I wanted to say that even if it's easy for DataBackup, it's a different case for NB.

some of my thoughts:

hg42 commented 1 year ago

Anyone interested in this could also try some things to prove whether the advantage is as big as you hope:

for wizards:

hg42 commented 1 year ago

"This means, I would not like to add any new algorithm, until I have a feeling, that it's really ready for important data."

At least this would be valid for a built-in compression method, which is like a recommendation. A configurable command line, or a plugin from an external repository etc., would be the user's own decision.

murlakatamenka commented 6 months ago

This should be possible, but the "why" question is what would interest me.

  1. It beats old gzip basically everywhere (compression/decompression speed/time, compression ratio, linear scaling of compression depending on the selected level).

https://morotti.github.io/lzbench-web -> section Evolution:

We're in the 3rd millennium, and there was surprisingly little progress in general compression in the past decades. deflate, lzma and lzo are from the 90s; the origin of lz compression traces back to at least the 70s.

Actually, it's not true that nothing happened. Google and Facebook have people working on compression; they have a lot of data and a ton to gain by shaving off a few percent here and there.

Facebook in particular has hired the top compression research scientist and rolled out 2 compressors based on a novel compression approach that is doing wonders. That could very well be the biggest advance in computing in the last decade.

See zstd (medium) and lz4 (fast):

  • zstd blows deflate out of the water, achieving a better compression ratio than gzip while being multiple times faster to compress.
  • lz4 blows lzo and Google Snappy away on all metrics, by a fair margin.

Better yet, they come with a wide range of compression levels that can adjust speed/ratio almost linearly. The slower end pushes against the other slow algorithms, while the fast end pushes against the other faster algorithms. It's incredibly friendly for a developer or a user. All it takes is a single algorithm to support (zstd) with a single tunable setting (1 to 20) and it's possible to accurately trade off speed for compression. It's unprecedented.

Of course one could say that gzip already offered tunable compression levels (1-9); however, it doesn't cover a remotely comparable range of speed/ratio. Not to mention that the upper half is hardly useful: it's already slow, and making it slower brings little benefit.


This is from a benchmark made 6 years ago (!), and it has surely improved since then.

  2. zstd is in the Linux kernel, and kernel devs don't let random crap into it

    zstd can be used to compress the kernel itself and its modules, for zram/zswap, and for transparent filesystem compression (BTRFS)


This alone is the ultimate seal of approval; the Linux kernel is no joke.

  3. It's basically everywhere now.

    See https://en.wikipedia.org/wiki/Zstd#Usage

    A few examples:

    Linux package repos are no joke either, and that scenario is pretty similar to the backup one:

    • fast compression/decompression -> job done faster, happy user, less battery used
    • smaller size -> less storage/bandwidth used

The fact that this issue has been open for 2 years is quite disappointing, imo.

murlakatamenka commented 6 months ago

As for implementation, I see that the compression library in use, commons-compress, supports zstd, so I don't really get why it would be hard to use another compressor:

tar | gzip | encryption -> org.jdoe.app.tar.gz -> tar | zstd | encryption -> org.jdoe.app.tar.zst

Add compression + level

enum Compression {
    Gzip(u8),
    Zstd(u8),
}

and expose those via the GUI, which at the moment exposes only the level, because there is only gzip. Voilà? Simplified, of course, but isn't that about right?
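If it helps, here is a rough Kotlin sketch of that idea against the Commons Compress API NB already uses. This is a hedged sketch, not NB's actual code: ZstdCompressorOutputStream additionally needs the zstd-jni library on the classpath, and the level ranges differ between gzip and zstd.

import java.io.OutputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream
import org.apache.commons.compress.compressors.gzip.GzipParameters
import org.apache.commons.compress.compressors.zstandard.ZstdCompressorOutputStream

// The "enum + level" idea from above, translated to Kotlin.
sealed class Compression(val level: Int) {
    class Gzip(level: Int) : Compression(level)   // levels 1..9
    class Zstd(level: Int) : Compression(level)   // typically 1..19
}

// Wrap the output of tar with the selected compressor; encryption
// would wrap this stream in turn, exactly like with gzip today.
fun OutputStream.compressedWith(c: Compression): OutputStream = when (c) {
    is Compression.Gzip -> GzipCompressorOutputStream(
        this, GzipParameters().apply { compressionLevel = c.level })
    is Compression.Zstd -> ZstdCompressorOutputStream(this, c.level)
}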


@hg42

autodetection would still be necessary

that's not a problem; a zstd-compressed stream has its own magic number, see https://datatracker.ietf.org/doc/html/rfc8878#section-3.1.1-3.2

Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528
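A tiny Kotlin sketch of sniffing that magic number; since the value is little-endian, it appears on disk as the bytes 28 B5 2F FD (looksLikeZstd is a made-up name):

import java.io.File

// 0xFD2FB528 little-endian -> bytes 28 B5 2F FD at the start of the file
private val ZSTD_MAGIC = byteArrayOf(0x28, 0xB5.toByte(), 0x2F, 0xFD.toByte())

fun looksLikeZstd(file: File): Boolean =
    file.inputStream().use { input ->
        val header = ByteArray(4)
        input.read(header) == 4 && header.contentEquals(ZSTD_MAGIC)
    }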


how to handle it, when the user would change those commands, but there are old backups created with other commands?

I'd say this is overthinking. If a user is smart and advanced, he either won't do it, or he won't recompress a tar.gz into a tar.xz and then expect NB to work like nothing happened.

hg42 commented 6 months ago
  1. It beats old gzip basically everywhere (compression/decompression speed/time, compression ratio, linear scaling of compression depending on the selected level).

the memory consumption would be an important part. Speed often comes from using more memory.

hg42 commented 6 months ago

The fact that this issue has been open for 2 years is quite disappointing, imo.

pull requests are probably welcome, when well done

hg42 commented 6 months ago

autodetection would still be necessary

that's not a problem

In the sense that it's solved, yes. But I didn't say it's a problem; I meant it's work to be done.

All I'm doing is gathering info, to get a picture of how much work it is in relation to the gain.

What to do with that info is a matter of priority lists, or of enthusiastic developers creating pull requests.

From my current POV it's work that some other developer could do, exactly because it's fairly simple. I concentrate on more complicated things.

As for compression, I use gzip level 1, and other items on my personal todo list will probably help my backups much more.

On my todo list, I prioritize items that make things possible before items that improve speed, etc. E.g., challenges introduced by new Android versions, the complications of work profiles, etc. And then I need spare time...

murlakatamenka commented 6 months ago
  1. It beats old gzip basically everywhere (compression/decompression speed/time, compression ratio, linear scaling of compression depending on the selected level).

the memory consumption would be an important part. Speed often comes from using more memory.

* there are several backups running in parallel

* phones have less memory than workstations and servers

True, I didn't mention memory usage anywhere, but again, zstd is flexible and allows controlling both memory and threads.

For example, memory is tunable mainly via windowLog.

For threads there are the -T/--threads and --single-thread options; the latter is specifically tailored for low-memory scenarios:

--single-thread: Use a single thread for both I/O and compression. As compression is serialized with I/O, this can be slightly slower. Single-thread mode features significantly lower memory usage, which can be useful for systems with limited amount of memory, such as 32-bit systems.

So both options are available: several parallel backups with a single thread per worker, or a sequential backup where a single worker uses multiple threads to compress faster; both within a memory limit.
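On the library side, a hedged Kotlin sketch of that first option, assuming zstd-jni's ZstdOutputStream setters (zstd-jni is the library Commons Compress delegates zstd to); whether NB would expose this at all is an open question:

import java.io.OutputStream
import com.github.luben.zstd.ZstdOutputStream

// Assumed zstd-jni API: nbWorkers = 0 keeps compression single-threaded
// within one stream, so parallelism comes from running several backups,
// and per-stream memory stays bounded by the compression level.
fun singleThreadedZstd(sink: OutputStream, level: Int): OutputStream =
    ZstdOutputStream(sink, level).apply {
        setWorkers(0) // no worker threads, akin to the CLI's --single-thread
    }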

hg42 commented 6 months ago

I was thinking of a benchmark on a mid-range phone :-)

But in the end it doesn't matter; it's on a todo list, but not a priority.

hg42 commented 6 months ago

For threads there are the -T/--threads and --single-thread options; the latter is specifically tailored for low-memory scenarios

This doesn't help, because multiple backups run in parallel; we would probably use single-thread anyway.

The main question is how much memory zstd uses. Benchmarks of a single compression don't matter if the memory consumption is high and it starts invalidating pages. So a fast zstd with high memory consumption could be (maybe much) slower than a slower gzip with low memory demands.

hg42 commented 6 months ago

And if memory consumption were limited by an option, would it still be faster?

FrozzDay commented 6 months ago

I don't have any scientific tests, but my phone, which has an SD439 and 4 GB of RAM, handles the tar --zstd operation in 13 seconds and the tar --gzip operation in 40 seconds.

hg42 commented 6 months ago

It's obvious that a single compression runs faster. But as I said, speed AND compression ratio are both traded against memory consumption.

So, to compare, you need to start 8 (or the number of cores) archiving processes at once for each.

The result could be totally different.

hg42 commented 6 months ago

This tells us that zstd should also be good for multithreading:

https://engineering.fb.com/2016/08/31/core-infra/smaller-and-faster-data-compression-with-zstandard/

unless you choose a higher level:

https://jothiprasath.com/blog/gzip-vs-zstd/ (see level 19)

The Meta document tells us that it can also use terabytes. Looks like level 3 is a good balance. (Well, I can already hear people complaining about never-ending backups if they don't keep the default.)

FrozzDay commented 5 months ago

I tried adding zstd to this app; I think it's quite stable. Here is the repo: NeoBackup + zstd

hg42 commented 5 months ago

I tried adding zstd to this app; I think it's quite stable. Here is the repo: NeoBackup + zstd

From a quick check, this looks good. I did not test yet...

machiav3lli commented 5 months ago

@FrozzDay I guess you can already open a PR. Thanks!

hg42 commented 5 months ago

I will create a pumpkin apk soon...

As far as I see, compressionLevel is still 0...9; I guess that's ok for now... Given that higher levels are slow, I'm currently testing some levels (in a debug version, which I think shouldn't matter much; at least the comparison is under the same conditions).

zstd level 1 created 15 GB in 10 min (on empty directory)
zstd level 3 created 15 GB in 13 min (rewriting existing backups)
zstd level 9 created 14.9 GB in 15 min (rewriting existing backups)
gzip level 1 created 15.3 GB in 16 min (on empty directory)
gzip level 2 created 15.3 GB in 16 min (rewriting existing backups)

quite interesting...

hg42 commented 5 months ago

@FrozzDay

the pref does not open a menu, but that's something to be fixed in our code, so just create the PR; I'll fix that later (the code already exists)

hg42 commented 5 months ago

I also think gz / zstd should be gzip and zstd (or gz and zst; I would prefer the first), so there is a type and an extension.

EDIT:

I see, "gz" was used before in the properties file, so @machiav3lli what should we do? I think we should definitely maintain compatibility.

Use compressionType = if (compressionType == "gz") "gzip" else compressionType in the Backup constructor (is that enough? [EDIT: no, it is not]),

or be pragmatic and use the file extensions "gz" and "zst"? But what if more types are added...

EDIT:

I'm using "gz" and "zst"
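For the record, a hedged sketch of that normalization when reading older properties files; the function name is made up, and NB's real handling may differ:

// Map legacy/long spellings in backup.properties to the short names.
fun normalizeCompressionType(stored: String?): String? = when (stored) {
    null, "", "none" -> null   // uncompressed backup
    "gzip", "gz" -> "gz"       // "gz" was already used before
    "zstd", "zst" -> "zst"
    else -> stored             // unknown value: keep it, let detection decide
}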

hg42 commented 5 months ago

here my test apk:

https://t.me/neo_backup/53944/56653

FrozzDay commented 5 months ago

or be pragmatic and use the file extensions "gz" and "zst"? But what if more types are added...

I think that should be enough, as each compression method supported by Commons Compress has its own unique format.

hg42 commented 5 months ago

My thought was that gzip/zstd/whatever would be clearer than gz, zst.

The names derived from the file extensions are not really standard; e.g. Windows usually uses 3 letters, while others often use more, e.g. .zstd is also common.

But I decided to take the short name for now.

At some point compression (and other output transformations) should probably be moved into modules. Then alternative names can be registered by those modules.

hg42 commented 3 months ago

I'll close this, because it's implemented by @FrozzDay (thanks again) in #856.

I'm using it myself; it seems to work well and the code is clean. Be careful with the compression levels: I think higher levels take much more memory, and then it probably slows down. You should probably find the sweet spot for your situation. Though with lower levels it's a no-brainer.

opusforlife2 commented 3 months ago

Online sources seem to indicate that the default level of 3 is best. Example: https://docs.aws.amazon.com/athena/latest/ug/compression-support-zstd-levels.html