boludoz / lz4

Automatically exported from code.google.com/p/lz4
Other
0 stars 0 forks source link

LZ4 cli invocation should match other compressors for non-option arguments #151

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The lz4 command from lz4cli.c uses non-option args 1 and 2 as *input_filename 
and *output_filename.  While you can leave output_filename blank and have it 
created as an strcpy of the input_filename + LZ4_EXTENSION, it doesn't allow 
you to send multiple input files at a time.

What is the expected output? What do you see instead?
Expect:  cli program (lz4) should take all non-option arguments as input files 
to be processed into .lz4 files, at a minimum.
Expect:  cli program (lz4) to remove input_filename sources like gzip, bzip2, 
and xz, with an option to skip this (-k, for keep).

What version of the product are you using? On what operating system?
Tested on ubuntu 14.04 LTS with liblz4-tools and I compiled the latest source 
(r127 I believe).  Both lz4 and the older lz4c do not have the expected 
behavior.

Please provide any additional information below.
The two primary purposes of this are:
1.  Consistency with other compressors.
2.  Parallelization is more difficult and more expensive.  A new process must 
be spawned for each file to compress.  Other tools (even those without a 
recursive switch) can be fed multiple files via find/xargs/gnu-parallel.

Thanks for your consideration.  This software and the CLI tool are very nice.

Original issue reported on code.google.com by KyleJHar...@gmail.com on 22 Jan 2015 at 3:56

GoogleCodeExporter commented 9 years ago

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 4:05

GoogleCodeExporter commented 9 years ago
Hello Kyle

Indeed, the first behavior you are observing is considered "correct", which 
means as intended by the developer.

"inFile1 outFile2" behavior is so entrenched within lz4 command line tool that 
it will difficult to modify that choice now without creating some massive 
problems for installed user base.
Should a "multiple input file" process exist, it will need its own command, to 
ensure compatibility with existing scripts.
And since one command must be added, it will not work "as is".

On top of that, multiple input files is a deliberately avoided feature so far,
because it introduces a huge corpus of File Systems related issues to handle. 
And this a workload not be underestimated.

As a consequence, the preferred way to handle "multiple input files" scenario 
is to couple lz4 with tar, using pipe. The advantage is that tar is a well 
known and well supported aggregation tool, compatible with a large base of file 
systems.
That's how I handle such use case myself, and how I recommend it to everyone.

I believe it would answer your use case too, even it if makes the line of 
script slightly longer. For some useful examples (should you ever need it) : 
http://www.computerhope.com/unix/utar.htm

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 4:30

GoogleCodeExporter commented 9 years ago
> Expect:  cli program (lz4) to remove input_filename sources like gzip, bzip2, 
and xz, with an option to skip this (-k, for keep).

The answer to this second behavior is a bit different.
We had some discussions about it quite some time ago, when settling lz4 command 
line behavior.

The conclusion was that the "delete by default" behavior was probably a good 
design choice back in the days when storage space was scarce and expensive. But 
it is no longer appropriate today.
This is scary to imagine an administrator could lose some file just because he 
forgot to add "-k" to his script or his command line. And the too-typical 
answer "RTFM" translated "it's his fault, he should have paid attention" is no 
longer acceptable either.

So it was a deliberate choice to keep original file as default.
I understand it makes lz4 behavior slightly different from gzip. But I really 
believe it is the right thing to do today.

Deleting a file is a separate command line operation. If there was some 
standard command switch to delete a file after compression, lz4 would have 
supported it, but there is none, as far as I know. So it's not necessary to 
make the list of commands longer.

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 4:45

GoogleCodeExporter commented 9 years ago
Hello.  Thanks for the quick updates.

On comment #2:  I agree it would probably be a new switch or a different cli 
program altogether to allow the existing user base to maintain functionality.  
The use of tar is reasonable for cases when all files can to go into a single 
archive, but it doesn't permit the creation of multiple .lz4 files at once.  
Granted, the latter is a less-often use-case, but it does exist (which prompted 
me to file an issue heh).

Also, you mentioned there would be a large number of File Systems related 
issues with using multiple inputs.  Forgive my ignorance, but could we not copy 
the logic from, for example, existing GNU programs which already handle the 
typical: some_prog -a -b -c file1 file2 ... ?  My apologies if I'm off base, 
but it seems like we could get the file list in a similar manner and then loop 
over the existing logic near the end of main() to run the 
DEFAULT_COMPRESS/DECOMPRESS actions.

On comment #3:  Yea, having a delete-by-default action is purely subjective.  I 
see the merits and consequences.  Having the lz4 cli do it by default is 
perhaps overkill; though a switch to auto-delete would be useful.  That said, 
the lzop program doesn't do this and the world keeps spinning :)  This was a 
very minor consideration in my mind when filing the issue; it just seemed 
appropriate to bring up at the same time.

Original comment by KyleJHar...@gmail.com on 22 Jan 2015 at 5:05

GoogleCodeExporter commented 9 years ago
> it doesn't permit the creation of multiple .lz4 files at once. Granted, the 
latter is a less-often use-case, but it does exist

Good point. I did not thought about it when answering this issue.
Maybe it does deserve a switch after all...

> Also, you mentioned there would be a large number of File Systems related 
issues with using multiple inputs.  Forgive my ignorance, but could we not copy 
the logic from, for example, existing GNU programs which already handle the 
typical: some_prog -a -b -c file1 file2 ... ? 

This was related to my previous answer, when I wrongly understood your were 
willing to compress several files into a single archive.
In such case, preserving file attributes, storing filenames (cyrillic ? chinese 
? arab ? etc.), directory structure, link, etc. is a serious job.

But if the point is just to loop over multiple filenames to compress them one 
by one, then it's more trivial to implement.

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 5:16

GoogleCodeExporter commented 9 years ago
Ahh, sorry for the confusion.  Yes my intention is to just loop over multiple 
files and compress them one by one.  Such that:
lz4 file1 file2 file3    become...
file1.lz4
file2.lz4
file3.lz4

The main benefits are as listed above: parallelization (find/xargs) and 
avoiding the expense of spawning (and then deconstructing) a new process for 
each file.

I actually spent a couple hours last night implementing such a loop.  It worked 
to an extent, but my knowledge of C is weak.  I kept reaching segfaults and I 
don't have knowledge on debugging C very well.  I'm pretty sure it was because 
I was misusing the *input_filename, *output_filename, and *dynNameSpace 
pointers; treating them like a char[] when trying to clear and assign the next 
filename to process.  Someone with experience writing C would have a much 
easier time than I did heh.

Original comment by KyleJHar...@gmail.com on 22 Jan 2015 at 5:32

GoogleCodeExporter commented 9 years ago
OK

Let's suppose a new switch is created for this use case, for example -M,
would that be fine for your use case ?

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 5:44

GoogleCodeExporter commented 9 years ago
Absolutely.  That would give a big boost to compatibility with other 
compressors (as well as GNU-style cli programs).

I am happy to help any way I can if you need anything.  Thanks again!

Original comment by KyleJHar...@gmail.com on 22 Jan 2015 at 5:47

GoogleCodeExporter commented 9 years ago
OK, I'll handle it

Original comment by yann.col...@gmail.com on 22 Jan 2015 at 6:59

GoogleCodeExporter commented 9 years ago
Hey, I went back and looked at the code.  Not sure why it gave me such an issue 
the first time.

Anyway, I checked out a copy from git (github: Cyan4973/lz4/master) and made 
the patch.  I sent a pull request for it.  As mentioned I'm not an expert in C, 
but I believe the patch should be acceptable with little to no modification.

If it looks good, please merge and feel free to close this issue.

Thanks.

Original comment by KyleJHar...@gmail.com on 15 Feb 2015 at 5:07

GoogleCodeExporter commented 9 years ago
The dev branch at :
https://github.com/Cyan4973/lz4/tree/dev
has received significant udate with regards to issue 151.

The code do no longer depends on "VLA" (Variable Length Array),
as nicely described by Takayuki at 
https://github.com/Cyan4973/lz4/pull/52#issuecomment-76760185,
which makes it more portable.

Of importance too, the logic has been changed to be ported by LZ4IO instead.
It now relies on a new function :
int LZ4IO_compressMultipleFilenames(const char** inFileNamesTable, int 
ifntSize, const char* suffix, int compressionlevel);

which seems a better responsibility distribution.

The scheme has even been extended to benchmark, so that it can bench multiple 
files in a row (which was already possible, but would crash is a command was 
added in the middle or end of the filename list).

A remaining significant difference with gzip is that the process stops when it 
reaches an unavailable filename.
For example, trying :
./lz4 -m file1 nothere file2
it would output :
compress XX bytes into YY bytes ==> PP%
impossible to open file nothere

while gzip would output a warning message, and continue on compressing file2.

It's a matter of error vs warning status.

If you have a preference, please tell.

Original comment by yann.col...@gmail.com on 7 Mar 2015 at 12:31