Tarsnap / tarsnap

Command-line client code for Tarsnap.
https://tarsnap.com

--include option #581

Open ashishdamania opened 1 year ago

ashishdamania commented 1 year ago

Hello. This may be very trivial, but I am trying to figure out how to use the include-only option for tarsnap.

tarsnap --dry-run --no-default-config --print-stats --include="*.pdf*"  -c  /Users/xyz

If I run this, I get a warning that "Archive contains no files" and it does not seem to work:

tarsnap: Warning: Archive contains no files
                                       Total size  Compressed size
All archives                               1.5 kB           1.4 kB
  (unique data)                            1.5 kB           1.4 kB
This archive                               1.5 kB           1.4 kB
New data                                   1.5 kB           1.4 kB

However, this seems to work.

tarsnap --dry-run --no-default-config --print-stats --exclude="*.pdf*"  -c  /Users/xyz

tarsnap: Removing leading '/' from member names
                                       Total size  Compressed size
All archives                               8.4 MB           3.4 MB
  (unique data)                            8.4 MB           3.4 MB
This archive                               8.4 MB           3.4 MB
New data                                   8.4 MB           3.4 MB

Am I missing anything? This is my tar version:

bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8

Thanks for making Tarsnap.

gperciva commented 1 year ago

Do you have pdf files directly inside /Users/xyz?

Unfortunately, the behaviour of --include is inherited from the ancient (1979!) tar(1) command. It's easier to think of --include as --include-only, meaning "only include files and directories whose names match this".

https://www.tarsnap.com/selecting-files.html

So if you have

/Users/xyz/my-docs/foo.pdf

then tarsnap notices that my-docs/ does not match *.pdf, so it doesn't look inside my-docs/

ashishdamania commented 1 year ago

Yes, it does have PDF files, but they are nested within folders. So I guess I should just pass in the files filtered through the "find" command. Thanks!

gperciva commented 1 year ago

Yes, passing filenames via find is a great way to avoid this problem!

ashishdamania commented 1 year ago

Just for future reference, I ended up using the find command as follows:

find /Users/xyz -type f \( -name '*.docx' -o -name '*.pdf' \) -print0 | xargs -0 tarsnap --dry-run --no-default-config --print-stats --humanize-numbers -c

This will only include files with the .docx and .pdf extensions. Maybe there is a better and easier way to achieve this goal.
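A note on find expressions of this shape: the implicit "and" between tests binds more tightly than -o, so an explicit action such as -print0 attaches only to the last branch unless the name tests are grouped with \( ... \). The sketch below shows the difference in a scratch directory; the file names are hypothetical and exist only for the demonstration.

```shell
# Scratch directory with hypothetical file names.
demo=$(mktemp -d)
mkdir "$demo/my-docs"
touch "$demo/a.docx" "$demo/my-docs/b.pdf" "$demo/c.txt"

# Ungrouped: -print0 belongs only to the '*.pdf' branch, so a.docx is
# matched but never printed.
ungrouped=$(find "$demo" -type f -name '*.docx' -o -name '*.pdf' -print0 |
    tr '\0' '\n' | wc -l | tr -d ' ')

# Grouped: either name test may match, then -print0 runs for both.
grouped=$(find "$demo" -type f \( -name '*.docx' -o -name '*.pdf' \) -print0 |
    tr '\0' '\n' | wc -l | tr -d ' ')

echo "ungrouped: $ungrouped, grouped: $grouped"
rm -r "$demo"
```

Grouping also keeps -type f applied to both patterns, so a directory whose name happens to end in .pdf is not picked up.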

gperciva commented 1 year ago

That's one workable option!

I personally would do something like this (untested):

find /Users/xyz -type f \( -name '*.docx' -o -name '*.pdf' \) > ~/my-files-list.txt
tarsnap -c -T ~/my-files-list.txt
rm ~/my-files-list.txt

~but I can't offhand think of any reason to prefer that method over yours.~

The important thing is the tarsnap -T option.

(This should handle filenames with spaces, but not filenames which contain newlines. tarsnap -T is documented as being able to work with --null so it should be able to handle such filenames, but I'd want to double-check that before relying on it.)
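To make the newline caveat concrete, here is a small sketch that only uses find in a scratch directory (no tarsnap involved). The file names are made up for the demonstration. It counts how many entries a line-based list would contain versus how many NUL-terminated records find -print0 emits.

```shell
# Scratch directory with three hypothetical PDFs, one of which has a
# newline embedded in its name.
demo=$(mktemp -d)
touch "$demo/plain.pdf" "$demo/with space.pdf"
touch "$demo/$(printf 'new\nline.pdf')"

# Line-based list: the embedded newline splits one name into two lines.
lines=$(find "$demo" -type f -name '*.pdf' | wc -l | tr -d ' ')

# NUL-terminated list: one record per file, whatever the name contains.
records=$(find "$demo" -type f -name '*.pdf' -print0 |
    tr -cd '\0' | wc -c | tr -d ' ')

echo "lines: $lines, NUL-terminated records: $records"
rm -r "$demo"
```

A list file read line by line would see four entries here, two of them bogus, while the NUL-terminated form still describes exactly three files.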

EDIT: if you have a lot of filenames, you might run out of room for the command-line arguments, which would result in missing files from your archives! For that reason, I recommend this method, instead of find | xargs directly.

gperciva commented 1 year ago

Whoops, sorry, I was wrong about xargs. Ignore my previous message. (I'll edit it on github.)

ashishdamania commented 1 year ago

Actually, I am getting this output with my naive xargs one liner:

tarsnap: Argument vector exceeds 128 kB in length; vector stored in archive is being truncated.
tarsnap: Removing leading '/' from member names
                                       Total size  Compressed size
All archives                               823 MB           724 MB
  (unique data)                            793 MB           698 MB
This archive                               823 MB           724 MB
New data                                   793 MB           698 MB

I am concerned about that Argument vector exceeds 128 kB in length message.

gperciva commented 1 year ago

Right; that'll happen if you have a lot of files. Your archive does not contain all of your pdf or docx files!

Writing the list of filenames to a file would avoid that problem.

cperciva commented 1 year ago

I'm 99% confident that in the case above the archive does contain all of the files -- if there were too many to fit into a command line, xargs would run multiple tarsnap processes (of which all but the first would fail with an "archive already exists" error since all of the tarsnap processes would use the same archive name).
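The splitting behaviour of xargs can be seen without tarsnap. In this sketch, -n 2 is an artificially small per-invocation limit standing in for a full command line, and echo stands in for the command being run; the real limit depends on the system.

```shell
# Five arguments, at most two per invocation: xargs runs the command
# three times rather than dropping anything.
invocations=$(printf '%s\n' a b c d e |
    xargs -n 2 echo would-run-with | wc -l | tr -d ' ')

echo "invocations: $invocations"
```

Each output line corresponds to one separate run of the command, which is why a very long file list piped through xargs would turn into several tarsnap processes.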

But yes, putting the complete list of files into the command line is a bit of a weird way of doing this; much better to use the -T option.

cperciva commented 1 year ago

FWIW I would say that the safest "most unixy" way of doing this is `find ... -print0 | tarsnap -c --null -T- ...`.
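Spelled out for the .docx/.pdf case from earlier in the thread, that pipeline might look like the sketch below. The function names list_docs and backup_docs are hypothetical helpers, and the extra flags (--dry-run, --no-default-config, --print-stats) are simply the ones used in the original report, so treat this as an untested illustration rather than a recommended invocation.

```shell
# List matching files as NUL-terminated records, which stays safe for
# names containing spaces or newlines.
list_docs() {
    find "$1" -type f \( -name '*.docx' -o -name '*.pdf' \) -print0
}

# Feed that list to tarsnap on stdin. --dry-run only simulates the
# archive, as in the commands earlier in this thread.
backup_docs() {
    list_docs "$1" |
        tarsnap -c --null -T- --dry-run --no-default-config --print-stats
}
```

Calling backup_docs /Users/xyz would then run the simulation. For a real archive, --dry-run would be dropped and an archive name supplied with -f.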

gperciva commented 1 year ago

Oh right, that's a tarsnap warning, not a bash warning. So the result is if you ran

tarsnap --list-archives -vv

(which prints out the command-line used to generate each archive), then that archive wouldn't print the right thing.

But that's another reason to go with writing the filenames to a list file; --list-archives -vv is going to be unreadable if you have tons of filenames in there.

gperciva commented 1 year ago

I'll revisit this tomorrow morning and look at something to add to the webpage, along with reasons why other methods might not be ideal.