E3SM-Project / zstash

Long term HPSS archiving tool for E3SM
BSD 3-Clause "New" or "Revised" License

[Bug]: zstash with --follow-symlinks clobbers the source directory #341

Closed TonyB9000 closed 1 month ago

TonyB9000 commented 1 month ago

What happened?

Given a source directory SRC with many symlinks to large files (and some actual regular files), zstash --follow-symlinks properly creates an archive with the real file content, but it clobbers the SRC directory, replacing the symlinks found there with the actual files referenced by the links.

What machine were you running on?

acme1.llnl.gov

Environment

v1.4.3

Minimal Complete Verifiable Example (MCVE)

    mkdir -p src/d1 src/d2
    # Place "large_file" in src/d1, then create the symlink:
    ln -s <full_path_to>/src/d1/large_file src/d2/large_file
    ls -l src/d2    # large_file is listed as a symlink
    zstash create --hpss=none --follow-symlinks --cache <path_to_new_archive> src/d2
    ls -l src/d2    # large_file is now a regular file; the symlink has been replaced

Relevant log output

No response

Anything else we need to know?

It is common to create "version" directories, where original (version 1) directories have thousands of LARGE files, and subsequent version 2 and version 3 directories replace a few files but retain symlinks to the bulk of the unchanged original files. Creating an archive of the latest version directory (via --follow-symlinks) should NOT cause replication of the original volumes (except into the archive produced).

forsyth2 commented 1 month ago

Update from @TonyB9000 -- The proper behavior: When zstash encounters a symlink S (with --follow-symlinks), it should use "realpath S" to locate the actual file, compute the checksum on THAT file, and copy that file into the tar archive (under the name of "S"), but not write any changes to the source directory.
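The behavior described above can be sketched with the standard library alone. This is only an illustration, not zstash's code: the helper name `add_following_symlink` is hypothetical, MD5 stands in for whatever checksum zstash actually computes, and the sketch assumes a POSIX system and that the link target is a regular file.

```python
import hashlib
import os
import tarfile
import tempfile


def add_following_symlink(tar, path, arcname=None):
    """Hypothetical helper: archive the data behind `path` under the link's name.

    Returns the MD5 hex digest of the archived bytes. The source tree is only
    read, never modified.
    """
    real_path = os.path.realpath(path)  # resolve the symlink, if any
    # Build the header from the real file (size, mode, mtime), keep the link's name.
    info = tar.gettarinfo(real_path, arcname=arcname or path)
    md5 = hashlib.md5()
    with open(real_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
        f.seek(0)  # rewind so the tar member gets the full content
        tar.addfile(info, f)
    return md5.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Same layout as the MCVE: real file in d1, symlink to it in d2.
    os.makedirs(os.path.join(tmp, "src", "d1"))
    os.makedirs(os.path.join(tmp, "src", "d2"))
    target = os.path.join(tmp, "src", "d1", "large_file")
    link = os.path.join(tmp, "src", "d2", "large_file")
    with open(target, "wb") as f:
        f.write(b"payload")
    os.symlink(target, link)

    tar_path = os.path.join(tmp, "out.tar")
    with tarfile.open(tar_path, "w") as tar:
        digest = add_following_symlink(tar, link, arcname="d2/large_file")

    with tarfile.open(tar_path) as tar:
        member = tar.getmember("d2/large_file")
        archived_is_regular_file = member.isfile()
        archived_bytes = tar.extractfile(member).read()

    # The source directory still contains the symlink.
    source_still_symlink = os.path.islink(link)
```

The point of the sketch is that nothing is removed from or copied into the source directory; the link is resolved only at the moment the member is written.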

forsyth2 commented 1 month ago

The lines of code that produce the hard copy can be found in https://github.com/E3SM-Project/zstash/pull/261/files#diff-7664ed168a129748a76d88615fcba9ce38e10bf3f2c4acbe89153ae80882a443, specifically:

    if follow_symlinks and os.path.islink(file_name):
        linked_file_name = os.path.realpath(file_name)
        os.remove(file_name)  # Remove symbolic link and create a hard copy
        shutil.copy(linked_file_name, file_name)

I recall the reasoning for doing a hard copy was that we'd have no way of knowing what to link to once we've moved from source to destination.

In the email you just sent, you summarized the following case, where (s) denotes a symlink:

    v1      v2      v3
    ---     ---     ---
    f1    <-(s)   <-(s)
    f2      f2    <-(s)
    f3      f3      f3

The problem is what if we only archived v3 and not v1? Then what would v3's symlink be pointing to?

I suppose your point though is that a hard copy should only be produced on the destination, and the source should remain unchanged. I'm just not quite sure how we would implement that, since zstash by its nature archives what is, well, there to archive.

TonyB9000 commented 1 month ago

Since my experience is with "local" archiving, the --follow-symlinks option should ensure that the tar-files created contain the actual data, without affecting the source directories. If the tar-files are created locally, that should not be a problem.

I think "tar" itself works that way - although not tested much by me.

In a sense, we are cooperating with a fantasy. The main reason for employing links (hard or soft) is to refactor the layout of something without having to duplicate volumes (real "copy"). The users of the layout should be able to remain ignorant of the actual underlying structure of things (hence "ls" does not reveal symlinks as links, but cooperates with the fantasy that the files are really there). Hence, when archiving, one may expect "--fantasy" to be in effect: Go ahead and archive the actual files, even though the directory only contained the symlinks.

TonyB9000 commented 1 month ago

Aside: Where "--cache" is specified, but NOT a fully-qualified path, I think the default behavior should be to place it in the user's current directory. By placing it in the tail of the source directory, it could interfere with some processing that is designed to "process all contents of a directory", not expecting a new file or directory to have appeared.

TonyB9000 commented 1 month ago

@forsyth2 Where you provided the code: if follow_symlinks and os.path.islink(file_name): this is clearly BEFORE the actual tarring is undertaken. Instead, at the very point where a file is being tarred, there should be a tar-module option to use "os.path.realpath(file_name)".

Another concern is that without "--follow-symlinks", if one creates an archive of all the "version" directories together - those links are fully-qualified (reference the parent file-system), and I'm not sure the links will still work when the archive is opened at a new destination. More stuff to test.
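One way to check that concern is to inspect what Python's `tarfile` stores for such a link when symlinks are not followed. The sketch below is an assumption-laden illustration (POSIX system, hypothetical `v1`/`v2` directories, plain `tarfile` rather than zstash itself): the member is recorded as a symlink whose target is the original absolute path, so it would only resolve at a destination where that same path exists.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    v1 = os.path.join(tmp, "v1")
    v2 = os.path.join(tmp, "v2")
    os.makedirs(v1)
    os.makedirs(v2)
    with open(os.path.join(v1, "f1"), "w") as f:
        f.write("data")
    # Fully qualified (absolute) link target, as in the "version" directories.
    os.symlink(os.path.join(v1, "f1"), os.path.join(v2, "f1"))

    tar_path = os.path.join(tmp, "versions.tar")
    with tarfile.open(tar_path, "w") as tar:  # dereference defaults to False
        tar.add(v1, arcname="v1")
        tar.add(v2, arcname="v2")

    with tarfile.open(tar_path) as tar:
        member = tar.getmember("v2/f1")
        stored_as_symlink = member.issym()
        stored_target = member.linkname  # link target recorded in the archive

    # The recorded target is the absolute path on the original file system.
    target_is_absolute = os.path.isabs(stored_target)
```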

forsyth2 commented 1 month ago

> Aside: Where "--cache" is specified

I created #344 to track that.

> the --follow-symlinks should ensure that the tar-files created contain the actual data, without affecting the source directories

I think I may have that working. See https://github.com/E3SM-Project/zstash/pull/343#issuecomment-2266190845. Notably, it seems like just using tar = tarfile.open(mode="w", fileobj=tarFileObject, dereference=follow_symlinks) may do what we need. Does the script output in that comment look good to you?
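For reference, here is a small standalone check of what `dereference=True` does in `tarfile`. It is a sketch, not the script from the linked comment: an in-memory `io.BytesIO` buffer stands in for `tarFileObject`, the directory layout mirrors the MCVE, and a POSIX system is assumed.

```python
import io
import os
import tarfile
import tempfile

follow_symlinks = True

with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "src", "d1")
    d2 = os.path.join(tmp, "src", "d2")
    os.makedirs(d1)
    os.makedirs(d2)
    target = os.path.join(d1, "large_file")
    link = os.path.join(d2, "large_file")
    with open(target, "wb") as f:
        f.write(b"payload")
    os.symlink(target, link)

    buffer = io.BytesIO()  # stand-in for tarFileObject
    with tarfile.open(mode="w", fileobj=buffer, dereference=follow_symlinks) as tar:
        tar.add(link, arcname="d2/large_file")

    buffer.seek(0)
    with tarfile.open(mode="r", fileobj=buffer) as tar:
        member = tar.getmember("d2/large_file")
        member_is_regular_file = member.isfile()
        member_bytes = tar.extractfile(member).read()

    # tarfile only read through the link; the source symlink is still a symlink.
    link_untouched = os.path.islink(link)
```

With `dereference=True` the archive member is a regular file carrying the target's content, and the source directory is left as it was.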

> I'm not sure the links will still work when the archive is opened at a new destination.

I think that was the whole point of #261 in the first place -- that there was no point archiving symlinks because the places they were pointing to didn't exist at the destination.