HumanCellAtlas / dcp-cli

DEPRECATED - HCA Data Coordination Platform Command Line Interface
https://hca.readthedocs.io/
MIT License

Hard-link creation fails when downloading to NFS mounts #519

Closed: fangpingmu closed this issue 3 years ago

fangpingmu commented 4 years ago

I am using hca dss to download data:

    hca dss download-manifest --manifest 1M_Neurons.tsv \
        --replica 'aws' \
        --layout bundle

The download fails with many hard-link errors:

File "/home/user001/.local/lib/python3.7/site-packages/hca/dss/init.py", line 584, in _download_and_link_to_filestore hardlink(file_store_path, file_path) File "/home/user001/.local/lib/python3.7/site-packages/hca/dss/util/init.py", line 50, in hardlink os.link(source, link_name) PermissionError: [Errno 1] Operation not permitted: '.hca/v2/files_2_4/8f/aec6/8faec66817969ae6f847b0c649e7328af6085d88e71b32b0e3a8284df4cd88f7' -> '33855cf6-6f3e-4b8f-9cb4-c2b2ea9f528d.2019-05-16T211813.099000Z/dissociation_protocol_0.json'

My file systems do not support hard links across directories. Is there an option to use soft links instead?

Why is the data downloaded to .hca and then hard-linked into the target directory? Could hca download directly to the target directory, or move the files from .hca to the target directory?

theathorn commented 4 years ago

@fangpingmu What type of file system is the current working directory on when running this? An NFS share or local file system? Any other relevant info such as OS version, etc. would be helpful.

fangpingmu commented 4 years ago

They are NFS file systems. We have multiple NFS file systems, and I have tested on BeeGFS and ZFS. These file systems do not support hard links across directories.
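
For reference, here is a quick standalone check for whether a given mount allows cross-directory hard links (an illustrative sketch, not part of dcp-cli):

import os
import tempfile

def supports_cross_directory_hardlinks(path):
    """Return True if the file system at `path` allows a hard link whose
    source and destination live in different directories."""
    with tempfile.TemporaryDirectory(dir=path) as tmp:
        src = os.path.join(tmp, 'src')
        subdir = os.path.join(tmp, 'subdir')
        os.mkdir(subdir)
        open(src, 'w').close()
        try:
            os.link(src, os.path.join(subdir, 'link'))
        except OSError:
            return False
        return True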

I believe that AWS and GCP object storage do not support soft links. I also tried changing the hardlink call in hca/dss/util/__init__.py to a symlink, but that did not work.
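
A plain swap of os.link for os.symlink would likely leave dangling links here: the filestore path in the traceback ('.hca/v2/...') is relative to the download root, while a symbolic link is resolved relative to its own directory. A hypothetical softlink helper that accounts for this (a sketch, not part of dcp-cli):

import os

def softlink(source, link_name):
    # os.symlink stores the target string verbatim, so a relative `source`
    # must be rewritten relative to the directory containing the link.
    target = os.path.relpath(source, os.path.dirname(link_name) or '.')
    os.symlink(target, link_name)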

The only temporary solution is to copy the files from .hca to the target directory; afterwards I delete the .hca folder. I modified ~/.local/lib/python3.7/site-packages/hca/dss/util/__init__.py as follows:

import errno
import logging
import os
import shutil

log = logging.getLogger(__name__)

def hardlink(source, link_name):
    """
    Create a hardlink in a thread-safe way, and revert to copying if the
    link cannot be created.
    """
    try:
        os.link(source, link_name)
    except FileExistsError:
        # It's possible that the user created a different file with the same name as
        # the one we're trying to download. Thus we need to check whether the inode
        # is different and raise an error in that case.
        source_stat = os.stat(source)
        dest_stat = os.stat(link_name)
        # Check the device first because different drives can have the same inode number
        if source_stat.st_dev != dest_stat.st_dev or source_stat.st_ino != dest_stat.st_ino:
            raise
    except OSError as e:
        if e.errno == errno.EMLINK:
            # FIXME: Copying is not space efficient; see https://github.com/HumanCellAtlas/dcp-cli/issues/453
            log.warning('Failed to link source `%s` to destination `%s`; reverting to copying', source, link_name)
            shutil.copyfile(source, link_name)
        else:
            # Local modification: copy instead of re-raising, so that EPERM on
            # file systems that forbid cross-directory hard links does not
            # abort the download.
            log.warning('Failed to link source `%s` to destination `%s`; reverting to copying', source, link_name)
            shutil.copyfile(source, link_name)

hannes-ucsc commented 4 years ago

We should include errno 1 (EPERM) in the set of errnos for which we fall back to copying.

I'm hesitant to fall back to copying on every failure, but I may be convinced otherwise. I think there is a class of problems that are intermittent or easily resolved, and for those we actually want to raise the error so the user can fix it.
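
A sketch of what that could look like, limiting the fallback to the two errnos discussed in this thread (illustrative only; the FileExistsError handling from the current implementation is omitted for brevity):

import errno
import logging
import os
import shutil

log = logging.getLogger(__name__)

# Errnos for which falling back to copying is acceptable: EMLINK (link limit
# reached) and EPERM (errno 1, seen on NFS mounts that forbid cross-directory
# hard links). Anything else is re-raised so the user can fix it.
COPY_FALLBACK_ERRNOS = frozenset({errno.EMLINK, errno.EPERM})

def hardlink(source, link_name):
    try:
        os.link(source, link_name)
    except OSError as e:
        if e.errno in COPY_FALLBACK_ERRNOS:
            log.warning('Failed to link `%s` to `%s`; reverting to copying', source, link_name)
            shutil.copyfile(source, link_name)
        else:
            raise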

hannes-ucsc commented 4 years ago

Release together with fix for #515, then point to release, demo and close.

hannes-ucsc commented 3 years ago

No further DCP CLI releases planned. We will not be able to demo this. Closing.