HumanCellAtlas / dcp-cli

DEPRECATED - HCA Data Coordination Platform Command Line Interface
https://hca.readthedocs.io/
MIT License

Develop script to restore the zarr directory structure following download #518

Closed: theathorn closed this issue 4 years ago

theathorn commented 4 years ago

See https://humancellatlas.zendesk.com/agent/tickets/174.

After selecting matrix files, generating a manifest, and using the HCA CLI to download the zarr files for a project, the zarr files end up in a flattened directory structure instead of the hierarchical one that zarr stores are expected to have.

A script needs to be provided to correct this.
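
To illustrate the mapping such a script has to perform, here is a minimal sketch. It assumes that a flattened file name encodes its nesting with a single separator character, either `/` or `!` (the two separators handled by the script posted later in this thread). The helper name `nested_path` and the example file name are hypothetical.

```python
from pathlib import PurePosixPath

# Separators a flattened name may use to encode nesting.
# Assumption: these match the ones handled by the script in this thread.
SEPARATORS = ('/', '!')


def nested_path(flat_name):
    """Return the nested relative path encoded in a flattened file name.

    Returns None when the name contains no separator, i.e. the file is
    not part of a nested store.
    """
    used = [sep for sep in SEPARATORS if sep in flat_name]
    if not used:
        return None
    if len(used) > 1:
        raise ValueError(f'{flat_name} mixes separators: {used}')
    return PurePosixPath(*flat_name.split(used[0]))


# Hypothetical flattened name, for illustration only.
print(nested_path('example.zarr!expression!0.0'))  # example.zarr/expression/0.0
```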

jessebrennan commented 4 years ago

Here is the script. It needs to be run with Python 3.6 or later. I've tested it and it seems to work fine.

#! /usr/bin/python3
import csv
import os
import sys
from pathlib import Path

from argparse import ArgumentParser

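def compose_zarrs(manifest_path):
    """Hard-link flattened zarr files into the nested layout their names encode."""
    manifest = Path(manifest_path)
    download_dir = manifest.parent.absolute()
    with open(manifest, newline='') as f:
        # The 'excel-tab' dialect already sets the tab delimiter.
        reader = csv.DictReader(f, dialect='excel-tab')
        rows = list(reader)
    # A flattened name encodes its nesting with one of these separators.
    dir_separators = ['/', '!']
    for row in rows:
        file_name = row['file_name']
        # Downloaded file, resolved against the manifest's directory.
        source_path = download_dir / row['file_path']
        seps = [sep for sep in dir_separators if sep in file_name]
        if len(seps) == 0:
            # Not part of a nested store; leave it where it is.
            continue
        elif len(seps) == 1:
            sep = seps[0]
            sub_path = Path(*file_name.split(sep))
            full_path = download_dir / sub_path
            full_path.parent.mkdir(parents=True, exist_ok=True)
            try:
                os.link(source_path, full_path)
            except FileExistsError:
                # The link already exists, e.g. from an earlier run.
                pass
        else:
            raise ValueError(f'File {file_name} has multiple separators: {seps}.')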

def main(argv):
    parser = ArgumentParser(
        description="Parse the manifest that is rewritten by the CLI download to get "
                    "the downloaded files' paths. Use this to compose zarr stores "
                    "into their expected, nested directory format."
    )
    parser.add_argument('file_path', help='path to manifest file')
    options = parser.parse_args(argv)
    compose_zarrs(options.file_path)

if __name__ == '__main__':
    main(sys.argv[1:])

jessebrennan commented 4 years ago

@achave11 Could you review this? Basically just test the program and see if it works as expected. LMK if you need more context for this.
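
One possible way to test it without a real download is sketched below. It builds a small fake download directory with a tab-separated manifest and then checks that every nested path is a hard link to the corresponding downloaded file. The column names `file_name` and `file_path` come from the script above; the helper names, the `blob_N` file names, and the example zarr entries are hypothetical, and the sketch assumes `!` is the separator in use.

```python
import csv
import os
from pathlib import Path


def make_fixture(download_dir):
    """Create a fake flattened download and its manifest; return the manifest path."""
    # Hypothetical entries: two files of a zarr store and one ordinary file.
    names = ['example.zarr!.zgroup', 'example.zarr!expression!0.0', 'notes.txt']
    manifest = Path(download_dir) / 'manifest.tsv'
    with open(manifest, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['file_name', 'file_path'],
                                delimiter='\t')
        writer.writeheader()
        for i, name in enumerate(names):
            stored = f'blob_{i}'  # hypothetical flattened storage name
            (Path(download_dir) / stored).write_text(name)
            writer.writerow({'file_name': name, 'file_path': stored})
    return manifest


def check_layout(manifest):
    """Assert each nested path exists and is the same file as its download.

    Returns the number of nested files that were checked.
    """
    download_dir = Path(manifest).parent
    checked = 0
    with open(manifest, newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            if '!' not in row['file_name']:
                continue
            nested = download_dir / Path(*row['file_name'].split('!'))
            original = download_dir / row['file_path']
            assert nested.is_file(), f'missing {nested}'
            assert os.path.samefile(nested, original), f'{nested} is not a link'
            checked += 1
    return checked
```

To use it, call `make_fixture` on a temporary directory, run the script on the returned manifest, and then call `check_layout` on the same manifest; with the fixture above it should report two checked files.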

theathorn commented 4 years ago

Sent to customer 3/4/20.

theathorn commented 4 years ago

Customer says OK to close this ticket.