dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Provide a function to recursively put a directory with subdirectories to HDFS #124

Open · FlorianWilhelm opened this issue 7 years ago

FlorianWilhelm commented 7 years ago

Right now it is only possible to recursively delete a directory. It would be nice if hdfs3 came with an out-of-the-box function for recursively pushing a directory to HDFS.

A naive implementation would be something like:

import os


class cd(object):
    """Context manager for temporarily changing the current working directory"""

    def __init__(self, new_path):
        self.new_path = os.path.expanduser(new_path)

    def __enter__(self):
        self.old_path = os.getcwd()
        os.chdir(self.new_path)

    def __exit__(self, etype, value, traceback):
        os.chdir(self.old_path)


def put_dir(hdfs_client, origin_path, destination_path):
    """Recursively push a local directory to HDFS

    Args:
        hdfs_client: hdfs3.HDFileSystem instance
        origin_path: local directory to upload
        destination_path: destination directory on HDFS
    """
    hdfs_client.mkdir(destination_path)
    with cd(origin_path):
        for root, dirs, files in os.walk('.'):
            # Map the local subdirectory onto the HDFS destination and
            # normalise away the leading './' that os.walk yields.
            dest_path = os.path.normpath(os.path.join(destination_path, root))
            for dirname in dirs:
                hdfs_client.mkdir(os.path.join(dest_path, dirname))
            for filename in files:
                file_path = os.path.join(root, filename)
                hdfs_client.put(file_path, os.path.join(dest_path, filename))
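
For context, the helper above could be invoked roughly like this (the host, port and paths are placeholders):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='namenode', port=8020)  # placeholder connection details
put_dir(hdfs, '~/mydata', '/user/me/mydata')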

But this is pretty slow in practice.

martindurant commented 7 years ago

This seems like a thing we can certainly do. I would use os.walk with the origin_path rather than changing the working directory, and I would make this a method on the HDFileSystem rather than a separate function. Would you care to make these changes and submit a PR, with tests? It would be nice to have the reverse method for fetching a set of directories to the local disc.
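
For illustration, walking origin_path directly (instead of changing the working directory) could look roughly like the sketch below; attaching it as a put_dir method on HDFileSystem is an assumption here, not part of the current hdfs3 API:

import os


def put_dir(self, origin_path, destination_path):
    """Sketch of a possible HDFileSystem method: recursively upload a directory."""
    origin_path = os.path.abspath(os.path.expanduser(origin_path))
    self.mkdir(destination_path)
    for root, dirs, files in os.walk(origin_path):
        # Rebuild the matching HDFS path from the part of root below origin_path.
        rel = os.path.relpath(root, origin_path)
        dest_root = destination_path if rel == '.' else '/'.join([destination_path, rel])
        for dirname in dirs:
            self.mkdir('/'.join([dest_root, dirname]))
        for filename in files:
            self.put(os.path.join(root, filename), '/'.join([dest_root, filename]))

A symmetric get_dir could list the remote tree and call self.get for each file found.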

For further thought: the code would be essentially the same with s3fs or gcsfs, and could be implemented in parallel with dask to make a full distcp-like clone.
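
As a rough sketch of the parallel idea, each file upload could become a dask task; put_dir_parallel and its walking logic are hypothetical, while hdfs_client.mkdir/put and dask.delayed/compute are existing APIs:

import os

from dask import delayed, compute


def put_dir_parallel(hdfs_client, origin_path, destination_path):
    """Sketch: upload every file as a separate dask task so puts can run concurrently."""
    origin_path = os.path.abspath(os.path.expanduser(origin_path))
    tasks = []
    for root, dirs, files in os.walk(origin_path):
        rel = os.path.relpath(root, origin_path)
        dest_root = destination_path if rel == '.' else '/'.join([destination_path, rel])
        hdfs_client.mkdir(dest_root)  # directories are cheap, create them eagerly
        for filename in files:
            tasks.append(delayed(hdfs_client.put)(os.path.join(root, filename),
                                                  '/'.join([dest_root, filename])))
    compute(*tasks)  # the default threaded scheduler runs the puts in parallel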

FlorianWilhelm commented 7 years ago

@martindurant The only problem is that it is really slow compared to calling hdfs dfs -put mydir*, so I think it's not really a practical solution. For me it was only a quick hack to see whether it's worth the effort of implementing it in hdfs3. Do you know why hdfs dfs is so much faster? I guess they do something like concatenating all the files into one big chunk, which reduces the overhead of many small files and lots of calls.

martindurant commented 7 years ago

I am surprised that things are slow for you. Have you set up short-circuit operations as here? Would you mind profiling your code to find out where the slowness is? Could it be that hdfs dfs is using multiple threads/processes (something we could in theory implement)?
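
For reference, one way to profile the recursive put is with the standard library's cProfile, assuming the put_dir helper from above is in scope (the connection details and paths are placeholders):

import cProfile
import pstats

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='namenode', port=8020)  # placeholder connection details

# Profile the recursive upload and report where the time is spent.
cProfile.run("put_dir(hdfs, 'mydir', '/user/me/mydir')", 'put_dir.prof')
pstats.Stats('put_dir.prof').sort_stats('cumulative').print_stats(20)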