Open FlorianWilhelm opened 7 years ago
This seems like a thing we can certainly do. I would use os.walk with the origin_path rather than changing the working directory, and I would make this a method on the HDFileSystem rather than a separate function. Would you care to make these changes and submit a PR, with tests? It would be nice to have the reverse method for fetching a set of directories to the local disc.
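To make the os.walk suggestion concrete, here is a minimal sketch, not the code from this thread. The name put_tree and its arguments are hypothetical, and fs is assumed to be any object exposing mkdir(path) and put(local_file, remote_file), which is how I understand HDFileSystem to behave. It is written as a plain function so it can be tried without a cluster; as a method, fs would become self.

```python
import os
import posixpath


def put_tree(fs, origin_path, target_path):
    """Recursively copy the local directory origin_path to target_path on fs.

    Assumption: fs exposes mkdir(path) and put(local_file, remote_file).
    The working directory is never changed; every path is built explicitly.
    """
    origin_path = os.path.abspath(origin_path)
    for dirpath, _dirnames, filenames in os.walk(origin_path):
        # Position of the current directory relative to the upload root.
        rel = os.path.relpath(dirpath, origin_path)
        if rel == ".":
            remote_dir = target_path
        else:
            # Remote paths always use "/" regardless of the local OS.
            remote_dir = posixpath.join(target_path, *rel.split(os.sep))
        # os.walk is top-down, so parents are created before children.
        fs.mkdir(remote_dir)
        for name in filenames:
            fs.put(os.path.join(dirpath, name),
                   posixpath.join(remote_dir, name))
```

Usage would look like put_tree(hdfs, "mydir", "/user/me/mydir"), where hdfs is an already connected client and both paths are placeholders.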
For further thought: the code would be essentially the same with s3fs or gcsfs, and could be implemented in parallel with dask to make a full distcp-like clone.
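A rough sketch of the parallel idea, using a standard-library thread pool instead of dask to keep it dependency-free. All names here (plan_uploads, put_tree_parallel, max_workers) are hypothetical. It assumes fs exposes mkdir(path) and put(local_file, remote_file), and that one client object can safely be shared between threads, which would need checking for each backend.

```python
import os
import posixpath
from concurrent.futures import ThreadPoolExecutor


def plan_uploads(origin_path, target_path):
    """Walk a local tree and return (remote_dirs, [(local, remote), ...])."""
    origin_path = os.path.abspath(origin_path)
    remote_dirs, transfers = [], []
    for dirpath, _dirnames, filenames in os.walk(origin_path):
        rel = os.path.relpath(dirpath, origin_path)
        if rel == ".":
            remote_dir = target_path
        else:
            remote_dir = posixpath.join(target_path, *rel.split(os.sep))
        remote_dirs.append(remote_dir)
        for name in filenames:
            transfers.append((os.path.join(dirpath, name),
                              posixpath.join(remote_dir, name)))
    return remote_dirs, transfers


def put_tree_parallel(fs, origin_path, target_path, max_workers=8):
    """Create directories sequentially, then upload files concurrently.

    Assumption: fs.put may be called from several threads at once.
    Returns the number of files submitted for upload.
    """
    remote_dirs, transfers = plan_uploads(origin_path, target_path)
    # Directories come out of os.walk parents-first, so create them in order.
    for remote_dir in remote_dirs:
        fs.mkdir(remote_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all uploads and re-raises the first failure.
        list(pool.map(lambda pair: fs.put(*pair), transfers))
    return len(transfers)
```

Because nothing in it is HDFS-specific, the same function should work with any filesystem object that offers those two methods, which is the point made above about s3fs and gcsfs.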
@martindurant The only problem is that it is really slow compared to calling hdfs dfs -put mydir*, so I don't think it's a practical solution. For me it was only a quick hack to see whether it's worth the effort of implementing it in hdfs3. Do you know why hdfs is so much faster? I guess they do something like concatenating all the files into one big chunk, which reduces the overhead of many small files and lots of calls.
I am surprised that things should be slow for you. Have you set short-circuit operations as here?
Would you mind profiling your code to find out where the slowness is? Could it be that hdfs dfs is using multiple threads/processes (something we could in theory implement)?
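For the profiling step, something along these lines should be enough; it uses only cProfile and pstats from the standard library. The helper name profile_call and the top parameter are made up for illustration. Note that time spent inside a C library is attributed to the Python call that entered it, so the report shows which wrapper call is slow rather than what happens below it.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

You would wrap the upload function in it, for example result, report = profile_call(upload, "mydir", "/user/me/mydir") with upload standing in for whatever function does the copying, and then print(report) to see whether the time goes into opening files, writing, or directory creation.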
Right now it is only possible to recursively delete a directory. It would be nice if hdfs3 came with a function out of the box that allows recursively pushing a directory to HDFS.
A naive implementation would be something like:
But this is pretty slow in practice.