Open mattc-eostar opened 3 years ago
I'm facing the same issue with adlfs 2021.10.0. It seems to be caused by Azure Blob Storage not supporting copying of directories (even if they are empty).
This is what seems to be happening:
fs.mv() calls fs.copy(), which in turn calls fs.expand_path() to get the list of files to copy. When recursive=True, this returns all the files and directories under the path. The last line of fs.expand_path() sorts the list of paths before returning it:
return list(sorted(out))
So given this set of files and directories:
container/
    folder/
        sub1/
            file1.txt
            file2.txt
        sub2/
            file3.txt
and we call fs.mv('container/folder', 'container/folder2', recursive=True), then fs.expand_path() returns:
[
    'container/folder',
    'container/folder/sub1',
    'container/folder/sub1/file1.txt',
    'container/folder/sub1/file2.txt',
    'container/folder/sub2',
    'container/folder/sub2/file3.txt'
]
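To illustrate why the directories end up interleaved with the files, here is a minimal sketch in plain Python. It does not call fsspec or adlfs; it only applies the same sorted() step to the example paths, and the variable names are made up for illustration.

```python
# The same entries as in the example above, deliberately shuffled.
found = {
    "container/folder/sub2/file3.txt",
    "container/folder/sub1",
    "container/folder",
    "container/folder/sub1/file2.txt",
    "container/folder/sub2",
    "container/folder/sub1/file1.txt",
}

# A string sorts before any longer string that starts with it,
# so each directory comes out ahead of the entries beneath it.
ordered = sorted(found)
for p in ordered:
    print(p)
```

The sort itself is harmless; the point is only that the returned list contains directory entries alongside the files.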
fs.copy() then iterates through this list to perform the copy operation:
for p1, p2 in zip(paths, path2):
    try:
        self.cp_file(p1, p2, **kwargs)
    except FileNotFoundError:
        if on_error == "raise":
            raise
The problem is that Azure Blob Storage does not support copying of directories. If we attempt the copy on each of the individual paths listed above, the copy succeeds on the files (e.g. container/folder/sub1/file1.txt) but fails on the directories.
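As a rough way to see which entries would fail, the expanded list can be split into directories and files. The sketch below is only an illustration: partition_paths is a hypothetical helper, and it assumes an entry is a directory when some other entry lies beneath it. That holds for the example but would miss empty directories; against a real filesystem one would ask fs.isdir() instead.

```python
def partition_paths(paths):
    """Split expanded paths into (directories, files) using only the strings."""
    dirs, files = [], []
    for p in paths:
        prefix = p.rstrip("/") + "/"
        # Assumption: an entry with something underneath it is a directory.
        if any(other.startswith(prefix) for other in paths):
            dirs.append(p)
        else:
            files.append(p)
    return dirs, files


expanded = [
    "container/folder",
    "container/folder/sub1",
    "container/folder/sub1/file1.txt",
    "container/folder/sub1/file2.txt",
    "container/folder/sub2",
    "container/folder/sub2/file3.txt",
]

dirs, files = partition_paths(expanded)
print("would fail:", dirs)
print("would copy:", files)
```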
@mattc-eostar @aloysius-lim
Whenever you create a Storage Account on Microsoft Azure, there are several account kinds to choose from. I encountered the same issue when writing to an account with kind = "StorageV2". In my case, this kind did not support the move and copy methods for moving data from temporary storage. I created another Storage Account with kind = "BlobStorage" and tried the same operation; that account kind worked fine for me and supported the move and copy methods.
How can this issue be fixed? It has been there for a long time with ADLS v2.
It's not a great solution, but I wrote a custom function as a workaround that finds all the files first and then moves them. It uses the builtin concurrency to keep things somewhat performant.
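# List files only; the directory entries are what make the copy fail
srcList = fs.find(src, withdirs=False)
# Replace only the first occurrence of src so a repeated name deeper in the path is left alone
dstList = [file.replace(src, dst, 1) for file in srcList]
fs.mv(srcList, dstList)
# Clean up whatever is left under the source
fs.rm(src, recursive=True)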
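The destination list is built by swapping the source prefix for the destination. One thing to watch for is a source string that recurs deeper in a path. The sketch below uses a hypothetical helper, remap_paths, which is not part of fsspec or adlfs, to make the prefix swap explicit. It assumes the listed paths start with src exactly as given.

```python
def remap_paths(files, src, dst):
    """Swap the leading `src` prefix of each path for `dst`."""
    src = src.rstrip("/")
    dst = dst.rstrip("/")
    out = []
    for f in files:
        # Refuse paths that are not under src instead of guessing.
        if f != src and not f.startswith(src + "/"):
            raise ValueError(f"{f!r} is not under {src!r}")
        out.append(dst + f[len(src):])
    return out


files = ["data/raw/data/file.txt", "data/readme.txt"]

# An unrestricted str.replace() rewrites every occurrence of "data".
naive = [f.replace("data", "archive") for f in files]
# The prefix swap touches only the leading component.
remapped = remap_paths(files, "data", "archive")

print(naive)
print(remapped)
```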
The fs.rm(recursive=True) also seems to struggle. I wrote an fs_rm function:
from collections import defaultdict

# Delete the files first
fileList = fs.find(path, withdirs=False)
if fileList:
    fs.rm(fileList)
# Split by level (cannot delete non-empty directories)
dirList = fs.find(path, withdirs=True)
if fs.exists(path):
    dirList.append(path)
dirsByLevel = defaultdict(list)
for folder in dirList:
    dirsByLevel[len(folder.split('/'))].append(folder)
# Delete from deepest to shallowest
for level in sorted(dirsByLevel.keys(), reverse=True):
    fs.rm(dirsByLevel[level])
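The deepest-first ordering can be checked on its own, without touching storage. This is a minimal sketch with a hypothetical helper, group_by_depth, that reproduces just the grouping step on the example directories. It measures depth by counting path components, as the code above does.

```python
from collections import defaultdict


def group_by_depth(dirs):
    """Return batches of directories, deepest level first."""
    by_level = defaultdict(list)
    for d in dirs:
        # Depth is the number of '/'-separated components.
        by_level[len(d.strip("/").split("/"))].append(d)
    return [by_level[level] for level in sorted(by_level, reverse=True)]


batches = group_by_depth([
    "container/folder",
    "container/folder/sub1",
    "container/folder/sub2",
])

for batch in batches:
    print(batch)
```

Each batch would be handed to fs.rm() in turn, so a parent is only removed after everything beneath it is gone.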
What happened:
Error occurred on move.
Checked the data lake and the files did copy to new location successfully but were not removed from previous location.
What you expected to happen:
The files within that directory should be moved to the new directory and then removed from the old directory.
Minimal Complete Verifiable Example:
Anything else we need to know?:
Environment: Databricks, adlfs==2021.7.0