tfelbr opened 1 year ago
One thing to note: in remote blob filesystems, what counts is the number of calls, and find() is often a one-shot call and the fastest way to proceed, as opposed to listing each level and every possible sublevel. However, when the code goes explicitly through walk(), you have a good point.
I linked a recent PR which allows for the directories to walk to be edited by the caller during iteration if in top-down (depth-first) mode, and this could handily solve the case for you.
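For readers finding this later, the pruning described above might look like the following (a minimal sketch against the local filesystem, assuming a recent fsspec whose walk() accepts topdown= and honors in-place edits to the dirnames list, as that PR describes):

```python
import os
import tempfile

import fsspec

fs = fsspec.filesystem("file")

# Build a small tree: A1/, A2/, B1/, each containing one file.
root = tempfile.mkdtemp()
for d in ("A1", "A2", "B1"):
    os.makedirs(os.path.join(root, d))
    with open(os.path.join(root, d, "data.txt"), "w") as f:
        f.write("x")

visited = []
for path, dirs, files in fs.walk(root, topdown=True):
    visited.append(path)
    # Editing `dirs` in place prunes the walk: "B1" should never be
    # listed, saving one remote call per skipped directory.
    dirs[:] = [d for d in dirs if d.startswith("A")]

print(sorted(os.path.basename(p) for p in visited[1:]))
```

On a remote filesystem each pruned directory is one fewer listing request, which is exactly where slow backends benefit.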
Sorry for my late reply, there were some external circumstances that prevented me from responding.
So thank you first of all for your answer! That is actually a really practical and good-to-know feature, but I'm still not sure it can help in my case. I am using the fsspec.open_files() function, which in turn calls glob() and then find(). find(), in its base implementation, propagates the call to walk(). As I understand it, neither the call to open_files() nor glob() nor find() lets me control the iteration of walk() and modify the list of dirnames. I would have to call walk() manually, but then I would lose all the features the other methods and functions provide, especially glob().
Thank you for your advice again, and again please correct me if I missed anything here.
I see what you are saying.
I don't quite picture how you might change the API in open_files and all the way down to allow filtering/pruning of the filesystem walk. Indeed you could use the existing way I mention above to get listings only in parts of the tree you require, and extract a list of matching paths to pass to open_files; the actual pattern matching in glob is pretty simple.
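Concretely, the workaround suggested above might be sketched like this (against the local filesystem, with fnmatch standing in for glob's pattern matching; the tree layout is invented for illustration):

```python
import fnmatch
import os
import tempfile

import fsspec

fs = fsspec.filesystem("file")

# Tiny tree: only "A*" directories hold files we care about.
root = tempfile.mkdtemp()
for d in ("A1", "B1"):
    os.makedirs(os.path.join(root, d))
    with open(os.path.join(root, d, "x.csv"), "w") as f:
        f.write("hello")

# Walk only the parts of the tree we need and collect concrete paths
# matching "<root>/A*/*.csv" ourselves ...
matches = []
for path, dirs, files in fs.walk(root, topdown=True):
    if path != root:  # pattern only matches files one level down
        matches += [
            os.path.join(path, f) for f in files if fnmatch.fnmatch(f, "*.csv")
        ]
    dirs[:] = [d for d in dirs if fnmatch.fnmatch(d, "A*")]

# ... then hand the explicit path list to open_files, bypassing its
# internal glob step entirely.
with fsspec.open_files(matches, mode="rt") as handles:
    contents = [fh.read() for fh in handles]
print(contents)
```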
Hi, I don't know whether this issue is the same as what you raised above.
I'm experiencing a performance problem when using a glob pattern like hdfs:///user/hive/warehouse/db/table/part_date=20240802*/*.parquet in the HDFS filesystem. It either hangs or takes 10+ minutes to return results. However, using the same pattern with the hdfs dfs -ls command returns results in just a few seconds.
Interestingly, when I use fsspec.glob with the path hdfs:///user/hive/warehouse/db/table/part_date=20240802*, the result is returned in just a few seconds:
%timeit -r3 fs.glob("hdfs:///user/hive/warehouse/db/table/part_date=20240822*")
['/user/hive/warehouse/db/table/part_date=2024082200', '/user/hive/warehouse/db/table/part_date=2024082201', '/user/hive/warehouse/db/table/part_date=2024082202', '/user/hive/warehouse/db/table/part_date=2024082203']
2.79 s ± 98.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
1) It seems like fsspec glob might be walking through all files under the root path hdfs:///user/hive/warehouse/db/table to fetch files, and then trying to match the pattern. Is that correct?
2) do you have any tips to increase performance in this case?
thanks
You should be able to turn on logging to see what exact operations are being performed: are directories excluded by the upper part of the path being listed?
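For example (a sketch; exact logger names vary by backend — the core library logs under "fsspec", while backends such as s3fs or gcsfs have their own loggers):

```python
import logging

# Route log records to stderr and turn fsspec's core logger up to
# DEBUG so that listing/info operations become visible.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)

# Subsequent calls such as fs.glob(...) will then log the underlying
# operations, showing whether excluded directories are still listed.
```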
It's worth noting that HDFS support is provided via pyarrow; we could intercept glob and improve it, but as things stand, this is not really calling fsspec code (I think).
@martindurant I profiled the fs.glob function on my HDFS directory:
# 1
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*")
# 2
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*/*.c000")
In both cases, most of the time is spent on the line that returns all paths under the root:
allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
Timer unit: 1e-09 s
Total time: 2.71698 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545
545 def glob(self, path, maxdepth=None, **kwargs):
546 """
547 Find files by glob-matching.
548
549 If the path ends with '/', only folders are returned.
550
551 We support ``"**"``,
552 ``"?"`` and ``"[..]"``. We do not support ^ for pattern negation.
553
554 The `maxdepth` option is applied on the first `**` found in the path.
555
556 kwargs are passed to ``ls``.
557 """
558 1 1552.0 1552.0 0.0 if maxdepth is not None and maxdepth < 1:
559 raise ValueError("maxdepth must be at least 1")
560
561 1 2433.0 2433.0 0.0 import re
562
563 1 1748.0 1748.0 0.0 seps = (os.path.sep, os.path.altsep) if os.path.altsep else (os.path.sep,)
564 1 2089.0 2089.0 0.0 ends_with_sep = path.endswith(seps) # _strip_protocol strips trailing slash
565 1 145822.0 145822.0 0.0 path = self._strip_protocol(path)
566 2 3828.0 1914.0 0.0 append_slash_to_dirname = ends_with_sep or path.endswith(
567 1 10524.0 10524.0 0.0 tuple(sep + "*" for sep in seps)
568 )
569 1 2993.0 2993.0 0.0 idx_star = path.find("*") if path.find("*") >= 0 else len(path)
570 1 2135.0 2135.0 0.0 idx_qmark = path.find("?") if path.find("?") >= 0 else len(path)
571 1 1081.0 1081.0 0.0 idx_brace = path.find("[") if path.find("[") >= 0 else len(path)
572
573 1 2500.0 2500.0 0.0 min_idx = min(idx_star, idx_qmark, idx_brace)
574
575 1 1193.0 1193.0 0.0 detail = kwargs.pop("detail", False)
576
577 1 11975.0 11975.0 0.0 if not has_magic(path):
578 if self.exists(path, **kwargs):
579 if not detail:
580 return [path]
581 else:
582 return {path: self.info(path, **kwargs)}
583 else:
584 if not detail:
585 return [] # glob of non-existent returns empty
586 else:
587 return {}
588 1 1666.0 1666.0 0.0 elif "/" in path[:min_idx]:
589 1 2499.0 2499.0 0.0 min_idx = path[:min_idx].rindex("/")
590 1 831.0 831.0 0.0 root = path[: min_idx + 1]
591 1 2788.0 2788.0 0.0 depth = path[min_idx + 1 :].count("/") + 1
592 else:
593 root = ""
594 depth = path[min_idx + 1 :].count("/") + 1
595
596 1 793.0 793.0 0.0 if "**" in path:
597 if maxdepth is not None:
598 idx_double_stars = path.find("**")
599 depth_double_stars = path[idx_double_stars:].count("/") + 1
600 depth = depth - depth_double_stars + maxdepth
601 else:
602 depth = None
603
604 1 2696123724.0 3e+09 99.2 allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
605
606 1 285860.0 285860.0 0.0 pattern = glob_translate(path + ("/" if ends_with_sep else ""))
607 1 5354.0 5354.0 0.0 pattern = re.compile(pattern)
608
609 2 17518646.0 9e+06 0.6 out = {
610 p: info
611 1 2848568.0 3e+06 0.1 for p, info in sorted(allpaths.items())
612 if pattern.match(
613 (
614 p + "/"
615 if append_slash_to_dirname and info["type"] == "directory"
616 else p
617 )
618 )
619 }
620
621 1 1150.0 1150.0 0.0 if detail:
622 return out
623 else:
624 1 3190.0 3190.0 0.0 return list(out)
- case 2: it hangs, since the call seems to take a lot of time to finish (~50 mins)
Timer unit: 1e-09 s
Total time: 44.7239 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545
545 def glob(self, path, maxdepth=None, **kwargs):
546 """
547 Find files by glob-matching.
548
549 If the path ends with '/', only folders are returned.
550
551 We support ``"**"``,
552 ``"?"`` and ``"[..]"``. We do not support ^ for pattern negation.
553
554 The `maxdepth` option is applied on the first `**` found in the path.
555
556 kwargs are passed to ``ls``.
557 """
558 1 1061.0 1061.0 0.0 if maxdepth is not None and maxdepth < 1:
559 raise ValueError("maxdepth must be at least 1")
560
561 1 1657.0 1657.0 0.0 import re
562
563 1 1552.0 1552.0 0.0 seps = (os.path.sep, os.path.altsep) if os.path.altsep else (os.path.sep,)
564 1 1239.0 1239.0 0.0 ends_with_sep = path.endswith(seps) # _strip_protocol strips trailing slash
565 1 49725.0 49725.0 0.0 path = self._strip_protocol(path)
566 2 1632.0 816.0 0.0 append_slash_to_dirname = ends_with_sep or path.endswith(
567 1 3535.0 3535.0 0.0 tuple(sep + "*" for sep in seps)
568 )
569 1 955.0 955.0 0.0 idx_star = path.find("*") if path.find("*") >= 0 else len(path)
570 1 731.0 731.0 0.0 idx_qmark = path.find("?") if path.find("?") >= 0 else len(path)
571 1 444.0 444.0 0.0 idx_brace = path.find("[") if path.find("[") >= 0 else len(path)
572
573 1 739.0 739.0 0.0 min_idx = min(idx_star, idx_qmark, idx_brace)
574
575 1 504.0 504.0 0.0 detail = kwargs.pop("detail", False)
576
577 1 5415.0 5415.0 0.0 if not has_magic(path):
578 if self.exists(path, **kwargs):
579 if not detail:
580 return [path]
581 else:
582 return {path: self.info(path, **kwargs)}
583 else:
584 if not detail:
585 return [] # glob of non-existent returns empty
586 else:
587 return {}
588 1 635.0 635.0 0.0 elif "/" in path[:min_idx]:
589 1 1503.0 1503.0 0.0 min_idx = path[:min_idx].rindex("/")
590 1 480.0 480.0 0.0 root = path[: min_idx + 1]
591 1 1396.0 1396.0 0.0 depth = path[min_idx + 1 :].count("/") + 1
592 else:
593 root = ""
594 depth = path[min_idx + 1 :].count("/") + 1
595
596 1 429.0 429.0 0.0 if "**" in path:
597 if maxdepth is not None:
598 idx_double_stars = path.find("**")
599 depth_double_stars = path[idx_double_stars:].count("/") + 1
600 depth = depth - depth_double_stars + maxdepth
601 else:
602 depth = None
603 1 63764.0 63764.0 0.0 print(kwargs)
604 1 4e+10 4e+10 100.0 allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
605 pattern = glob_translate(path + ("/" if ends_with_sep else ""))
606 pattern = re.compile(pattern)
607
608 out = {
609 p: info
610 for p, info in sorted(allpaths.items())
611 if pattern.match(
612 (
613 p + "/"
614 if append_slash_to_dirname and info["type"] == "directory"
615 else p
616 )
617 )
618 }
619
620 if detail:
621 return out
622 else:
623 return list(out)
Total time: 44.7239 s
File: /tmp/ipykernel_7873/3826323558.py
Function: profiled_glob at line 9
It seems that with the current glob implementation, all paths under the root are fetched and then matched against the pattern on the client side. Could we have an option to push the glob filter down to the filesystem to boost performance?
It would appear that find on HDFS is not respecting the depth parameter. The first of your calls should effectively be only a single call to ls, and the second a number of additional calls to the first-level subdirectories. An unbounded find should only be called when the pattern contains "**".
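To illustrate the bound, the depth computation from spec.py (visible in the profiles above) can be restated standalone; glob_depth is a hypothetical helper mirroring that logic for the maxdepth=None case, not fsspec API:

```python
def glob_depth(path):
    """Depth that glob() would pass to find(), mirroring the
    min_idx/depth logic in fsspec/spec.py for maxdepth=None."""
    if "**" in path:
        return None  # unbounded find
    # Index of the first wildcard character, or end of string.
    hits = [i for i in (path.find("*"), path.find("?"), path.find("[")) if i >= 0]
    min_idx = min(hits) if hits else len(path)
    if "/" in path[:min_idx]:
        # Back up to the last "/" before the wildcard: that is the root.
        min_idx = path[:min_idx].rindex("/")
    return path[min_idx + 1:].count("/") + 1

print(glob_depth("/user/hive/warehouse/db/table/part_date=20240825*"))        # 1
print(glob_depth("/user/hive/warehouse/db/table/part_date=20240825*/*.c000")) # 2
```

So the first pattern should cost one listing of the table directory, and the second one listing plus one per matching first-level subdirectory.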
Of course, find can be implemented with ls/walk, so it would be worthwhile finding out exactly what is getting called here.
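One quick way to find out (a hypothetical diagnostic, not an fsspec API) is to check whether the filesystem class overrides the generic find():

```python
import fsspec
from fsspec.spec import AbstractFileSystem

# The memory filesystem stands in here; in practice, instantiate the
# HDFS filesystem you are actually using.
fs = fsspec.filesystem("memory")

# True means find() is the generic ls()/walk()-based implementation
# inherited from AbstractFileSystem; False means the backend ships
# its own find(), which may ignore parameters like maxdepth.
uses_generic_find = type(fs).find is AbstractFileSystem.find
print(uses_generic_find)
```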
@martindurant I see. My usage of fsspec is within DuckDB to interact with HDFS, which implicitly calls glob to scan directories. Unfortunately, the performance isn't good, likely due to the large number of files and partitions and HDFS namenode limitations.
Sorry, this has gone quiet. Was there a specific way you think we can implement faster globbing for your situation?
Hello, I am using the built-in SFTP implementation to access an SFTP server. Unfortunately, this server is pretty slow, so I used globbing in the hope it would speed things up.
The files on this server are organized into different subdirectories. Only files in subdirs that start with an "A" are relevant to me. The URL I tried looks like this:
The outcome was as I expected: only files inside the dirs that start with an "A" were returned. But the performance was still bad, so I decided to debug the whole thing and discovered that the filesystem still lists all subdirectories regardless of their names. I had a look at the implementation of the glob() function in AbstractFileSystem and believe that these lines are responsible for that:
It seems the filesystem first looks up all files and filters them afterwards. While this certainly delivers the expected results, it is still slow, as the server has to list all directories anyway.
I would like to propose changing the implementation to make it more performant, not only for SFTP. My first naive idea is to extend the existing find() and walk() functions (or add new ones) with the ability to apply regex filters while traversing the tree. The globbed path could then be split up, and a regex could be created for every subdirectory, as is the case right now for the complete path. This list of regexes could then be used to filter the path of every subdirectory.
Please tell me what you think and if this could be a possible solution. In case I missed or overlooked something feel free to correct me!
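To make the proposal concrete, a naive sketch of the segment-wise matching might look like this (illustrative only: pruned_glob is a hypothetical helper, fnmatch.translate stands in for the per-segment regexes, and there is no "**" support):

```python
import fnmatch
import re

import fsspec

def pruned_glob(fs, root, segments):
    """Yield paths under `root` whose components match the glob
    `segments` level by level; a directory whose name fails its
    segment's regex is never listed, so its children cost nothing."""
    regexes = [re.compile(fnmatch.translate(s)) for s in segments]
    frontier = [root.rstrip("/")]
    for depth, rx in enumerate(regexes):
        last = depth == len(regexes) - 1
        next_frontier = []
        for base in frontier:
            for info in fs.ls(base, detail=True):
                name = info["name"].rstrip("/").rsplit("/", 1)[-1]
                if not rx.match(name):
                    continue  # pruned before any further listing
                if last:
                    yield info["name"]
                elif info["type"] == "directory":
                    next_frontier.append(info["name"])
        frontier = next_frontier

# Demo on the in-memory filesystem: equivalent to globbing "/t/A*/*.parquet"
# without ever listing the contents of /t/B1.
fs = fsspec.filesystem("memory")
fs.pipe("/t/A1/x.parquet", b"1")
fs.pipe("/t/A2/x.parquet", b"2")
fs.pipe("/t/B1/x.parquet", b"3")

out = sorted(pruned_glob(fs, "/t", ["A*", "*.parquet"]))
print(out)
```

The number of ls calls then scales with the number of *matching* directories per level rather than with the whole subtree, which is the behavior the SFTP case needs.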