fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Proposal to change implementation of globbing for possible performance increase #1355

Open tfelbr opened 1 year ago

tfelbr commented 1 year ago

Hello, I am using the built-in SFTP implementation to access an SFTP server. Unfortunately, this server is pretty slow, so I used globbing in the hope that it would speed things up.

The files on this server are organized into different subdirectories. Only files in subdirs whose names start with an "A" are relevant to me. The URL I tried looks like this:

sftp://user:password@sftp_host/root/path/A*/*.zip

The outcome was as expected: only files inside the dirs starting with an "A" were returned. But performance was still bad, so I debugged the whole thing and discovered that the filesystem still lists all subdirectories regardless of their names. I had a look at the implementation of the glob() function in AbstractFileSystem and believe these lines are responsible:

allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
# Escape characters special to python regex, leaving our supported
# special characters in place.
# See https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
# for shell globbing details.
pattern = (
    "^"
    + (
        path.replace("\\", r"\\")
        .replace(".", r"\.")
        .replace("+", r"\+")
        .replace("//", "/")
        .replace("(", r"\(")
        .replace(")", r"\)")
        .replace("|", r"\|")
        .replace("^", r"\^")
        .replace("$", r"\$")
        .replace("{", r"\{")
        .replace("}", r"\}")
        .rstrip("/")
        .replace("?", ".")
    )
    + "$"
)
pattern = re.sub("/[*]{2}", "=SLASH_DOUBLE_STARS=", pattern)
pattern = re.sub("[*]{2}/?", "=DOUBLE_STARS=", pattern)
pattern = re.sub("[*]", "[^/]*", pattern)
pattern = re.sub("=SLASH_DOUBLE_STARS=", "(|/.*)", pattern)
pattern = re.sub("=DOUBLE_STARS=", ".*", pattern)
pattern = re.compile(pattern)

out = {
    p: allpaths[p]
    for p in sorted(allpaths)
    if pattern.match(p.replace("//", "/").rstrip("/"))
}

It seems the filesystem first looks up all files and filters them afterwards. While this certainly delivers the expected results, it is still slow, as the server has to list every directory regardless.
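To make the quoted translation concrete, here is a standalone replica of it applied to the reporter's pattern (a sketch only; the path is assumed to be already protocol-stripped, as _strip_protocol would leave it):

```python
import re

# Standalone replica of the escaping/translation quoted above, applied
# to the reporter's pattern (protocol and credentials already stripped).
path = "root/path/A*/*.zip"

pattern = (
    "^"
    + (
        path.replace("\\", r"\\")
        .replace(".", r"\.")
        .replace("+", r"\+")
        .replace("//", "/")
        .replace("(", r"\(")
        .replace(")", r"\)")
        .replace("|", r"\|")
        .replace("^", r"\^")
        .replace("$", r"\$")
        .replace("{", r"\{")
        .replace("}", r"\}")
        .rstrip("/")
        .replace("?", ".")
    )
    + "$"
)
# placeholder dance so "**" and "*" can get different regex meanings
pattern = re.sub("/[*]{2}", "=SLASH_DOUBLE_STARS=", pattern)
pattern = re.sub("[*]{2}/?", "=DOUBLE_STARS=", pattern)
pattern = re.sub("[*]", "[^/]*", pattern)
pattern = re.sub("=SLASH_DOUBLE_STARS=", "(|/.*)", pattern)
pattern = re.sub("=DOUBLE_STARS=", ".*", pattern)

print(pattern)  # ^root/path/A[^/]*/[^/]*\.zip$

regex = re.compile(pattern)
print(bool(regex.match("root/path/ABC/file.zip")))  # True
print(bool(regex.match("root/path/XYZ/file.zip")))  # False
```

The resulting regex is correct, but it only ever sees paths that find() has already fetched, which is exactly the complaint: the matching happens after the full listing.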

I would like to propose changing the implementation to make it more performant, and not only for SFTP. My first naive idea is to extend the existing find() and walk() functions (or add new ones) with the ability to apply regex filters while traversing the tree. The globbed path could be split up and a regex created for every path segment, as is currently done for the complete path. This list of regexes could then be used to filter each subdirectory while descending.
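A rough sketch of that idea (the helper name is hypothetical, not fsspec API; fs is assumed to be an AbstractFileSystem-like object whose ls(detail=True) returns dicts with "name" and "type" keys, as fsspec's does):

```python
import fnmatch
import posixpath

# Hypothetical sketch of the proposal: split the glob into per-segment
# shell patterns and prune non-matching directories while descending,
# instead of listing the whole tree first and filtering afterwards.
def pruned_glob(fs, root, segments):
    """E.g. segments=["A*", "*.zip"] for the pattern root/A*/*.zip."""
    matches = []

    def _descend(path, segments):
        head, *rest = segments
        for info in fs.ls(path, detail=True):
            name = posixpath.basename(info["name"].rstrip("/"))
            if not fnmatch.fnmatchcase(name, head):
                continue  # prune: this subtree is never listed
            if not rest:
                matches.append(info["name"])
            elif info["type"] == "directory":
                _descend(info["name"], rest)

    _descend(root, segments)
    return sorted(matches)
```

For the reporter's URL this would be called as pruned_glob(fs, "/root/path", ["A*", "*.zip"]), and directories not starting with "A" would never be listed on the server at all.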

Please tell me what you think and whether this could be a possible solution. In case I missed or overlooked something, feel free to correct me!

martindurant commented 1 year ago

Ref: https://github.com/fsspec/filesystem_spec/pull/1263

martindurant commented 1 year ago

One thing to note is that in remote blob filesystems, what counts is the number of calls, and find() is often one-shot and the fastest way to proceed, as opposed to listing each level and every possible sublevel. However, when the code goes explicitly through walk, you have a good point.

I linked a recent PR which allows for the directories to walk to be edited by the caller during iteration if in top-down (depth-first) mode, and this could handily solve the case for you.
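As a rough illustration of that mechanism (mirroring os.walk semantics; the topdown keyword and in-place pruning of dirs are assumed to behave as in the linked PR, and the helper name and filters here are hypothetical):

```python
import fnmatch

# With a top-down walk, editing `dirs` in place before the walker
# descends stops it from ever listing the pruned subtrees, just as
# with os.walk. Helper name and the "A*"/".zip" filters are examples.
def zips_under_a_dirs(fs, root):
    out = []
    for path, dirs, files in fs.walk(root, topdown=True):
        if path == root:
            # keep only first-level directories matching "A*"
            dirs[:] = [d for d in dirs if fnmatch.fnmatchcase(d, "A*")]
        out.extend(f"{path}/{f}" for f in files if f.endswith(".zip"))
    return sorted(out)
```

Because the pruning happens before the walker descends, subdirectories not starting with "A" are never listed on the server.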

tfelbr commented 1 year ago

Sorry for my late reply, there were some external circumstances that prevented me from responding.

So thank you, first of all, for your answer! That is actually a really practical and good-to-know feature, but I'm still not sure it can help in my case. I am using the fsspec.open_files() function, which in turn calls glob() and then find(). find(), in its base implementation, delegates to walk(). As I understand it, neither open_files() nor glob() nor find() lets me control the iteration of walk() and modify the list of dirnames. I would have to call walk() manually, but then I would lose all the features the other methods and functions provide, especially glob().

Thank you for your advice again, and again please correct me if I missed anything here.

martindurant commented 1 year ago

I see what you are saying.

I don't quite picture how you might change the API in open_files, and all the way down, to allow filtering/pruning of the filesystem walk. You could, however, use the existing mechanism I mention above to get listings only in the parts of the tree you require, and extract a list of matching paths to pass to open_files; the actual pattern matching in glob is pretty simple.

ntnhaatj commented 1 month ago

Hi, I don't know whether this issue is the same as what you raised above.

I'm experiencing a performance problem when using a glob pattern like hdfs:///user/hive/warehouse/db/table/part_date=20240802*/*.parquet on the HDFS filesystem. It either hangs or takes 10+ minutes to return results, while the same pattern with the hdfs dfs -ls command returns results in just a few seconds.

Interestingly, when I use fsspec.glob with the path hdfs:///user/hive/warehouse/db/table/part_date=20240802*, the result is returned in just a few seconds:

%timeit -r3
fs.glob("hdfs:///user/hive/warehouse/db/table/part_date=20240822*")

['/user/hive/warehouse/db/table/part_date=2024082200', '/user/hive/warehouse/db/table/part_date=2024082201', '/user/hive/warehouse/db/table/part_date=2024082202', '/user/hive/warehouse/db/table/part_date=2024082203']

2.79 s ± 98.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

1. It seems like fsspec glob might be walking through all files under the root path hdfs:///user/hive/warehouse/db/table and only then matching the pattern. Is that correct?
2. Do you have any tips to improve performance in this case?

thanks

martindurant commented 4 weeks ago

You should be able to turn on logging to see exactly which operations are being performed: are directories that the fixed upper part of the path should exclude being listed anyway?
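For reference, one way to turn that on with the standard logging module (the exact logger names each backend uses are an assumption here; fsspec also ships a helper for this, I believe fsspec.utils.setup_logging):

```python
import logging

# Show every operation as it happens. Individual backends (e.g. the
# sftp or arrow-based implementations) may log under their own names,
# so treat the exact logger names below as assumptions.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
```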

It's worth noting that HDFS support is provided via pyarrow; we could intercept glob and improve it, but as things stand, this is not really calling fsspec code (I think).

ntnhaatj commented 3 weeks ago

@martindurant I profiled the fs.glob function on my HDFS directory:

# 1
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*")
# 2
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*/*.c000")

In both cases, most of the time is spent on the line that returns allpaths under root:

allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)

-  case 1: finishes in a few seconds; nearly all of the time is spent in find()

Total time: 2.71698 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545

Line #  Hits  Time          Per Hit   % Time  Line Contents
===========================================================
   545                                        def glob(self, path, maxdepth=None, **kwargs):
   ...                                            (docstring, setup and pattern translation: ~0.0% of time)
   604     1  2696123724.0  3e+09      99.2       allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
   606     1      285860.0  285860.0    0.0       pattern = glob_translate(path + ("/" if ends_with_sep else ""))
   609     2    17518646.0  9e+06       0.6       out = {p: info for p, info in sorted(allpaths.items()) if pattern.match(...)}
   624     1        3190.0  3190.0      0.0       return list(out)


-  case 2: hangs, since it seems to take a lot of time to finish (~50 mins)

Timer unit: 1e-09 s

Total time: 44.7239 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545

Line #  Hits  Time     Per Hit  % Time  Line Contents
=====================================================
   545                                  def glob(self, path, maxdepth=None, **kwargs):
   ...                                      (docstring, setup and pattern translation: ~0.0% of time)
   603     1  63764.0  63764.0     0.0      print(kwargs)
   604     1    4e+10    4e+10   100.0      allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
   ...                                      (pattern matching and return: no measurable time recorded)

Total time: 44.7239 s
File: /tmp/ipykernel_7873/3826323558.py
Function: profiled_glob at line 9

It seems that with the current glob implementation, all paths under root are fetched first and the pattern is matched afterwards on the client side. Could we have an option to push the glob filter down to the filesystem to boost performance?

martindurant commented 3 weeks ago

It would appear that find on HDFS is not respecting the depth parameter. The first of your calls should effectively be only a single call to ls, and the second a number of additional calls to the first-level subdirectories. Unbounded find should only be called when the pattern contains "**". Of course, find can be implemented with ls/walk, so it would be worthwhile to find out exactly what is getting called here.
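To see what find() should be asked for, one can replicate the root/maxdepth computation from glob() (a simplified sketch; the real code also folds a user-supplied maxdepth into the "**" case):

```python
# Simplified replica of glob()'s root/depth computation: everything
# before the first magic character fixes the listing root, and the
# remaining "/" count bounds maxdepth (unless "**" appears).
def root_and_depth(path):
    # index of the first glob magic character, or len(path) if none
    min_idx = min(
        (path.find(c) if path.find(c) >= 0 else len(path)) for c in "*?["
    )
    if "/" in path[:min_idx]:
        min_idx = path[:min_idx].rindex("/")
        root = path[: min_idx + 1]
    else:
        root = ""
    depth = path[min_idx + 1 :].count("/") + 1
    return root, (None if "**" in path else depth)

root, depth = root_and_depth(
    "/user/hive/warehouse/db/table/part_date=20240802*/*.parquet"
)
print(root)   # /user/hive/warehouse/db/table/
print(depth)  # 2
```

So the second HDFS pattern above should bound find to depth 2 under /user/hive/warehouse/db/table/; an unbounded listing suggests the bound is being lost somewhere between glob and the pyarrow-backed find.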

ntnhaatj commented 3 weeks ago

@martindurant I see; my usage of fsspec is within DuckDB to interact with HDFS, which implicitly calls glob to scan directories. Unfortunately, the performance isn't good, likely due to the large number of files and partitions and to HDFS namenode limitations.

martindurant commented 6 days ago

Sorry, this has gone quiet. Was there a specific way you think we can implement faster globbing for your situation?