crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0
236 stars 59 forks

How do I check whether any files match a complex wildcard pattern? #364

Closed yustoris closed 4 years ago

yustoris commented 4 years ago

I've already read https://github.com/crs4/pydoop/issues/12, but I could not figure out how to check whether any files match a more complex pattern, e.g. /something/a-b-[cdef]-*/part-*. In my case, I do not know in advance where the wildcards appear in the pattern.

Do I have to walk all paths from the root of HDFS? I would rather not, because there are too many files in my HDFS.
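One way to avoid starting from the HDFS root when the wildcard positions are unknown is to split the pattern at its first wildcard component and begin the search at the literal directory before it. The following is a minimal sketch, not part of pydoop: `split_literal_prefix` is a hypothetical helper, and it assumes absolute, `/`-separated patterns that use only fnmatch-style wildcards (`*`, `?`, `[...]`).

```python
import re

# Characters that fnmatch-style patterns treat as wildcards.
_WILDCARD = re.compile(r"[*?\[]")


def split_literal_prefix(pattern):
    """Split an absolute glob pattern into (literal prefix, remaining components).

    Hypothetical helper: the prefix is the deepest directory that contains
    no wildcard, so a search can start there instead of at "/".
    """
    parts = pattern.split("/")
    for i, part in enumerate(parts):
        if _WILDCARD.search(part):
            break
    else:
        i = len(parts)  # no wildcard anywhere: the whole pattern is literal
    prefix = "/".join(parts[:i]) or "/"
    return prefix, parts[i:]


prefix, rest = split_literal_prefix("/something/a-b-[cdef]-*/part-*")
print(prefix)  # /something
print(rest)    # ['a-b-[cdef]-*', 'part-*']
```

With the pattern from this issue, only the tree under `/something` would need to be listed, and the remaining components say how many levels deep a match can be.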

simleo commented 4 years ago

If you need more complex pattern matching than fnmatch can offer, you probably need to use regular expressions. In any event, I don't see how you can avoid walking the whole tree where the files you need to match might be. There is a walk tool for this. You can apply your matching pattern (with fnmatch or re) to every item yielded by walk.
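A rough sketch of the matching step described above, using only the standard library. The names `matches` and `any_match` are hypothetical, and `paths` stands in for whatever the walk yields; turning each yielded item into a plain path string is left out here because it depends on the pydoop version in use. Since a plain `fnmatch` call lets `*` match across `/`, this sketch compares the path and the pattern one component at a time.

```python
import fnmatch


def matches(path, pattern):
    """Return True if each path component matches the corresponding pattern component."""
    path_parts = path.strip("/").split("/")
    pattern_parts = pattern.strip("/").split("/")
    if len(path_parts) != len(pattern_parts):
        return False  # different depth, so it cannot be a full match
    return all(
        fnmatch.fnmatchcase(p, q) for p, q in zip(path_parts, pattern_parts)
    )


def any_match(paths, pattern):
    """Return True as soon as one of the given paths matches the pattern."""
    return any(matches(p, pattern) for p in paths)


# Hard-coded paths for illustration; in practice they would come from the walk.
sample = ["/something/a-b-x-1/part-00000", "/something/a-b-c-1/part-00000"]
print(any_match(sample, "/something/a-b-[cdef]-*/part-*"))  # True
```

Because `any_match` consumes the iterable lazily, it stops at the first hit, which helps when the question is only whether a matching file exists.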

yustoris commented 4 years ago

Thank you for your quick response! I've already tried the walk tool in pydoop, but it took roughly 10 to 50 times longer than the bare hadoop command...

However, as you suggest, it seems there is no way to search more efficiently without walking all the files.