Deewiant / glob

Haskell library for glob matching
https://deewiant.iki.fi/projects/glob/
Other
25 stars 8 forks source link

Variant of globDir which does not collect a list of non-matching files #11

Closed hdgarrood closed 7 years ago

hdgarrood commented 7 years ago

After a little bit of digging it appears that the cause of a globbing performance issue in the PureScript compiler is all the calls to getRecursiveContents inside the didn'tMatch function in System.Glob.FilePath.Directory. Would you consider exposing an alternative version of the globDirWith function which only returns the matches? I imagine that many other users of this library don't need the list of non-matching files either.

Deewiant commented 7 years ago

Sounds sensible enough. glob and the fabulously named globDir1 already do that anyway. (It'd probably make sense to rename at least globDir1 in conjunction with this work.)

Another possibility may be to wrap these "get non-matching files" getRecursiveContents calls in unsafeInterleaveIO, which should also provide the performance improvement desired while being easier to implement?

Did I understand correctly from your comment in #10 that you want to work on this? I can probably get to it within a week or two if you'd rather have me do it, but my personal time constraints would almost certainly make me go for the lower-effort solution.

hdgarrood commented 7 years ago

Yes, I'd like to work on this. I'll also try sprinkling in some calls to unsafeInterleaveIO to see what effect that has. I'm not sure what the new API would look like, though. What do you think would be a good new name for globDir1, for instance? Would we want to provide an alternate version of globDir as well?

Deewiant commented 7 years ago

I was thinking that at minimum there would have to be a different version of globDirWith. I think naming is bound to get awkward with this approach: globDirNoUnmatchedWith?? Maybe rather rename the existing one to globDirUnmatchedWith? All the functions should match whatever convention is selected, except for plain glob which I feel is a good "entry-level" function as-is.

Ideally I think that at this point a new options structure should exist. Then we'd have something like:

data GlobOptions = GlobOptions {
   matchOptions :: MatchOptions,
   includeUnmatched :: Bool,
}

-- If includeUnmatched is false, the second half of the returned pair is [].
globDirWith :: GlobOptions -> [Pattern] -> FilePath -> IO ([[FilePath]], [FilePath])

I feel that globDir itself should also return only IO [[FilePath]], removing the surprising return type distinction between it and globDir1. I don't think it's necessary to have a function with the current type of globDir, especially since I can't think of a good name for it. ;-)

There may also be value in structuring the return type a bit more, either just going to IO ([[FilePath]], Maybe [FilePath]) or more dubiously a full-blown data type, something like the following:

data GlobResult = GlobResult {
   matched :: [[FilePath]],
   unmatched :: Maybe [FilePath],
}

But that feels like it might end up being overkill.

hdgarrood commented 7 years ago

Ok, that all sounds sensible. I was testing with the following program:

module Main where

import System.Environment (getArgs)
import System.FilePath.Glob (glob)

main = do
  args <- getArgs
  files <- concat <$> mapM glob args
  files' <- concat <$> mapM glob files
  putStrLn $ show files'

passing a glob bower_components/purescript-*/src/**/*.purs, which was matching 544 files, and I got the following results:

current

real 0m21.861s user 0m16.608s sys 0m5.072s

with unsafeInterleaveIO to return all unmatched files lazily

real 0m1.507s user 0m1.168s sys 0m0.332s

with modified GlobOptions to avoid collecting unmatched files at all

real 0m0.951s user 0m0.688s sys 0m0.256s

Since I've already done the work and the GlobOptions approach is probably also better from the point of view of memory usage, I think that one looks more promising.

Deewiant commented 7 years ago

Great!