Closed hdgarrood closed 7 years ago
Sounds sensible enough. glob
and the fabulously named globDir1
already do that anyway. (It'd probably make sense to rename at least globDir1
in conjunction with this work.)
Another possibility may be to wrap these "get non-matching files" getRecursiveContents
calls in unsafeInterleaveIO
, which should also provide the performance improvement desired while being easier to implement?
Did I understand correctly from your comment in #10 that you want to work on this? I can probably get to it within a week or two if you'd rather have me do it, but my personal time constraints would almost certainly make me go for the lower-effort solution.
Yes, I'd like to work on this. I'll also try sprinkling in some calls to unsafeInterleaveIO
to see what effect that has. I'm not sure what the new API would look like, though. What do you think would be a good new name for globDir1
, for instance? Would we want to provide an alternate version of globDir
as well?
I was thinking that at minimum there would have to be a different version of globDirWith
. I think naming is bound to get awkward with this approach: globDirNoUnmatchedWith
?? Maybe rather rename the existing one to globDirUnmatchedWith
? All the functions should match whatever convention is selected, except for plain glob
which I feel is a good "entry-level" function as-is.
Ideally I think that at this point a new options structure should exist. Then we'd have something like:
data GlobOptions = GlobOptions {
matchOptions :: MatchOptions,
includeUnmatched :: Bool,
}
-- If includeUnmatched is false, the second half of the returned pair is [].
globDirWith :: GlobOptions -> [Pattern] -> FilePath -> IO ([[FilePath]], [FilePath])
I feel that globDir
itself should also return only IO [[FilePath]]
, removing the surprising return type distinction between it and globDir1
. I don't think it's necessary to have a function with the current type of globDir
, especially since I can't think of a good name for it. ;-)
There may also be value in structuring the return type a bit more, either just going to IO ([[FilePath]], Maybe [FilePath])
or more dubiously a full-blown data type, something like the following:
data GlobResult = GlobResult {
matched :: [[FilePath]],
unmatched :: Maybe [FilePath],
}
But that feels like it might end up being overkill.
Ok, that all sounds sensible. I was testing with the following program:
module Main where
import System.Environment (getArgs)
import System.FilePath.Glob (glob)
main = do
args <- getArgs
files <- concat <$> mapM glob args
files' <- concat <$> mapM glob files
putStrLn $ show files'
passing a glob bower_components/purescript-*/src/**/*.purs
, which was matching 544 files, and I got the following results:
current
real 0m21.861s user 0m16.608s sys 0m5.072s
with unsafeInterleaveIO to return all unmatched files lazily
real 0m1.507s user 0m1.168s sys 0m0.332s
with modified GlobOptions
to avoid collecting unmatched files at all
real 0m0.951s user 0m0.688s sys 0m0.256s
Since I've already done the work and the GlobOptions
approach is probably also better from the point of view of memory usage, I think that one looks more promising.
Great!
After a little bit of digging it appears that the cause of a globbing performance issue in the PureScript compiler is all the calls to
getRecursiveContents
inside thedidn'tMatch
function inSystem.Glob.FilePath.Directory
. Would you consider exposing an alternative version of theglobDirWith
function which only returns the matches? I imagine that many other users of this library don't need the list of non-matching files either.