Open charlesetc opened 3 years ago
Some leads about pre-made solutions:
In my opinion, dune-glob seems like the best solution. It supports not only wildcards but things like **
, ?
, {}
, and []
, and it's actively maintained. However it lacks a bit of documentation since it's internal to dune (it is mainly documented here), and it may be awkward to depend on an internal lib (and the interface may change without warning, maybe).
If we only want/need *
, the more recent ocaml-glob
may be a good solution, or implementing it directly in Feather since it's not that complicated. IMHO, globbing is a nice feature to have and it would make sense to have in Feather rather than in a sibling library.
Actually there are two other solutions that I didn't come across until now, that are probably better (and on opam):
path_glob
(https://gitlab.com/gasche/path_glob), an up to date and documented library extracted from ocamlbuild.Re.glob
(https://github.com/ocaml/ocaml-re/blob/master/lib/glob.mli) from the wildly used ocaml-re libraryThanks for doing this research! Some quick thoughts:
ocaml-glob 1: I really like this approach but it doesn't support **
which seems very useful. I wouldn't mind resurrecting it, but again **
...
ocaml-glob 2: also doesn't support **
dune-glob would be nice except except its semantics for **
are weird, no?
**
matches any character that is not.
followed by anything, except if it comes first in which case it matches anything
Re.Glob seems like it provides what we'd want but also might be a large dependency given our existing small set. We can't include Re.Glob without also including Re.Perl, Re.Posix, and Re.Emacs. Maybe it's negligible though. What do you think?
I find the boolean logic in path_glob a little strange, but it could be useful. I'm a little sad that a single glob has to be quoted awkwardly, i.e. this is for boolean logic:
Glob.parse "<foo/**> or <**/*.ml>"
but then for a normal glob:
Glob.parse "\"foo/**.ml\""
I suppose we could write a helper that just adds the quotes around your item to construct something without the boolean language? Seems kind of hacky because you could escape from the quote, as in Feather.our_helper "foo/*.ml\" or <**/*.ml> or \""
which would translate to Glob.parse "\"foo/*.ml\" or <**/*.ml> or \"\""
. Might still be useful?
Alternatively we could try to open a PR for path_glob to expose a way to parse out just a Path_glob.Ast.atom, without the Formula.t.
Thanks for the answer !
dune-glob would be nice except except its semantics for
**
are weird, no?
**
matches any character that is not.
followed by anything, except if it comes first in which case it matches anything
Indeed, I didn't notice that. It would be weird to work around that.
Re.Glob seems like it provides what we'd want but also might be a large dependency given our existing small set. We can't include Re.Glob without also including Re.Perl, Re.Posix, and Re.Emacs. Maybe it's negligible though. What do you think?
While Re.Perl/Posix/Emacs seems quite negligible in size, Re.Core would be the main offender here (necessary to match in Re.Glob). However in practice the loading time seems negligible. I could do some benchmarks to confirm this.
I find the boolean logic in path_glob a little strange, but it could be useful. I'm a little sad that a single glob has to be quoted awkwardly [...] I suppose we could write a helper that just adds the quotes around your item to construct something without the boolean language? Seems kind of hacky because you could escape from the quote, as in
Feather.our_helper "foo/*.ml\" or <**/*.ml> or \""
which would translate toGlob.parse "\"foo/*.ml\" or <**/*.ml> or \"\""
. Might still be useful?
Yeah, I didn't notice this either before. I don't think any user will make use of something a little obscure like boolean logic in the context of Feather since it feels quite alien to what we're used to in shell, so IMHO using path_glob is not worth the troubles.
All in all, I feel like Re.Glob is the best solution available, as a battle-tested, portable library supporting **
. I don't think the dependency size is a problem here, but again if needed I can try to do some benchmarks as we did with Base/Core.
That all makes sense and I agree Re would be a fine dependency to have.
I am a little unsure of the actual behavior we'd want to have. I'm thinking this could be a type:
val glob : ~cwd:string -> string -> string list
and then someone could pass the same cwd they might pass to "run":
let cwd = "tmp" in
process "ls" (glob "*.txt" ~cwd) |> run ~cwd
but that has the problem that people have to remember to pass ~cwd to both for it to make sense.
Also, I'm imagining this would ls
the particular directory and result in a list of files that match with the files in that directory. That works fine for *
, but we'd want to know whether to use find
for **
. Things get a bit weird with {}
too: you would only be able to match against files that are present in the filesystem. So something like this would not work even though you might expect it to:
process "cp" (glob "{a,b}.txt") |> run
because the destination b.txt
doesn't exist yet.
I'm not really sure what the takeaway is. Maybe we build something ourself or wait for peolpe to ask for it?
For the interface, I'm thinking of another way that may be more practical for both us and the users: do not expand the glob operators until the pipeline is run (and only if activated when running).
For example
let cwd = "tmp" in
process "ls" [ "*.txt" ] |> run ~cwd ~glob
would expand and flatten *.txt
. It seems easy enough to implement since we already have a convient tree that can be explorer or just expand on the fly.
I'm seeing two advantages to this :
cwd
twiceprocess "gcc" [ "-c"; "*.c" ]
can be expanded as ["-c"; "a.c"; "b.c"; "..."]
if the pipeline is run with globing enabled.The one disadvantage I see is that if we enable globbing for the whole pipeline, we cannot disable it locally (this could be worked around with a ~no_glob
flag for process tho)
For the semantics, I think we should take from Bash:
*
globs on files in the working directory (or sub-directory of the parent path), as you say{a, b, c}
expands in just strings, without regard for the file-system, like in Bash**
is a bit more tricky yeah. If there is a **
in the pattern, find
could be used.I think all of this can be achieved by taking a close look to the possible flags in Re.Glob such as path_name
, anchored
and expand_braces
.
I'm not really sure what the takeaway is. Maybe we build something ourself or wait for peolpe to ask for it?
No matter what this feature is not urgent in any way so no need to rush a solution or come up with a solution at all!
That makes sense! I much prefer the run ~glob
interface you suggest. Judging by its interface, I'm doubtful Re.Glob will help with expand_braces because it needs a set of things to match against. So I think no matter what Feather would have to do some heavy lifting. I probably don't see myself doing that implementation in the near future but I'll mark this as "help wanted" to see if anyone comes along!
I'm not sure whether globbing should be a first class feature or a sibling library or maybe there is already a library that handles it well in OCaml but it'd be nice to have some pre-thought-out answer about it.