charlesetc / feather

A shell library for OCaml
MIT License
77 stars 8 forks source link

Story for globbing #13

Open charlesetc opened 3 years ago

charlesetc commented 3 years ago

I'm not sure whether globbing should be a first class feature or a sibling library or maybe there is already a library that handles it well in OCaml but it'd be nice to have some pre-thought-out answer about it.

Firobe commented 3 years ago

Some leads about pre-made solutions:

In my opinion, dune-glob seems like the best solution. It supports not only wildcards but things like **, ?, {}, and [], and it's actively maintained. However it lacks a bit of documentation since it's internal to dune (it is mainly documented here), and it may be awkward to depend on an internal lib (and the interface may change without warning, maybe).

If we only want/need *, the more recent ocaml-glob may be a good solution, or implementing it directly in Feather since it's not that complicated. IMHO, globbing is a nice feature to have and it would make sense to have in Feather rather than in a sibling library.

Firobe commented 3 years ago

Actually there are two other solutions that I didn't come across until now, that are probably better (and on opam):

charlesetc commented 3 years ago

Thanks for doing this research! Some quick thoughts:

ocaml-glob 1: I really like this approach but it doesn't support ** which seems very useful. I wouldn't mind resurrecting it, but again **...

ocaml-glob 2: also doesn't support **

dune-glob would be nice except except its semantics for ** are weird, no?

** matches any character that is not . followed by anything, except if it comes first in which case it matches anything

Re.Glob seems like it provides what we'd want but also might be a large dependency given our existing small set. We can't include Re.Glob without also including Re.Perl, Re.Posix, and Re.Emacs. Maybe it's negligible though. What do you think?

I find the boolean logic in path_glob a little strange, but it could be useful. I'm a little sad that a single glob has to be quoted awkwardly, i.e. this is for boolean logic:

Glob.parse "<foo/**> or <**/*.ml>"

but then for a normal glob:


Glob.parse "\"foo/**.ml\""

I suppose we could write a helper that just adds the quotes around your item to construct something without the boolean language? Seems kind of hacky because you could escape from the quote, as in Feather.our_helper "foo/*.ml\" or <**/*.ml> or \"" which would translate to Glob.parse "\"foo/*.ml\" or <**/*.ml> or \"\"". Might still be useful?

Alternatively we could try to open a PR for path_glob to expose a way to parse out just a Path_glob.Ast.atom, without the Formula.t.

Firobe commented 3 years ago

Thanks for the answer !

dune-glob would be nice except except its semantics for ** are weird, no?

** matches any character that is not . followed by anything, except if it comes first in which case it matches anything

Indeed, I didn't notice that. It would be weird to work around that.

Re.Glob seems like it provides what we'd want but also might be a large dependency given our existing small set. We can't include Re.Glob without also including Re.Perl, Re.Posix, and Re.Emacs. Maybe it's negligible though. What do you think?

While Re.Perl/Posix/Emacs seems quite negligible in size, Re.Core would be the main offender here (necessary to match in Re.Glob). However in practice the loading time seems negligible. I could do some benchmarks to confirm this.

I find the boolean logic in path_glob a little strange, but it could be useful. I'm a little sad that a single glob has to be quoted awkwardly [...] I suppose we could write a helper that just adds the quotes around your item to construct something without the boolean language? Seems kind of hacky because you could escape from the quote, as in Feather.our_helper "foo/*.ml\" or <**/*.ml> or \"" which would translate to Glob.parse "\"foo/*.ml\" or <**/*.ml> or \"\"". Might still be useful?

Yeah, I didn't notice this either before. I don't think any user will make use of something a little obscure like boolean logic in the context of Feather since it feels quite alien to what we're used to in shell, so IMHO using path_glob is not worth the troubles.


All in all, I feel like Re.Glob is the best solution available, as a battle-tested, portable library supporting **. I don't think the dependency size is a problem here, but again if needed I can try to do some benchmarks as we did with Base/Core.

charlesetc commented 3 years ago

That all makes sense and I agree Re would be a fine dependency to have.

I am a little unsure of the actual behavior we'd want to have. I'm thinking this could be a type:

val glob : ~cwd:string -> string -> string list

and then someone could pass the same cwd they might pass to "run":

let cwd = "tmp" in
process "ls" (glob "*.txt" ~cwd) |> run ~cwd

but that has the problem that people have to remember to pass ~cwd to both for it to make sense.

Also, I'm imagining this would ls the particular directory and result in a list of files that match with the files in that directory. That works fine for *, but we'd want to know whether to use find for **. Things get a bit weird with {} too: you would only be able to match against files that are present in the filesystem. So something like this would not work even though you might expect it to:

process "cp" (glob "{a,b}.txt") |> run

because the destination b.txt doesn't exist yet.

I'm not really sure what the takeaway is. Maybe we build something ourself or wait for peolpe to ask for it?

Firobe commented 3 years ago

For the interface, I'm thinking of another way that may be more practical for both us and the users: do not expand the glob operators until the pipeline is run (and only if activated when running).

For example

let cwd = "tmp" in
process "ls" [ "*.txt" ] |> run ~cwd ~glob

would expand and flatten *.txt. It seems easy enough to implement since we already have a convient tree that can be explorer or just expand on the fly.

I'm seeing two advantages to this :

  1. The user does not have to pass cwd twice
  2. It allows to mix glob arguments and normal arguments non-awkwardly, just like in Bash, for example process "gcc" [ "-c"; "*.c" ] can be expanded as ["-c"; "a.c"; "b.c"; "..."] if the pipeline is run with globing enabled.

The one disadvantage I see is that if we enable globbing for the whole pipeline, we cannot disable it locally (this could be worked around with a ~no_glob flag for process tho)


For the semantics, I think we should take from Bash:

I think all of this can be achieved by taking a close look to the possible flags in Re.Glob such as path_name, anchored and expand_braces.


I'm not really sure what the takeaway is. Maybe we build something ourself or wait for peolpe to ask for it?

No matter what this feature is not urgent in any way so no need to rush a solution or come up with a solution at all!

charlesetc commented 3 years ago

That makes sense! I much prefer the run ~glob interface you suggest. Judging by its interface, I'm doubtful Re.Glob will help with expand_braces because it needs a set of things to match against. So I think no matter what Feather would have to do some heavy lifting. I probably don't see myself doing that implementation in the near future but I'll mark this as "help wanted" to see if anyone comes along!