JuliaLang / Juleps

Julia Enhancement Proposals

[WIP] Using types and not strings to represent Paths #26

Closed oxinabox closed 2 years ago

oxinabox commented 7 years ago

I started writing this over 4 months ago, but life got in the way. It's now in a state to take feedback from others. It could really do with it, I'm sure.

The current way paths are handled as strings hasn't changed much since it was written, mostly in https://github.com/JuliaLang/julia/commit/6f9fb22eccf7ecbcba9158f19bb24985623b3ca4 in January 2013.

Proposal Abstract:

Add an AbstractPath type and deprecate open(::AbstractString) in favour of open(::AbstractPath). AbstractPaths allow code to be written without caring where or how the data is stored. Using types for paths allows us to enforce some validity and consistency rules. This also allows multiple dispatch to differentiate between a path to a file and that file's contents as a string.

TODO

running todo list of things to add/change in the julep before it can stop being WIP

timholy commented 7 years ago

It has a slightly different emphasis, but I'm surprised not to see a reference to FileIO.jl here. I imagine your AbsolutePath and RelativePath could be added there. I do think that FileIO is the de facto standard, so I'm not yet convinced that this has to be done in Base. Of course, adoption rates may vary in different niches within the package ecosystem.

oxinabox commented 7 years ago

I was actually thinking of FileIO.jl when I wrote this, but I haven't yet written anything about it into the Julep. My thoughts are that this Julep and FileIO.jl are highly complementary.

To get information out of a file (or file-like object), you need two things

I can see how this could fit into FileIO.jl, but it would be a widening of scope from FileIO.jl's current mission: "FileIO aims to provide a common framework for detecting file formats and dispatching to appropriate readers/writers"

I see these as two separate but supporting tasks. I am glad you brought this up; I'll make a note to improve the julep with an extended version of this comment.

tkelman commented 7 years ago

See also https://github.com/Rory-Finnegan/FilePaths.jl

mbauman commented 7 years ago

I definitely see the case for adding path-like objects, but I'm not as certain about deprecating open(::AbstractString). It sounds like this is just a case where package authors need to widen their signatures. Would it solve your use-case if we generally advocated for methods to be written like f(::IO, ...) and f(x, ...) = open(io->f(io, ...), x)? If not, maybe you could add a bit more motivation for that change.
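A minimal sketch of the signature-widening pattern described above. The function name count_lines is hypothetical and chosen only for illustration; the fallback method forwards anything that open accepts (a string today, possibly a path type later) to the IO method.

```julia
# Core method: works on any IO stream.
count_lines(io::IO) = countlines(io)

# Fallback: anything `open` understands is opened and forwarded to the IO method.
count_lines(x) = open(io -> count_lines(io), x)

# Usage with a temporary file and with an in-memory buffer.
path, io = mktemp()
write(io, "a\nb\nc\n")
close(io)

n_file = count_lines(path)
n_buffer = count_lines(IOBuffer("x\ny\n"))
```

Because the fallback is untyped, a package written this way would not need changing if a path type later gained its own open method.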

oxinabox commented 7 years ago

@mbauman maybe I am wrong about the need to deprecate open(::AbstractString)

Maybe my instincts are wrong because I am not used to having multiple dispatch, and perhaps because of my dislike for TIMTOWTDI. You are right that package authors are used to PRs that ask them to widen their type signatures.

WRT encouraging the use of IO to match filenames: It is a partial fix (and something we should be doing anyway) but issues remain:

tkelman commented 7 years ago

The place where this would be a most valuable addition at this time is working towards a "virtual filesystem" abstraction which could then be used for code loading on remote nodes.

StefanKarpinski commented 7 years ago

Apparently Racket does this and people have told me that it's a big win. Maybe check out what they do?

c42f commented 7 years ago

I'm very positive about this julep.

@mbauman - can you describe in more detail the practical reasons for keeping open(::AbstractString)? To me, the reasoning given in this julep is compelling and deprecating this function seems worthwhile.

As a point of comparison, I've always found the roughly equivalent thing in C++ (boost::filesystem::path) to be irritating, intrusive and more effort than it's worth. I think this may be because:

  1. There's a huge weight of older code which assumes paths are strings, so you're constantly having to convert them. (Addressed here with the deprecation.)
  2. Writing path literals is a pain. (Addressed here with string macros.)
  3. The added functionality is not huge. (Addressed here with the improvements to multiple dispatch.)

To expand on point 1, deprecating open() for strings is also important so that the entire julia ecosystem can get on with using AbstractPath earlier rather than later. If not, I think we might end up with some packages preferring strings (if nothing else due to historical inertia from other languages), others preferring AbstractPath, and still others allowing both, by widening their type signatures to Any in places.

On the subject of literals, a thorny question: Should @p_str return a platform native file path, or should it somehow be a platform-independent syntax? On the one hand, it seems important for REPL usage that platform native paths are supported. On the other hand, for relative paths in packages, platform independence is fairly important. In both cases, path literals may be desired.

MikeInnes commented 7 years ago

I think Rust does this as well. They have something like a .to_path() trait on strings so you can use the basic apis as normal, so if we do a similar thing it doesn't have to be breaking. String macros can make this nice too, e.g. open(path"C:\foo").

mbauman commented 7 years ago

@mbauman - can you describe in more detail the practical reasons for keeping open(::AbstractString)?

It's a large deprecation, open("path/to/file") is the obvious thing and shared across many languages, and having open("…") mean something different would be surprising and subtle. From what I can see in their docs, both Racket and Rust allow either paths or strings in their open equivalents.

That's not to say deprecating open(::AbstractString) definitely shouldn't happen, but it needs to be well-motivated to overcome those trade-offs.

c42f commented 7 years ago

Yes, I'm not so convinced about having open() with a string as the contents, just due to historical inertia from many languages. But regardless of that, the first step of making it work only with paths is appealing to me.

oxinabox commented 7 years ago

Yes, I'm not so convinced about having open() with a string as the contents...

I see I was unclear here. I do not suggest that open(str::AbstractString) become what is currently open(IOBuffer(str)). I suggest that user functions like

function foo_process(content::AbstractString)
    ...
end

# can coexist with:
foo_process(io::IO) = foo_process(read(io, String))
foo_process(filename::AbstractPath) = open(foo_process, filename)  # using open(::Function, ::AbstractPath)

But to accomplish this, open(::AbstractString) is to be deprecated in favor of open(::AbstractPath), just so that package maintainers get that kick to tell them to update all their foo_process(filename::AbstractString) type functions that take a filepath, to take an AbstractPath. (Of course they could be contrary, and add the path conversion inside their code for foo_process(filename::AbstractString).)

I suggest that open(::AbstractString) be deprecated without anything taking its place. Maybe even left deprecated forever, with a link to the "how Julia is different to other languages" documentation page.

I feel like adding the deprecation to open would be a faster way to get changes made. The plan still works if open doesn't get deprecated, particularly if functions like joinpath are deprecated for join(::Path...), and other such. That is useful in and of itself for handling system differences.
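To make the dispatch idea concrete, here is a small runnable sketch. SketchAbstractPath, SketchLocalPath and word_count are hypothetical names invented for this example; they are not part of Base or of the julep text, and the open method below simply delegates to the existing string-based one.

```julia
# Hypothetical stand-in for the proposed AbstractPath hierarchy (not a Base type).
abstract type SketchAbstractPath end

struct SketchLocalPath <: SketchAbstractPath
    location::String
end

# Let `open(f, path)` work for the sketch type by delegating to the String method.
Base.open(f::Function, p::SketchLocalPath, args...; kwargs...) =
    open(f, p.location, args...; kwargs...)

# A string is always treated as content ...
word_count(content::AbstractString) = length(split(content))
# ... a stream is read into a string ...
word_count(io::IO) = word_count(read(io, String))
# ... and a path is opened and handed to the IO method.
word_count(p::SketchAbstractPath) = open(word_count, p)

file, io = mktemp()
write(io, "three short words")
close(io)

from_content = word_count("two words")
from_path = word_count(SketchLocalPath(file))
```

With this layout there is no ambiguity about whether a string argument names a file or holds the data itself.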

c42f commented 7 years ago

Sounds perfectly sensible.

What do you think about the return type of @path_str? Perhaps we could designate the posix syntax as the standard for relative paths, and have path"foo/bar" return a generic RelativePath type, which would then be usable as a relative path literal across different operating systems? Joining a PosixPath with a RelativePath would obviously then result in a PosixPath, and likewise for windows paths.
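One way to picture this is the sketch below. All type and function names (SketchPath, SketchRelativePath, SketchPosixPath, SketchWindowsPath, relative, join_paths, render) are assumptions made up for illustration; nothing here is an existing API.

```julia
# All names here are hypothetical; they only mirror the idea in the comment above.
abstract type SketchPath end

# A platform-independent relative path, stored as its segments.
struct SketchRelativePath <: SketchPath
    segments::Vector{String}
end

# Platform-specific absolute paths.
struct SketchPosixPath <: SketchPath
    segments::Vector{String}
end

struct SketchWindowsPath <: SketchPath
    drive::String
    segments::Vector{String}
end

# Relative literals use POSIX syntax ("foo/bar") regardless of the host system.
relative(s::AbstractString) = SketchRelativePath(String.(split(s, '/'; keepempty=false)))

# Joining keeps the platform flavour of the left-hand side.
join_paths(a::SketchPosixPath, b::SketchRelativePath) =
    SketchPosixPath(vcat(a.segments, b.segments))
join_paths(a::SketchWindowsPath, b::SketchRelativePath) =
    SketchWindowsPath(a.drive, vcat(a.segments, b.segments))
join_paths(a::SketchRelativePath, b::SketchRelativePath) =
    SketchRelativePath(vcat(a.segments, b.segments))

# Rendering is where the platform difference shows up.
render(p::SketchRelativePath) = join(p.segments, '/')
render(p::SketchPosixPath) = "/" * join(p.segments, '/')
render(p::SketchWindowsPath) = p.drive * "\\" * join(p.segments, '\\')

rel = relative("foo/bar")
posix = render(join_paths(SketchPosixPath(["home", "user"]), rel))
windows = render(join_paths(SketchWindowsPath("C:", ["Users", "user"]), rel))
```

The same relative value is reused for both platforms; only the absolute left-hand side decides how the result is rendered.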

What about absolute paths, and how does this play nicely across systems where users will want to use the native path format in the REPL? Some systems adopt a platform-independent standard for writing path literals (eg, cmake, I think). The only libraries I recall parse strings as native paths, which is operating system dependent.

Side note - this stuff should work really nicely as a hint for tab completion.

StefanKarpinski commented 7 years ago

I think we need to identify what the concrete advantages of path types are and then determine what we need to do to get those advantages. I'm not convinced that disallowing strings as path arguments is necessary to get the advantages. But then again, we're still a bit hazy on what the advantages are precisely, so clarifying that needs to be the next step.

simonbyrne commented 7 years ago

It would be useful to link to what other languages with similar ideas have done (and, ideally, the reasons why they made those choices). For example, the rust RFC is here and points out many interesting issues (e.g. the possibility of unpaired UTF-16 surrogates in Windows paths).

rofinn commented 6 years ago

I've been using FilePaths.jl for a while now and here are some notes from my experience.

Advantages:

1) Being able to dispatch on a path type (vs string) is really nice.
2) Adding an extra character (e.g., p"~/.julia/v0.6/FilePaths/") isn't a big issue and often helps with code readability if I'm just seeing p"FilePaths".

Disadvantages:

The main issue I ran into was that interop with other packages can be a bit annoying. I was always writing String(path) so I opted to subtype AbstractString for practical reasons, but this can often introduce method ambiguities when porting existing code. If a path type was provided in base and more widely used I could see going back to not subtyping AbstractString, but for now this seems like the best middle ground.

Overall, I think having a minimal file system path type hierarchy in base with appropriate string conversions would be a good step forward.

simonbyrne commented 6 years ago

I'm broadly in favour, but would prefer a non-single letter macro (perhaps path""?)

rofinn commented 6 years ago

Hmmm, I was mostly wanting to mimic r"^[a-z]*$"/Regex("^[a-z]*$") in base and if the macro name is too long it kind of defeats the point from my perspective (e.g., Path("FilePaths") is only two more characters than path"FilePaths").

StefanKarpinski commented 6 years ago

So far the only advantage cited here is that "being able to dispatch on a path type is really nice". The fact that p"/path/to/file" is only a character longer than "/path/to/file" is not actually an advantage – it's an absence of much disadvantage. I have a gut feeling that there might be real advantages here, but if there are, they're not being conveyed very effectively.

rofinn commented 6 years ago

FWIW, my view was just that we have DateTime instead of Int, Regex instead of String and IPv4 instead of Int because they're distinct concepts that have specific rules associated with them that do not apply to the more general C-ish representations (e.g., match(::Regex, ...) makes more sense than match(::String)). Similarly, basename and parent don't really make sense for strings, but they do for filesystem paths. Looks like this is pretty much the same argument proposed for pathlib being included in the python stdlib. NOTE: I could also see an argument for having a URI type in base for similar reasons.

I can't think of any "real" advantages apart from having a type that is distinct from a general string for representing filesystem paths feels a bit more ergonomic and has helped me avoid a few bugs... but that also summarizes why I'm not writing my code in C :)

c42f commented 6 years ago

Yes, it's capturing the semantic that paths are a "different kind of thing" which makes this interesting. Being able to use dispatch effectively is the most obvious sign that this might be worthwhile. Here's some minor advantages related to literals:

But these advantages are a bit of a sideshow, I think.

Perhaps some concrete use cases might be helpful. Here's a contribution from me (apologies that it's not fully concrete, it reflects work I've done, but more in C++ than julia).

Say I want to write code which passes around either S3 URLs or file paths pointing to some point cloud data. I don't want to open the resources right away, so I need to pass around something which is an address for the data, which is a perfect use for AbstractPath. Eventually I want to pass my AbstractPath to a hypothetical ThirdPartyPackage.jl which calls LasIO.jl which opens the stream using open() and reads some point cloud data in LAS format. To get this to work, all packages need to agree that AbstractPath is the right thing to use, or fall back to Any for the resource names.

simonbyrne commented 6 years ago

Other arguments:

  1. It distinguishes between cases where files and strings are both valid arguments. One case I came across recently was in SHA.jl, where sha1(::String) hashes the data in the string, but to hash the contents of a file you have to do SHA.sha1(open("filename")): this is different from similar functions such as Base.read. Similarly, we could use include(::String) instead of include_string.

  2. We can leverage dispatch for different types of paths, e.g.

    include(path"filename.jl")
    include(URI("www.example.com/run.jl"))
    include(GitPath(repo, commit, pathinrepo))

    could be defined, and then have recursive include calls work by overloading joinpath.

StefanKarpinski commented 6 years ago

The distinction between String as data and String as location seems significant. The irregularity between path"filename.jl" and URI("www.example.com/run.jl") doesn't seem great. I could see url"www.example.com/run.jl" or go the other way and use functions for all of the above.

Other advantages that I might hope for with path types:

mauro3 commented 6 years ago

Could this make ~/somefile and somedir/*.jpg work as expected?

c42f commented 6 years ago

Yes it seems like p"~/somefile" could generate a sensible type for a path relative to the user's home directory. This would be super useful when combined with tab completion.

The glob version p"somedir/*.jpg" could generate a GlobPath or some such type, which would also be very nice. Julia is already remarkably slick at calling external processes, but this would make it even better. I think globbing is mentioned in passing in the original text of the julep as well.

vtjnash commented 6 years ago

I don't think that's necessarily connected to using types. You can do that already:

using Glob
readdir(glob"somedir/*.jpg", expanduser("~") #= aka homedir() =#)

c42f commented 6 years ago

Good point. Though manually having to type expanduser is not quite the slick experience you might hope for if you're used to path expansion in the shell.

[edit: TBH I've tried using the literal "~/blah" out of habit before, been unsurprised that it doesn't work, and looked no further. Path literals potentially give us the opportunity to make this "just work" for users.]
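As a rough illustration of how a literal could make this "just work", the sketch below defines a string macro that applies expanduser to its argument. The macro name home_str is hypothetical, and a real path literal would presumably return a path type rather than a plain string.

```julia
# Hypothetical string macro: expands a leading "~" when the literal is evaluated.
macro home_str(s)
    :(expanduser($s))
end

expanded = home"~/blah"
```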

twolodzko commented 3 years ago

Check the issue linked above. It makes a similar proposal.

Adding to what was said, beyond path literals, I propose having a / method for joining paths, so that they feel almost like system paths and are instantly readable: p"/home/username" / var_directory / "file.txt", as Python's pathlib does.

Additionally, pathlib has some extra functionality worth considering, such as iterating over a path's parents, iterating over files within a path, wildcard paths, etc.
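For readers who have not seen the pathlib style, a minimal sketch of what such an operator could look like in Julia follows. SketchPath and the p_str macro are hypothetical and defined only for this example; this is not the FilePathsBase.jl or Base API.

```julia
# Hypothetical path wrapper; not the FilePathsBase.jl or Base API.
struct SketchPath
    location::String
end

# Path literal macro: p"..." builds a SketchPath.
macro p_str(s)
    :(SketchPath($s))
end

# `/` joins with joinpath, so separators are handled rather than blindly concatenated.
Base.:/(a::SketchPath, b::AbstractString) = SketchPath(joinpath(a.location, b))
Base.:/(a::SketchPath, b::SketchPath) = SketchPath(joinpath(a.location, b.location))

var_directory = "projects"
joined = p"/home/username" / var_directory / "file.txt"
```

Note that this overloads the division operator for a non-numeric meaning, which is a design choice in its own right.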

rofinn commented 3 years ago

FilePathsBase.jl already provides that functionality, though I don’t think the optional division operator overloading would make it into base.

https://github.com/rofinn/FilePathsBase.jl/issues/53

twolodzko commented 3 years ago

@StefanKarpinski mentioned ++ as path join syntax, / is more consistent with how we write paths, so seems to be more self-explanatory for the user.

rofinn commented 3 years ago

I agree, which is why I used it in FilePathsBase. I think the issue is just that we don't want to have an operator like / mean two completely different things. Also, / is a pretty unix centric choice :)

Discussion about different operators here https://github.com/rofinn/FilePathsBase.jl/issues/2

vtjnash commented 3 years ago

It's just logical: / is the file path divider

oxinabox commented 3 years ago

What we are actually doing when we write A/B is forming the quotient set of all paths with parent A, declaring equivalence as to whether they further have the parent B or not, then we enter the element of the quotient set which was for the ones that do have parent B and consider them as if we had not applied the equivalence.

tl;dr; / is just a set quotient operation on the set of all filepath parents.

twolodzko commented 3 years ago

@rofinn don't agree about /.

First, yes, it is a Unix convention, but people nowadays more often than not use Unix-like systems on their personal computers (Mac OS; even under Windows you can use a Bash shell, or even Ubuntu as "software") or when working remotely (computational servers, cloud computing, Docker, etc.). URLs also use this convention, so everyone seems to be familiar with it.

Second, currently * is used for concatenating strings. Honestly, I found * in Julia to be a strange choice, why would / be more "meaning two different things" than *? Using * for paths would be confusing (for me), since with paths we don't just concatenate them, but use joinpath that normalizes them. ++ proposed by @StefanKarpinski is used in Haskell for concatenating strings, so for combining string-like path objects it can be considered as confusing as well. Also, it's two characters in place where we could use one, and for strings most of the operators are not overloaded yet.

So / is simple and intuitive for paths. People coming from Python would find it as an almost instant replacement for pathlib functionality. Other users should find it similar to the system path, or URL separators.

rofinn commented 3 years ago

Again, that's largely why I opted to use / and keep it as an option. It just isn't available by default and I don't think it belongs in a base/stdlib implementation. I think it at least requires a using FilePathsBase: / from the end user to make it explicit what the syntax does.

StefanKarpinski commented 3 years ago

Being fastidious about not punning on operators is a pretty core Julian principle. It's fine if people do it in their own code, but mixing up "divide" and "concatenate this path with this other path" in one generic function is not really cool.

tpapp commented 3 years ago

Second, currently * is used for concatenating strings. Honestly, I found * in Julia to be a strange choice

Some people do, but the choice is made now (there is even a FAQ about it), so simply using it for paths would be somewhat consistent, as pretty much all of the arguments apply in a similar way to strings.

rofinn commented 3 years ago

The problem with reusing * is that then we can't use it for normal string concatenation:

p"foo" / "bar" / "baz" * ".txt" == p"foo/bar/baz.txt"

StefanKarpinski commented 3 years ago

Another possible approach is to allow interpolation into path strings with / in the string meaning path separator. I.e. p"/home/$user/$dir/$name.txt" would mean joinpath(Base.Filesystem.path_separator, "home", user, dir, "$name.txt"). An interesting question in that case would be what should happen if dir is an absolute path? Note that joinpath operation would discard the /home/$user part in that case. Of course, I don't think this problem is specific to the approach of writing p"/home/$user/$dir/$name.txt": the same question exists if you write p"/home" / user / dir / "$name.txt", in which case it also seems more surprising if dir being absolute caused you to get a path that didn't start with /home/$user than if you used joinpath.
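The joinpath behaviour mentioned here can be checked directly with plain strings. The values below (alice, notes, docs, /etc) are made up for illustration, and the results noted in the comments assume a POSIX system.

```julia
user = "alice"
name = "notes"

# Relative `dir`: every component is kept.
dir = "docs"
kept = joinpath("/", "home", user, dir, "$name.txt")      # "/home/alice/docs/notes.txt" on POSIX

# Absolute `dir`: joinpath restarts from it and drops everything before it.
dir = "/etc"
dropped = joinpath("/", "home", user, dir, "$name.txt")   # "/etc/notes.txt" on POSIX
```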

oxinabox commented 3 years ago

That seems interesting and I would need to think about it more.

But one key lack is that it can't be passed as an input to a higher order function and it can't be broadcast.

I often broadcast joinpath. (Though probably less now that readdir has an option to give a full path)
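For context, a small sketch of the two idioms mentioned here, using a temporary directory with made-up file names. The join keyword of readdir is available in Julia 1.4 and later.

```julia
dir = mktempdir()
foreach(n -> touch(joinpath(dir, n)), ["a.txt", "b.txt"])

# Broadcasting joinpath over the names returned by readdir ...
via_broadcast = joinpath.(dir, readdir(dir))

# ... gives the same result as the `join` keyword of readdir.
via_keyword = readdir(dir; join=true)

# joinpath is an ordinary function, so it can also be passed to higher-order functions.
via_map = map(n -> joinpath(dir, n), readdir(dir))
```

An interpolating literal has no function value to pass around, which is the gap being pointed out.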

StefanKarpinski commented 3 years ago

I often broadcast joinpath.

😬

tpapp commented 3 years ago

The problem with reusing * is that then we can't use it for normal string concatenation:

Not if you disallow * on mixtures of paths and strings, and require that operands are made into paths instead, or choose semantics so that path * string is concatenated (without path separators) while path * path is joined with path separators. E.g., the latter would be

p"foo" * p"bar" * p"baz" * ".txt" == p"foo/bar/baz.txt"

Personally, I am OK with joinpath.

StefanKarpinski commented 3 years ago

I think the most reasonable path (😬) forward would be:

Path literals can also have features like making it easier to write Windows paths when you have to.

oxinabox commented 3 years ago

Path literals can also have features like making it easier to write Windows paths when you have to

Hasn't Windows accepted / or \ since like Windows XP or something?

StefanKarpinski commented 3 years ago

I thought there were situations where you need to use \ such as when specifying a drive? If not, then we can just require /.

ararslan commented 3 years ago

UNC drives on Windows require \. There may be other cases as well, though / works most of the time.

StefanKarpinski commented 3 years ago

The path literal approach is very flexible: if a path literal starts with a valid UNC drive sequence, then it can allow single backslashes in the rest. Another reason we may want to allow p"C:\path\to\blah" syntax is that it matches what gets printed in a lot of places, including ones that we don't control. Another thing to consider is that it may be fine to not have any escape syntax in path strings: putting sequences that require escapes in paths is very rare and if you really want to do it, you can always interpolate a string.
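A sketch of the "no escape syntax" idea, relying on the fact that Julia string macros receive their text essentially unprocessed, as with raw"...". SketchWinPath and the winpath_str macro are hypothetical names used only for this example.

```julia
# Hypothetical wrapper type and literal macro, for illustration only.
struct SketchWinPath
    location::String
end

# String macros receive their text essentially unprocessed (as with raw"..."),
# so "\p", "\t" and "\b" below are not treated as escape sequences.
macro winpath_str(s)
    :(SketchWinPath($s))
end

literal = winpath"C:\path\to\blah"

# Values that need unusual characters can still be built from an ordinary string.
from_string = SketchWinPath(string("C:\\path\\to\\", "blah"))
```

The literal matches what Windows tools print, while the ordinary-string route stays available for the rare path that needs escapes.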

c42f commented 3 years ago

I've been working with abstracting data location recently (see DataSets.jl) and I've noticed anew that there's a really big difference in the genericity of relative vs absolute path types.

However I'd observe that portable code likely gets the path root from somewhere programmatically and rarely needs absolute paths. From this point of view, a relative path literal would be fine, especially if it could incorporate a few things like tilde expansion.

Alas, doing away with absolute path literals is not going to satisfy anyone who wants to write a quick script unless we've got a compelling replacement. For system dependent stuff, perhaps we could have winpath"C:\foo\bar" and posixpath"/foo/bar" etc.

What can you do with an abstract absolute path?

In generic code which takes AbstractAbsolutePath, you can

But other than that, I don't think it's clear what you can do! There are some other contenders for generic verbs, but they have their problems

If you think about open(path) long enough, you'll realize another problem: there's more than one way to reflect an abstract resource into the program as a Julia type. Even for normal file paths, you have the options in open() vs mmap(). In DataSets.jl, I'm experimenting with open(T, path) (where path::DataSet) when trying to attack this problem.

vtjnash commented 2 years ago

I think the suggested path forward here is https://github.com/rofinn/FilePaths.jl. We aren't going to make breaking changes to the file system functions in base to stop using string, and I think it does make the most sense for the primary API to be strings, but with the option for the user to layer a more advanced type on top (particularly for more complex cases such as non-local resources)

But this could be a discourse post or discussion here, if we want to continue with the julep proposal written here.