Open GiggleLiu opened 3 years ago
The maximum filename length is 255
It's actually system dependent. I believe that silently returning false can cause inexplicable bugs.
I suspect this issue isn't really about the maximum filename length on a particular file system, but about the question whether/when isfile
(and friends isdir
, islink
, etc.) should raise an exception in Julia.
The POSIX stat system call can fail for various reasons, including EACCES, EIO, ELOOP, ENAMETOOLONG, ENOENT, ENOTDIR.
The docstrings of the Julia functions isfile and stat are currently completely silent on what result which of these errors lead to.
This may be a problem because the equivalent functions in other languages do not raise an exception, and are in fact frequently used to check if a file exists.
For comparison: the Perl/Bash equivalent -f
does not raise an exception if called with a gigabyte long filename, i.e. both ENOENT and ENAMETOOLONG result in false:
$ perl -E 'say -f ("a" x 1e9)'
In Julia, ENOENT results in false
and ENAMETOOLONG in IOError: stat: name too long (ENAMETOOLONG)
. At a minimum, this should be documented. I can see arguments that more errno
s than just ENOENT
should cause isfile
to return false
.
I tend to agree with @mgkuhn . I recall most of my using cases are about using isfile
to avoid exceptions. It is not expected to return an error just because the name is too long. Just to add one API to fix: ispath
.
Also, another important factor is: is there an easy way to check if the filename is too long?
Looking at the source of stat, this just calls jl_stat which calls uv_fs_stat, none of which say anything about exceptions that they can raise.
POSIX offers the function pathconf to query path-specific limits on the maximum filename and path length permitted by a file system.
Libuv has uv__fs_pathmax_size() to query _PC_PATH_MAX
that way, but apparently no equivalent yet for _PC_NAME_MAX
. Since Julia tends to do all file I/O through libuv, I think libuv would have to be extended first before equivalent functions could be added to Julia. Or is there some package that just imports all POSIX functions (for users interested in such more low-level queries and less concerned about Win32 portability)?
I don't know a Win32 function to query maximum filename length, and limits there tend to be be measured in UTF-16 words, and not in UTF-8 bytes. So "255" can mean different things!
I just tried to attack Pluto server (locally) by inputing a too long filename in the path, Pluto is lucky and this error does not break down the server. Not sure every web application can be so lucky under such an attack. :smirk:
Too many people are assuming isfile
not raising an error.
is there an easy way to check if the filename is too long?
Not sure if that qualifies as “easy”:
istoolong(filename) = try
stat(filename)
false
catch e
if isa(e, Base.IOError)
e.code == -36
else
rethrow()
end
end
istoolong("a"^800)
is there an easy way to check if the filename is too long?
Not sure if that qualifies as “easy”:
istoolong(filename) = try stat(filename) false catch e if isa(e, Base.IOError) e.code == -36 else rethrow() end end istoolong("a"^800)
This is good. I just want to know if people want to have a safe_isfile
, how can they implement one. Thanks
So the question is: in base/stat.jl should the functions
ispath
isfifo
ischardev
isdir
isblockdev
isfile
issocket
issetuid
issetgid
issticky
islink
catch exceptions that stat
/lstat
can raise and return false
instead?
If yes: all exceptions that stat
/lstat
can raise, or only some? Which?
Which exceptions can stat
/lstat
raise at present?
Bump.
EDIT: Below is trivia of MAX_PATH, an issue I want fixed, and then below on NAME_MAX, that I can confirm is 255, practically speaking.
https://insanecoding.blogspot.com/2007/11/pathmax-simply-isnt.html
The flaw is that PATH_MAX simply isn't. [..] Since a path can be longer than PATH_MAX, the define is useless, writing code based off of it is wrong, and the functions that require it are broken.
An exception to this is Windows. It doesn't allow any paths to be created larger than 260 characters. [That's also wrong.]
It sounds strange that such a small limit was chosen, considering that FAT has no such limit imposed, and NTFS allows paths to be 32768 characters long.
Can't we assume at least 32768-long paths allowed on all filesystems/operating systems we want Julia to support?
https://www.gnu.org/software/libc/manual/html_node/Limits-for-Files.html
Macro: int PATH_MAX The uniform system limit (if any) for the length of an entire file name (that is, the argument given to system calls such as open), including the terminating null character.
Portability Note: The GNU C Library does not enforce this limit even if PATH_MAX is defined.
EDIT: I misread the issue here about path length, something I very much want to fix to not be limited to 256/260, 1024 or whatever, which seem possible even on Windows (without the registry hack), but to stay on topic:
The maximum filename length is 255
Seemingly never longer, but sometimes shorter (all exceptions we can Ignore, i.e. 14 in System V, and 8.3 in FAT, i.e. VFAT ok?):
https://www.baeldung.com/linux/bash-filename-limit
BTRFS | 255 bytes exFAT | 255 UTF-16 characters ext2 | 255 bytes [..] FAT32 | 8.3 (255 UCS-2 code units with VFAT LFNs) NTFS | 255 characters
And:
https://discussions.apple.com/thread/555198
Under OS 9 a maximum file name length was 32 characters while under OS X it is 255 characters. However, some applications don't allow file names in the save dialogs that are longer than 32 characters. That is a problem with those applications, not the system.
and:
For example, {NAME_MAX} frequently changes between System V and BSD-based file systems; System V uses a maximum of 14, BSD 255
The obligatory Wikipedia summary post https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
I recently ran into an issue where I got hit with ENAMETOOLONG for a path that is 180 characters. I'm on Linux Mint, btrfs filesystem, which should support <255 filename length just fine. My string did contain special characters, and it wasn't an absolute path. So I'm guessing somewhere or other, my string ballooned to >255 characters? Point is, simple length checks seems brittle. Why not just rely on the error code, and handling it in a way that doesn't surprise users?
To my thinking the path/segment length issue is moot.
isfile()
simply answers the question, "does this path resolve to an existing plain file". If the path is illegal for the specific filesystem then clearly the file cannot exist so it should return false
.
The reason for the failure is not relevant to the purpose of isfile()
, and the fact that it uses a library call that throws an IOError is an implementation detail that should not be allowed to leak.
Instead of trying to test if the path is legal which is "complicated", just try to stat
it and swallow all errors from the filesystem and return false
.
As noted above a number of other similar functions probably should be made consistent.
I'll have a look. There seem to be additional problems with error handling in stat
-related APIs, e.g.
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, ivybridge)
Threads: 1 on 8 virtual cores
julia> stat("/nonexistantfile")
StatStruct for "/nonexistantfile"
size: 0 bytes
device: 0
inode: 0
mode: 0o000000 (----------)
nlink: 0
uid: 0 (root)
gid: 0 (root)
rdev: 0
blksz: 0
blocks: 0
mtime: 1970-01-01T01:00:00+0100 (19548 days ago)
ctime: 1970-01-01T01:00:00+0100 (19548 days ago)
doesn't look like a robust error-handling mechanism either.
Shouldn't stat
instead return nothing
if the requested file does not exist? (Changing that would not seem to be a breaking change, as the current docstring says nothing about error behaviour.)
Shouldn't stat instead return nothing if the requested file does not exist? (Changing that would not seem to be a breaking change, as the current docstring says nothing about error behaviour.)
Wow, that's very surprising. Probably worth raising as a separate bug.
To my thinking the path/segment length issue is moot.
isfile()
simply answers the question, "does this path resolve to an existing plain file". If the path is illegal for the specific filesystem then clearly the file cannot exist so it should returnfalse
.The reason for the failure is not relevant to the purpose of
isfile()
, and the fact that it uses a library call that throws an IOError is an implementation detail that should not be allowed to leak.Instead of trying to test if the path is legal which is "complicated", just try to
stat
it and swallow all errors from the filesystem and returnfalse
.As noted above a number of other similar functions probably should be made consistent.
I mostly agree, except for the case of permission issues. Given a folder /foo which I cannot access, and a file therein called bar, I cannot stat /foo/bar even though it is in fact a file. So really the return type should be Boolean|Missing
Given a folder /foo which I cannot access, and a file therein called bar, I cannot stat /foo/bar even though it is in fact a file.
The permissions you describe are set so you are not supposed to be able to see the file. It would be a security leak if isfile()
indicated the existence (by returning anything other than false) when you are not permitted to see the file.
I understand the security implications, but I don't think returning false
from Julia's isfile()
really helps. For one thing, the user can stat
the file directly to get the error code in question without worrying about what Julia does with it. For another, the error code only says that some directory in the path of the given string does not give the user execute permissions. The user already knows that directory exists (because it was able to list its parent). Ie, If the path in question is /foo/bar/biz
, the user can list the foo
directory, then the user will know that bar
exists. But if the bar
directory does not grant the user permissions to list it, the user gains no info about whether biz
file exists; the error just indicates that some directory is not accessible by the user. So basically missing
is the correct return value, because Julia is unable to confirm or deny the existence of the file, only that it looks like a path as far as it can tell.
True, nothing that uses stat()
can be more secure than stat()
is, so yes bar
must not be searchable by the user, and stat()
will return EACCESS
.
What about all the other reasons that the file cannot be resolved?
On Linux stat()
(Windows error codes are different IIUC) EACCESS
, ELOOP
, ENAMETOOLONG
, ENOENT
, ENOTDIR
all relate to all components of the pathname, but Linux stat()
gives no information which component, isfile()
would have to walk the path to find which, so the status of the file is indeterminate unless that is done, so IIUC your argument all stat()
errors should return missing
. But in all cases as far as the user is concerned the path does not resolve to a file, so isfile()
can also legitimately return false
.
As you say, if the user wants more information they can use stat
directly, so the only reason for isfile()
existing is to just to make simple usage simple, eg:
if isfile(path)
do_something(path) # ignoring potential race, its "simple" after all
end
But returning missing
will mean that simple usage may now throw TypeError
from the if
IIUC.
your argument all stat() errors should return missing.
Not what I was saying. I only meant that stat
returning an empty result for files that don't exist seems like a bug. Sorry for being unclear.
As for isfile
and friends, I was only suggesting EACCESS
map to missing
. ELOOP
, ENAMETOOLONG
, ENOENT
, and ENOTDIR
all seem like they would map to false
IMO.
The other one I'm unsure about is EIO
:
An error occurred while reading from the file system.
It's a bit unclear what that means, but it sounds like it could be raised even if the path does in fact refer to a file. Eg, if it's on a network mount that's not responding, it might or might not be a file and we just don't know.
As for your point about ending up with a TypeError
... yeah that's not great. I guess the three options are: return missing
, return false
, or throw an exception (eg, rethrow the IOError
). I imagine most of the time users would want to just get a false
back from the function, but it could cause confusion in some cases.
My question then is why the special pleading for EACCESS? What is the general use-case for it for simplified functions like isfile()
?
If you return false
for an EACCESS
issue, it is inaccurate and potentially misleading. Consider:
daemon_pid_fname = "..."
isfile(daemon_pid_fname) || spawn_daemon() |> persist_pid
...
If we mask the EACCESS
issue in isfile()
, then we won't notice it until we try to persist the PID of a new daemon (which would presumably throw some IOError
during some open()
call inside persist_pid()
), then in this case we'd end up with an extra daemon.
But why is EACCESS
special, any error from the stat()
inside isfile()
could cause the same problem, a dangling link, a name too long (the OP), and intermittent NFS error etc.
And as my comment in the example alluded to, there is a race in your example, another process (or thread) could create the daemon between the test and the spawn. Simple isfile()
isn't the way to handle that, it should be performed properly using locking.
IMHO complicating the simple to incorrectly handle the complex is the wrong thing to do.
Yeah, ironically my example had misleading characteristics. The concurrency/race condition is really not the point though. The point is misleading the caller that the file does not exist, acting on that false knowledge, but the file did actually exist--we just couldn't access it. If we were discussing is_file_which_does_not_throw_errors()
then sure all IOError
s should be treated the same.
The point remains that if something like ENAMETOOLONG
is thrown, we know for sure that the input argument cannot possibly be a file; but if we get EIO
or EACCESS
then we really don't know. Making assumptions or otherwise pretending to know could be dangerous.
So to recap the options:
false
if and only if we can be sure that the argument does not refer to a file.false
if we have direct evidence it's not a file, or we are not sure.I'm in favor of options 2 (preferred) or 4, but mostly I'm just very against option 1.
isfile(path)
is one of a number of predicates which are convenience queries on the result of a stat()
call. They are all defined as "Return true
if ..., false
otherwise".
Since all they do is test the return of stat()
their only reason to exist is to be convenient. They are all defined as:
($f)(path...) = ($f)(stat(path...))
The OP was that isfile()
unexpectedly threw on name too long, and its clear that the other convenience queries on stat()
will do the same given their equivalent definitions.
The OP commented "I recall most of my using cases are about using isfile to avoid exceptions.", a reasonable expectation since no exceptions are documented for stat()
or the queries, and I suspect most uses of the queries is to avoid the complications of stat()
in simple cases. Since all the queries are racy in complex cases, they should only be used in simple use-cases anyway. For simple uses avoiding exceptions is reasonable.
@mgkuhn to answer your question "If yes: all exceptions that stat/lstat can raise, or only some? Which?"
AFAICT UV error codes are converted to exceptions here for any code other than ENOENT, EINVAL, and ENOTDIR. It will also throw if it detects an inconsistency in the stat()
return, success but no type.
Those conditions seem arbitrary IMO, eg why does EINVAL not throw, but ENAMETOOLONG does, both seem to me to be input parameter errors, and both ENOTDIR and ENAMETOOLONG both seem to be path errors but ENOTDIR does not throw. Since these tests are explicit perhaps someone can comment on why it is that way.
Perhaps a @stat_nothrow_call
is needed that does not convert UV error codes to exceptions so predicates like isfile()
can avoid throwing and return false
when they cannot confirm the truth of their query. Rather than changing the basic stat()
, which should however be documented as to what conditions can throw and what the reasoning for that is.
@nstiurca ok, understand your point, but isfile()
and the other queries are defined as "Return true if (path is a regular file), false otherwise." so you are requesting a feature change (or new functions) that return true
if the test is definitely successful, false
if definitely not successful, and missing
if not sure, is a different question to that raised by the OP and probably should be a separate feature request.
Also returning missing
reintroduces the risk of TypeError
throws from if
when many use-cases of the current predicates are to avoid throwing.
@elextr Soooo... you support option 3 from my last comment as your preferred resolution? It's better than before, particularly if well documented. I can live with that.
As for "feature change"... we're dealing with undefined behavior (the definition doesn't mention exceptions or cases where the call is uncertain. As such, fleshing out the definition isn't technically a breaking change even if it's a change. Point being we can get this change into the next point release if we can get to some consensus on what's the right definition.
The maximum filename length is 255,
isfile
should return false rather than raising an error when the file name is too long.The same applies for
isdir
relevant code: https://github.com/JuliaLang/julia/blob/a163d3e057941a3a9dd0d27a13ac01bce8d8d4f1/base/stat.jl#L311
Julia version: 1.6.0-rc1