dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.47k stars 4.76k forks source link

Directory and file exists methods could be more efficient #22219

Closed JeremyKuhne closed 4 years ago

JeremyKuhne commented 7 years ago

Related to dotnet/runtime#21678

Exists methods have a try-catch for error cases and they normalize, which is not necessary (at least on Windows). As all error cases return false, we can remove most of this overhead.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L181-L210

The only pre-emptive check we might want to keep from GetFullPath is to return false if the passed in path contains an embedded null.

Maybe have a FileSystem.FileExistsFast? This is the current Windows one:

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L181-L210

Unix is already in much better state. I assume that crappy and partial (relative) paths will work.

Tests for bad and relative paths are critical, need to validate existing coverage and expand if necessary.

kellypleahy commented 7 years ago

OK... Now I'm massively confused. The reason I got started on this was dotnet/runtime#21678 which started from another issue https://github.com/dotnet/corefx/pull/19716#discussion_r116282222 which started because checking that file exists || directory exists is inefficient. However, looking at this code, it seems to return true for File.Exists when that "file" is a directory.

FileSystem.FileExists does not appear to work that way in the current codebase, nor does FileSystem.DirectoryExists. So, what gives?

kellypleahy commented 7 years ago

Hmm... maybe that comment is just wrong here: https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L176-L177

JeremyKuhne commented 7 years ago

OK... Now I'm massively confused

Sorry :)

it seems to return true for File.Exists when that "file" is a directory

Shouldn't.

Hmm... maybe that comment is just wrong here:

Looks like it. :)

kellypleahy commented 7 years ago

OK... walking back from the ledge now. Let's see... so.

Are you saying you recommend File.Exists and Directory.Exists directly use FileSystem.Current.FileExists and FileSystem.Current.DirectoryExists as those generally do the right thing and seem like they don't throw exceptions? I forgot, they do seem to require a full path though, right, or is that not generally true? I was wondering about the removing of the trailing \ in DirectoryExists and whether that should really be done or not in FileSystem. Have you seen that bit? I'd be afraid to change that at this point, obviously, as it likely affects a LOT of stuff upstream.

JeremyKuhne commented 7 years ago

This is related to the other (#19717) in that checking for existence period isn't efficient. You have to check twice (#19717) when you don't care if the existing path is a file or directory, which is a pretty common scenario.

Underlying that scenario is this one- if you do care if it is a file or directory, the existing methods do way more work than they need to, notably in Windows. We pre-normalize the path, but the Windows API does that anyway when you get the attributes (which I presume is also the case for Unix). The current FileSystem impl throws, then we catch and turn it to false- which is also inefficient- allocating strings and exceptions then throwing them away. :(

The Unix impl may actually benefit from pre-normalization as it calls multiple APIs. We'd have to validate.

We don't want to change the existing FileSystem.Current.FileExists until we've vetted that nobody cares about the exception state. Simpler to start with a new method and transition a bit at a time I think.

Trimming the trailing separator is still something we'd want to keep.

If you don't want to do this, that's fine, but I'm more than happy to help you through it if you do. I've got enough experience here that I can pretty confidently review your changes. :)

kellypleahy commented 7 years ago

I'd be happy to tackle this one. questions...

It seems like the "common" function I want is something like a (not yet existing) FileSystemEntryType FileSystemEntry.Lookup(string path) with

enum FileSystemEntryType { DoesNotExist, File, Directory, ...? }

that doesn't ever throw on anything but rather just returns DoesNotExist if you don't have privileges to see it, for instance. It also seems like we'd want to use this in CoreCLR for the (not yet existing) Path.Exists. Should I just start with building something like this internal on FileSystem and then figure we'll move it later to CoreCLR as the time permits and api review is completed for Path.Exists and dotnet/runtime#21678?

Also, it seems like we need to follow symlinks in this one, right? Just want to confirm.

JeremyKuhne commented 7 years ago

DoesNotExist if you don't have privileges to see it, for instance.

I've always been conflicted about how to represent the concept in a way that is easy to understand. CanNotDetermine or perhaps both NotFound and AccessDenied? In my old team I actually needed to know whether it was a rights issue so we could request the information from a higher-privilege process- it might be nice to have that available for a future API (perhaps the Path.Exists, even!).

Should I just start with building something like this internal on FileSystem

That is what I would do- much easier to iterate.

Also, it seems like we need to follow symlinks in this one, right?

Yeah, as that is the default historical (Windows) behavior.

kellypleahy commented 7 years ago

@JeremyKuhne I've taken the first steps in addressing this one. Can you have a look over the commit referenced above (kellypleahy/corefx@1a3842c) and see if this is along the lines you were thinking? My general philosophy was to pass through all errors that were easily captured from the underlying system (for free) to the caller as enum values. The enum values that are consistent between the two OSes are reused - some can only come from one or the other OS (for example, Socket can only happen on *nix while InvalidDrive can only happen on Win32).

I haven't done anything to use this new function yet. I assume we'd use it in File.Exists and Directory.Exists as well as the places where the other ticket (#19717) recommends eliminating the double-check for File.Exists(...) || Directory.Exists(...).

Next steps - I don't know how best to test some of these cases. Some of the tests should be straightforward, but for some of them I need to create a path that has a folder in it that is not accessible to the user running the LookupEntry function. I'm not sure if that's possible with the current test structure we have in corefx. Do you know if that's possible to change user during the test and verify things using that? Other cases require a valid and invalid symlink, as well as a cycle of symlinks.

cc @karelz re: testing above ^

kellypleahy commented 7 years ago

also,... might be good to assign this one to me so nobody else grabs it ;)

karelz commented 7 years ago

Your wish was granted 😆, it's now assigned to you. (BTW: Thanks for your help!)

Re: Testing - I don't think we have abilities to switch users in test runs. It is very corner-case scenario and our infra was focused on mainline scenarios so far. I let @JeremyKuhne advice how best to go from here. Maybe having manually-run tests with special runner / prep step (which would create dirs on another account)? Maybe just one-off manual testing?

JeremyKuhne commented 7 years ago

is along the lines you were thinking?

It is a good start. Some more outstanding stuff (to make sure they're in your todo list):

  1. FillAttributeInfo: Remove tryagain. We should only ever make a single attempt at Interop.Kernel32.GetFileAttributesEx, no reason to use FindFirstFile.
  2. Remove the try/catch in File/Directory exists as there should never be an exception after (1)
  3. Remove the normalization (Path.GetFullPath()) and replace with an embedded null check like Path.GetFullPath does.

My general philosophy was to pass through all errors that were easily captured from the underlying system

I think that is fine, as long as we don't surface them in our public API, as they don't align nicely cross-plat. Passing them back internally like this may make it easier to modify/add new API that depends on this helper. I would also make sure that .Directory and .File are set to match what Exists returns- that means either merging the other FileTypes or turning this into a Flags enum. Don't want people accidentally changing the semantics of what a "file" is. I'm currently thinking merging is better than Flags.

Do you know if that's possible to change user during the test and verify things using that?

Afaik we haven't written anything to do this yet. A simpler answer might be to remove file rights (but not ownership), then restore them? We might also be able to find OS files that we can't hit? (Don't know any off the top of my head.)

From this point I'd move towards an official PR for discussion after you make more changes- you can mark it as [WIP] if we're still hashing things out.

kellypleahy commented 7 years ago

From this point I'd move towards an official PR for discussion after you make more changes- you can mark it as [WIP] if we're still hashing things out.

OK cool. How do I "mark it as WIP"? Do I just put [WIP] in the PR title?

JeremyKuhne commented 7 years ago

Do I just put [WIP] in the PR title?

yep

JeremyKuhne commented 7 years ago

@kellypleahy Just a heads up- I've got to make some changes around this area for UAP work. dotnet/corefx#21149 changes the core FillAttributes- I'll drop mentions of any other changes that cross paths with this issue.

kellypleahy commented 7 years ago

Thanks, I got a bit distracted but will be jumping back in this weekend. I can always rebase when you are done with your changes.

On Fri, Jun 16, 2017, 16:01 Jeremy Kuhne notifications@github.com wrote:

@kellypleahy https://github.com/kellypleahy Just a heads up- I've got to make some changes around this area for UAP work. dotnet/corefx#21149 https://github.com/dotnet/corefx/pull/21149 changes the core FillAttributes- I'll drop mentions of any other changes that cross paths with this issue.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/dotnet/corefx/issues/20876#issuecomment-309158174, or mute the thread https://github.com/notifications/unsubscribe-auth/AADP2NIbJZNl-Mw7Dd4n6TaVc-WTQfofks5sEwk_gaJpZM4N1rFh .

kellypleahy commented 7 years ago

I have been very busy lately and haven't been able to do anything additional on this. I'm hoping to get back into it in a week or so. If you deem it important to get started earlier, please feel free to reassign - I don't want to delay anyone who is waiting on this.

karelz commented 7 years ago

No worries. If someone wants to grab it, they can speak up. It is in Future milestone, which means we won't look at it ourselves anytime soon.

danmoseley commented 7 years ago

Happy to reassign if you free up...

danmoseley commented 6 years ago

Costed at 3-4 days if @JeremyKuhne does it. Depends on dotnet/corefx#25539 to add Span overloads first.

niemyjski commented 6 years ago

I'd really love to know the perf numbers of the File.Create, File.Exists and File.Delete operations cross platform. In our app (Exceptionless), we've noticed better performance on on windows and we are trying to track it down. Currently we are looking to replace all unneeded operations in our storage provider and just catch exceptions. https://github.com/FoundatioFx/Foundatio/blob/master/src/Foundatio/Storage/FolderFileStorage.cs

JeremyKuhne commented 6 years ago

@niemyjski Exists isn't horrible- you'd have to call it an awful lot to get a noticeable hit.

Directory enumeration is the most expensive API we have- a good place to focus your investigation. I recently did a big update for Windows with enumeration perf and I'm looking to provide high-performance extension points cross plat. dotnet/corefx#25873

If you get some measurements of where your app is spending time in System.IO that you think can be improved feel free to open issues and loop me in.

danmoseley commented 6 years ago

@Anipik this is one you could look at when you have finished other tasks and have free time later. @JeremyKuhne can give more details.

niemyjski commented 6 years ago

I forgot to post this on here when I was doing profiling work some time ago.

screen shot 2017-12-06 at 8 14 33 am

JeremyKuhne commented 6 years ago

@niemyjski I made significant improvements to FillAttributeInfo. The marshaling involved now uses blittable structs and it only falls back to FindFirstFile in the needed error case. Some of the marshaling changes are in 4.7.2.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/FileSystem.Windows.cs#L191

The key place the existing code is still bad is that it try/catches for non-existence. That is the first thing we should try to remove here. Additionally we can look at removing the normalization (e.g. Path.GetFullPath). The OS doesn't need us to do that work and we've removed almost all of the pre-checks from that code. If we do remove the GetFullPath() we would need to allow for the fact that the p/invokes assume normalization has occurred. That entails trimming trailing periods, spaces, and separators if the path isn't a device path. If it is a device path just the trailing separator has to be removed.

JeremyKuhne commented 5 years ago

Not as big of an issue as it used to be. Closing.