PowerShell / PowerShell


PowerShell extended property `BaseName` for `DirectoryInfo` is inconsistent when there is an `extension` #21553

Closed Hashbrown777 closed 2 days ago

Hashbrown777 commented 2 weeks ago

Steps to reproduce

This occurs on both windows and linux, starting at v5 all the way through to now.
<#System.IO.FileInfo#> | %{ $_.BaseName + $_.Extension } should always be equivalent to <#System.IO.FileInfo#>.Name
For gci -File this holds true.
For gci -Directory, although pwsh correctly has .BaseName always match .Name (folders cannot have extensions..), .Extension incorrectly matches .Name -replace '^.*?(?=\.[^.]*$|$)','' instead of always returning ""

Expected behavior

PS> (New-Item -Type Directory -Name 'bob.steve 123_456[yoyo]').Extension

PS>

Actual behavior

PS> (New-Item -Type Directory -Name 'bob.steve 123_456[yoyo]').Extension
.steve 123_456[yoyo]
PS>

Error details

No response

Environment data

Name                           Value
----                           -----
PSVersion                      7.4.2
PSEdition                      Core
GitCommitId                    7.4.2
OS                             Fedora Remix for WSL
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

image

Hashbrown777 commented 2 weeks ago

$_.Name.Substring($_.BaseName.length) is a succinct way to reliably get the accurate Extension
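
A minimal sketch of that workaround wrapped in a function, so files and directories can go through the same code path. Get-ConsistentExtension is a hypothetical name, not a built-in cmdlet, and the sketch assumes the default BaseName property that PowerShell attaches to FileInfo and DirectoryInfo:

```powershell
# Hypothetical helper (not part of PowerShell): returns the part of Name
# that BaseName does not cover, so BaseName + result always equals Name.
function Get-ConsistentExtension {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [System.IO.FileSystemInfo] $Item
    )
    process {
        $Item.Name.Substring($Item.BaseName.Length)
    }
}

# Casting only builds the objects; nothing is created on disk.
$file = [System.IO.FileInfo] 'foo.bar'
$dir  = [System.IO.DirectoryInfo] 'bob.steve 123_456[yoyo]'

$fileExt = $file | Get-ConsistentExtension
$dirExt  = $dir  | Get-ConsistentExtension
```

With the default type data, $fileExt is '.bar' and $dirExt is the empty string, because the directory's BaseName already equals its Name.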

rhubarb-geek-nz commented 2 weeks ago

Where is the rule defined that a directory cannot have a file extension?

A directory is still a file, and on POSIX systems the file system does not know about extensions at all; a file extension is purely down to an application's interpretation. The Windows file system has been aware of file extensions to a point since 8.3 names, but with NTFS and the ability to have multiple periods in a name, it is not really the role of the file system to make any inferences.

On a Linux system you will find many directories ending with ".d" under /etc and I would consider that a file extension.

So extension is really in the eye of the beholder.

PS> get-childitem .

    Directory: /home/bythesea

UnixMode         User Group         LastWriteTime         Size Name
--------         ---- -----         -------------         ---- ----
drwxr-xr-x     bythesea users      04/30/2024 06:17         4096 foo.bar

PS> get-childItem . | Select-Object -Property Extension,BaseName

Extension BaseName
--------- --------
.bar      foo.bar

Hashbrown777 commented 2 weeks ago

On a Linux system you will find many directories ending with ".d" under /etc and I would consider that a file extension.

Yes, and in those cases the basename would exclude the ".d"; you cannot have it both ways. Also, having ".steve 123_456[yoyo]" as an extension is a far cry from ".d", but instead of asking for a reform of how PowerShell interprets extensions, I just want it to be consistent; if it says directories' basenames are always the same as the name, then the extensions must always be empty.

rhubarb-geek-nz commented 2 weeks ago

Yes, name != basename + extension

PS /etc> get-childitem rc* | Select-Object Name,BaseName,Extension

Name     BaseName Extension
----     -------- ---------
rc0.d    rc0.d    .d
rc1.d    rc1.d    .d
rc2.d    rc2.d    .d
rc3.d    rc3.d    .d
rc4.d    rc4.d    .d
rc5.d    rc5.d    .d
rc6.d    rc6.d    .d
rcS.d    rcS.d    .d
rc.local rc       .local

Hashbrown777 commented 2 weeks ago

Yes, name != basename + extension

And where is that defined? (genuinely, it's not there) I'm pretty sure in every language basename is defined in terms of full path sans the parent directories and any extension. https://linux.die.net/man/1/basename https://www.php.net/manual/en/function.basename.php

Posting more examples of what I'm describing as a bug isn't helping prove anything?

Most languages either (a) don't provide a basename, allowing the application to be the beholder, as it were, or (b) allow said application to provide the suffix explicitly. PowerShell has taken it upon itself to interpret the extension, and if the programmer decides to use this value, the API needs to be self-consistent.

rhubarb-geek-nz commented 2 weeks ago

Yes, name != basename + extension

And where is that defined?

Sorry, I was just confirming what I was seeing, that the basename still had the .d on the end

Hashbrown777 commented 2 weeks ago

I was just confirming what I was seeing

Ah, I thought that was an offered explanation; "Yes, [because] name isn't.." not "Yes, [I see in pwsh] name isn't.."

But seriously, it's bizarre I can't find doco on .Basename..

rhubarb-geek-nz commented 2 weeks ago

This occurs on both windows and linux, starting at v5 all the way through to now.

This may make it hard to make any change other than document the behaviour. We don't have any idea how much existing code is relying on the existing behaviour.

Hashbrown777 commented 2 weeks ago

existing code is relying on the existing behaviour.

Don't let pwsh be another cmd.exe; we'll see what the devs say. No one would be relying on this, they'd be compensating for it. That's kind of why we have a v7 in the first place: people who write code stuck in time stay on 5.

237dmitry commented 2 weeks ago

$ gi ./foo.bar/ | ft Name, BaseName, Extension

Name    BaseName Extension
----    -------- ---------
foo.bar foo.bar  

$  gi ./foo.bar | ft Name, BaseName, Extension 

Name    BaseName Extension
----    -------- ---------
foo.bar foo.bar  .bar

Hashbrown777 commented 2 weeks ago

That's an interesting observation, and unfortunately not leverageable as a workaround (like for the example screenshot) :(

rhubarb-geek-nz commented 2 weeks ago

$ gi ./foo.bar/ | ft Name, BaseName, Extension

That's an interesting observation, and unfortunately not leverageable as a workaround

The Microsoft build tools keep the trailing slash on directory names, so you don't need to append it when constructing full paths, eg

   <FilesToDelete Include="$(PublishDir)$(AssemblyName).deps.json" />
   <FilesToDelete Include="$(PublishDir)$(AssemblyName).pdb" />

It has a couple advantages

However you do need to look for both '/' and '\'

Hashbrown777 commented 2 weeks ago

I'm unsure how that's relevant to an example that just wants to treat directories and files the same and have the output be predictable, nothing's trying to append paths or fetching items directly using known paths.

For instance, trying to "clone" a directory by making symlinks whilst injecting into the new names:

gci 'version1/' | %{ ln -s $_ "export/$($_.BaseName).version1$($_.Extension)" }

Knowing gi $directPathWithSlash changes $_.Extension won't help because "$_" is a result of a search, not picking up a specific path. Using $_.Name.Substring($_.BaseName.length) in place of $_.Extension does function though.

jborean93 commented 2 weeks ago

But seriously, it's bizarre I can't find doco on .Basename..

BaseName is not a property on the DirectoryInfo/FileInfo types in .NET but part of an ETS member added by PowerShell. You can see that on PowerShell 7 it's a ScriptProperty that is simply an alias for Name for DirectoryInfo and a more complex script property for FileInfo

PS /home/jborean> Get-Item $pwd | Get-Member -name BaseName

   TypeName: System.IO.DirectoryInfo

Name     MemberType     Definition
----     ----------     ----------
BaseName ScriptProperty System.Object BaseName {get=$this.Name;}

PS /home/jborean> Get-Item $PSHome/pwsh | Get-Member -Name BaseName

   TypeName: System.IO.FileInfo

Name     MemberType     Definition
----     ----------     ----------
BaseName ScriptProperty System.Object BaseName {get=if ($this.Extension.Length -gt 0){$this.Name.Remove($this.Name.Length - $this.Exte…

You can also use Get-TypeData to see that the BaseName property is set on the type and not just manually added to the instance by Get-Item

PS /home/jborean> (Get-TypeData System.IO.FileInfo).Members.BaseName

GetScriptBlock                                                                                                   SetScriptBlock IsHidden
--------------                                                                                                   -------------- --------
if ($this.Extension.Length -gt 0){$this.Name.Remove($this.Name.Length - $this.Extension.Length)}else{$this.Name}                  False

PS /home/jborean> (Get-TypeData System.IO.DirectoryInfo).Members.BaseName

GetScriptBlock SetScriptBlock IsHidden Name
-------------- -------------- -------- ----
$this.Name                       False BaseName

mklement0 commented 2 weeks ago

Let me attempt a summary:

To add to @jborean93's comment re discovery of ETS members: Get-Member -View Extended shows all ETS members associated with a given instance, both instance-level ETS members (created ad hoc) and type-level ones (created via .types.ps1xml files or calls to Update-TypeData) (you won't be able to tell from Get-Member's output whether the members are instance- or type-level).
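
Because BaseName is a type-level ETS member, it can in principle be redefined for the current session with Update-TypeData. The following is only a sketch of what aligning DirectoryInfo with the FileInfo definition might look like; it is not what PowerShell ships, and existing scripts may depend on the current behavior:

```powershell
# Sketch only: session-scoped override so that DirectoryInfo.BaseName strips
# the extension the same way the FileInfo script property shown above does.
# -Force is needed because a BaseName member is already defined for the type.
Update-TypeData -TypeName System.IO.DirectoryInfo -MemberName BaseName `
    -MemberType ScriptProperty -Force -Value {
        if ($this.Extension.Length -gt 0) {
            $this.Name.Remove($this.Name.Length - $this.Extension.Length)
        } else {
            $this.Name
        }
    }

# Casting builds the object without touching the file system.
$dir = [System.IO.DirectoryInfo] 'rc0.d'
$roundTrip = $dir.BaseName + $dir.Extension
```

After the override, BaseName + Extension reproduces Name for directories as well, at the cost of changing what .BaseName returns for any directory whose name contains a period.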

Hashbrown777 commented 2 weeks ago

Do you know why non-alphanumeric characters, such as spaces and brackets, are permitted to form the .Extension? This won't help my issue, but it seems equally baffling to me.

rhubarb-geek-nz commented 2 weeks ago

Do you know why non-alphanumeric characters, such as spaces and brackets, are permitted to form the .Extension?

I suggest that the definition of file extension is really simple and is just what follows the period in a file name. Even that definition is ambiguous if there are multiple periods. Given there are no restrictions on what may be part of the stem, likewise there does not need to be any restriction on the extension.

A common example in the Microsoft world is using tildes at the end of filenames that are temporary.

Hashbrown777 commented 2 weeks ago

I think that's absurd, though. Consider Mr. Rhubarb.docx and a 'valid' extension being . Rhubarb.docx or basically any hidden, extensionless file on linux literally not even having a name at all and the whole filename is the extension (eg .bashrc, whereas bashrc is the name and it has no extension, it's just hidden).

[regex]'(?<=.)(\.[a-zA-Z0-9_]+|)~?$' I think is what I have in my head (pseudocode.. I mean, Unicode will kill that if there are any extensions out there), but I'm wondering whether [\s()[\]] and other special characters are being used anywhere.

rhubarb-geek-nz commented 2 weeks ago

The concept of valid extension does not exist. There are valid characters in a filename, and the concept that whatever follows the [last] period is the extension. Then there is also the historical definition of an 8.3 filename.

Sure there are common extensions, and there are extension mappings listed in the registry. Applications can register what they want.

Have a look at

 Get-ChildItem HKCU:Software\Classes | Select-Object PSChildName

The general idea behind extensions is it helps you know how to handle files, whether you can or not. If you don't recognize an extension that is absolutely fine, it means you don't know how to handle the file.

rhubarb-geek-nz commented 2 weeks ago

In UNIX, case is also important to certain applications. For instance, C++ compilers treat lower-case ".c" as a C file and ".C" as a C++ file. But that interpretation is down to the applications; there is no governing body allocating valid file extensions or how to interpret them.

Historically Apple (of course Apple) had a TYPE/CREATOR registry. The original Macintosh had no concept of file extensions and the type of file was held in the directory entry for the file. E.g. TEXT was a text file, PICT was an image, APPL was an application program etc. The equivalent of the Windows extensions mapping was why the Finder was called the Finder. It found the appropriate application for a file based on TYPE and CREATOR. You were supposed to apply to Apple for approval and to register your type and creator.

rhubarb-geek-nz commented 2 weeks ago

Go to a Windows command prompt and type

DIR *.*

then do the same in PowerShell

In the original command prompt, *.* will list all files, whether they have a period in the name or not, because that is how it worked on CP/M.

rhubarb-geek-nz commented 2 weeks ago

@rhubarb-geek-nz that has nothing to do with extensions...

Sorry, I am lost now. I don't know what you are wanting to achieve. If you are wanting to find the last period in a name then all you need is System.String.LastIndexOf rather than a regular expression.
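
A minimal sketch of the LastIndexOf approach, for illustration. Split-NameAtLastPeriod is a hypothetical helper name, and the sketch works purely on the name string; it makes no claim to match .NET's Extension property on every platform or edge case:

```powershell
# Hypothetical helper: splits a name at its last period using
# System.String.LastIndexOf, with no regular expression involved.
function Split-NameAtLastPeriod {
    param([Parameter(Mandatory)] [string] $Name)

    $index = $Name.LastIndexOf([char] '.')
    if ($index -lt 0) {
        # No period at all: the whole name is the stem.
        [pscustomobject]@{ Stem = $Name; Extension = '' }
    } else {
        # The period itself is kept as part of the extension.
        [pscustomobject]@{
            Stem      = $Name.Substring(0, $index)
            Extension = $Name.Substring($index)
        }
    }
}

$docx   = Split-NameAtLastPeriod -Name 'foo. bar.docx'
$plain  = Split-NameAtLastPeriod -Name 'foo'
$hidden = Split-NameAtLastPeriod -Name '.bashrc'
```

Note that under this rule a name such as '.bashrc' ends up with an empty stem and the whole name as its extension, which is the case debated above.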

Hashbrown777 commented 2 weeks ago

but I'm wondering whether [\s()[\]] and other special characters are being used anywhere

I don't think mentioning how extensions can be differentiated on letter-case or recognised at all is helpful in this context because those are already catered for in "my expected extension"™ (where we accept those characters and don't care how they're used).

The concept of valid extension does not exist.

I'm saying I don't think there's ever been a use case for wanting spaces, periods, and brackets in the extension, and would like to know if there exists precedent for this.

cmd equating (?<!^)\.\* to (\..\*)? is interesting, considering it does match a file called abcd. def() using dir abcd.*

the concept that whatever follows the [last] period is the extension

I mean to keep this in the realm of extensions, there are .tar.gz and .rar.01 et cetera, but it's arguable that that is a usecase handled by the application to not merely recognise those, but to interpret the raw names themselves, and it's not expected that regular API users would want them lumped together. I view spaces and such in the same light.

System.String.LastIndexOf rather than a regular expression,

The regular expression handles this fine, but I'm just using it as a way to communicate rather than listing conditions in english, which would be cumbersome, implementation isn't important.

rhubarb-geek-nz commented 2 weeks ago

I'm saying I don't think there's ever been a use case for wanting spaces, periods, and brackets in the extension, and would like to know if there exists precedent for this.

Think mechanism, not policy. The definition of a file extension as everything after the [last] period has worked for around 50 years. If you want to do something more esoteric, then absolutely fine, but put that in a different piece of code. Leave the existing mechanism that works as it is.

mklement0 commented 2 weeks ago

The .NET implementation of the .Extension property is indeed very simple:

Examples:

([System.IO.FileInfo[]] ('foo', 'foo.bar', 'foo. bar.docx', 'foo. bar', 'foo.  ', 'foo.')).Extension |
    % { "[$_]" }

Output:

[]
[.bar]
[.docx]
[. bar]
[.  ]  # on Unix only: on Windows: []
[.]    # on Unix only: on Windows: []

Note the platform differences with respect to 'foo.' and 'foo.  ': on Windows, to avoid creating invalid filenames, these names are reflected as just ...\foo in the .FullName property, which the .Extension property operates on (though, curiously, the .Name property reflects the name as given).

# -> '[][.foo]'
[System.IO.FileInfo[]] '.foo'| % { '[{0}][{1}]' -f $_.BaseName, $_.Extension }

Note that the problem of an empty base name doesn't arise in .NET, as .BaseName is purely a PowerShell (ETS) property.


As for cmd.exe's dir *.* behavior:

While PowerShell's own wildcard patterns indeed only return items whose name contains at least one . (which applies to the -Path, and -Include / -Exclude parameters), the -Filter parameter uses the legacy / system-native wildcard matching; in other words: Get-ChildItem -Filter *.* exhibits the same matching behavior as cmd /c dir *.*
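
The difference can be observed with a throwaway directory. This is a sketch; the directory and file names are made up for the demonstration:

```powershell
# Create a scratch directory with one name that has a period and one without.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString('N'))
$null = New-Item -ItemType Directory -Path $tmp
$null = New-Item -ItemType File -Path (Join-Path $tmp 'noperiod')
$null = New-Item -ItemType File -Path (Join-Path $tmp 'with.period')

# -Filter uses the legacy matching: *.* also matches names without a period.
$viaFilter = @(Get-ChildItem -LiteralPath $tmp -Filter '*.*' |
    ForEach-Object Name | Sort-Object)

# A PowerShell wildcard in -Path requires a literal period in the name.
$viaWildcard = @(Get-ChildItem -Path (Join-Path $tmp '*.*') |
    ForEach-Object Name | Sort-Object)

Remove-Item -LiteralPath $tmp -Recurse -Force
```

Here $viaFilter contains both files, while $viaWildcard contains only 'with.period'.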

Hashbrown777 commented 2 weeks ago

That actually answered a different question I had to boot; 'how can I pass a prospective path to the FS and get it validated/corrected without just trying it and catching an exception?'. I'll have a look at FileInfo casting

mklement0 commented 2 weeks ago

@Hashbrown777, note that a pitfall with casting (which simply translates into a constructor call behind the scenes) is that relative paths are then resolved against the process working directory, which usually differs from PowerShell's; that is, a fully robust cast would have to use [System.IO.FileInfo] (Join-Path (Get-Location -PSProvider FileSystem).ProviderPath 'foo.txt') in order to be correctly resolved against PowerShell's current file-system provider location; if you're willing to assume that PowerShell's current location is a file-system location and that location isn't based on a PowerShell-only drive, [System.IO.FileInfo] "$PWD/foo.txt" will do.
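
Spelled out as a short sketch, with 'foo.txt' standing in for any relative name:

```powershell
# PowerShell's current file-system location, as a native path.
$base = (Get-Location -PSProvider FileSystem).ProviderPath

# Robust: the relative name is anchored to PowerShell's location before the cast.
$robust = [System.IO.FileInfo] (Join-Path $base 'foo.txt')

# Naive: .NET resolves the relative name against the process working
# directory ([Environment]::CurrentDirectory), which may be a different folder.
$naive = [System.IO.FileInfo] 'foo.txt'

$sameFolder = $robust.FullName -eq $naive.FullName
```

$sameFolder is only true when the process working directory happens to coincide with PowerShell's current location, which is not guaranteed after Set-Location.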

rhubarb-geek-nz commented 2 weeks ago

I'll have a look at FileInfo casting

Fortunately the rules of filenames are very simple.

On Windows

PS> [System.IO.Path]::PathSeparator
;
PS> [System.IO.Path]::DirectorySeparatorChar
\
PS> [System.IO.Path]::GetInvalidFileNameChars() | Where-Object { $_ -gt 32 }
"
<
>
|
:
*
?
\
/

And on UNIX

PS> [System.IO.Path]::PathSeparator
:
PS> [System.IO.Path]::DirectorySeparatorChar
/
PS> [System.IO.Path]::GetInvalidFileNameChars() | Where-Object { $_ -gt 32 }
/

And to avoid the mentioned scenario of trailing spaces use System.String.Trim()

Notice the path separator is not an invalid filename character.

But only the file system can tell you if a particular volume/drive/directory is case sensitive or not.

Seeing the code of a Chinese PowerShell project was an eye-opener, where not only the comments were in Chinese, but so were the file names and even function names, and it worked.

mklement0 commented 2 weeks ago

For mere formal path validation, there's also Test-Path -IsValid, but there are two caveats:

Finally, note that Convert-Path can be used to convert a path based on a PowerShell-only drive to the underlying, native file-system path; e.g.: Convert-Path Temp:\

rhubarb-geek-nz commented 2 weeks ago

I note that on Linux

 Get-ChildItem . -Filter '*.*'

implements the Windows file system filtering convention, which is different from, say, ls *.* in bash

My theory is that on Windows it is implemented by FindFirstFileW so the operating system does the filtering, but POSIX opendir/readdir/closedir don't do any filtering so it is implemented in .NET.

 [System.IO.Directory]::GetFiles('.','*.*')

this includes files without the period

mklement0 commented 2 weeks ago

Good point, @rhubarb-geek-nz - I had wrongly assumed that a platform-native system call would be used on Unix-like platforms.

Yes, PowerShell defers to .NET (FindFirstFileW is only used directly by PowerShell in the context of examining reparse points) and PowerShell explicitly requests the Win32 behavior with its legacy quirks even on Unix-like platforms (which I presume .NET offers as self-implemented emulation):

https://github.com/PowerShell/PowerShell/blob/5efd627e91325a0d51df0167bd08609fa85acb56/src/System.Management.Automation/namespaces/FileSystemProvider.cs#L92-L97


The .NET APIs themselves default to the Windows behavior, albeit inconsistently; see:

Specifically, the .MatchType property of System.IO.EnumerationOptions defaults to System.IO.MatchType.Simple (no legacy quirks; *.* only matches names that contain .), but the .Enumerate*() file/directory-enumeration methods that take no System.IO.EnumerationOptions argument default to System.IO.MatchType.Win32
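
A sketch that contrasts the two match types directly through the .NET API. It assumes PowerShell 7 or later, since System.IO.EnumerationOptions is not available on the .NET Framework that Windows PowerShell 5.1 runs on; the file names are made up:

```powershell
# Scratch directory with one name that has a period and one without.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString('N'))
$null = New-Item -ItemType Directory -Path $tmp
$null = New-Item -ItemType File -Path (Join-Path $tmp 'noperiod')
$null = New-Item -ItemType File -Path (Join-Path $tmp 'with.period')

# A fresh EnumerationOptions instance has MatchType = Simple.
$simple = [System.IO.EnumerationOptions]::new()

# Explicitly request the legacy Win32 matching rules.
$win32 = [System.IO.EnumerationOptions]::new()
$win32.MatchType = [System.IO.MatchType]::Win32

$toName = { [System.IO.Path]::GetFileName($_) }
$simpleNames  = @([System.IO.Directory]::EnumerateFiles($tmp, '*.*', $simple) | ForEach-Object $toName)
$win32Names   = @([System.IO.Directory]::EnumerateFiles($tmp, '*.*', $win32)  | ForEach-Object $toName)
# Overload without an options argument, which per the above defaults to Win32.
$defaultNames = @([System.IO.Directory]::EnumerateFiles($tmp, '*.*') | ForEach-Object $toName)

Remove-Item -LiteralPath $tmp -Recurse -Force
```

$simpleNames holds only 'with.period', whereas $win32Names and $defaultNames hold both files.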

SteveL-MSFT commented 1 week ago

It seems that the original issue is that DirectoryInfo has Extension property populated, but this is from .NET Runtime, so if that gets addressed, then PS would reflect that change.

mklement0 commented 1 week ago

It seems that the original issue is that DirectoryInfo has Extension property populated

That's not an issue: it is by - to me sensible - design, and I don't think it will change, nor - in my estimation - should it.

As such, I think the Resolution-External label is inappropriate.


The real issue is the - to me dubious - PowerShell behavior of selectively ignoring the name extension in directory names in the - PowerShell-only - BaseName property.

SteveL-MSFT commented 1 week ago

The real issue is the - to me dubious - PowerShell behavior of selectively ignoring the name extension in directory names in the - PowerShell-only - BaseName property.

  • Not only does it introduce the asymmetry between PowerShell and .NET discussed in the initial post, I'm not aware of an intrinsic justification for it.
  • Based on the analysis above, I'd say that fixing the conceptually flawed PowerShell behavior is still a (desirable) option.

Thanks for calling that out. I would agree that it is inconsistent and a question of whether it's really a bucket 3 or not (and I do see you've done some initial research on this, thanks!). I've updated the title of this issue to reflect the core problem. Will tag for WG to discuss.

SteveL-MSFT commented 3 days ago

WG discussed this. Although we agree that the design is not ideal, it was intentional when it was written and there are likely customers depending on this behavior. If we look at how Unix systems define basename, then the error in behavior is for FileInfo, which should include the extension, as that is part of the filename; instead there should have been something like basenamewithoutextension. As such, we accept that this is by-design and would recommend a doc bug to clarify the difference in behavior for FileInfo vs DirectoryInfo.

mklement0 commented 3 days ago

@SteveL-MSFT, while I can appreciate the concern about breaking things, note that the basename Unix utility is not relevant to this discussion, because it is the equivalent of Split-Path -Leaf and therefore unrelated to extensions.

Just to clarify (which may help with documenting):

microsoft-github-policy-service[bot] commented 2 days ago

This issue has been marked as by-design and has not had any activity for 1 day. It has been closed for housekeeping purposes.
