dazinator / DotNet.Glob

A fast globbing library for .NET / .NETStandard applications. Outperforms Regex.
MIT License

Characters missing from list of allowable path characters #52

Closed (cocowalla closed this issue 6 years ago)

cocowalla commented 6 years ago

The readme says:

By default, when your glob pattern is parsed, DotNet.Glob will only allow literals which are valid for path / directory names. These are:

Any Letter (A-Z, a-z) or Digit
., space, !, #, -, ;, =, @, ~, _, :

Maybe I'm misunderstanding this section, but on all of Windows, Linux and macOS, lots of other characters are valid in file system paths, such as:

Also, on Windows, : is not valid in a file name.

dazinator commented 6 years ago

Yeah, you are right. This needs sorting.

Firstly, the doc should not say just A-Z, a-z or Digit, because the check it actually does is Char.IsLetterOrDigit(). So in your example of 你好! the first two characters return true for the IsLetterOrDigit check, so they are currently allowed.

The last character, however, ! is categorised as Unicode punctuation and so fails the current check for a valid literal, but yes, it is valid in a filename.
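To make the distinction concrete, here is a rough sketch in Python (not the library's C# code). It assumes the check is equivalent to testing the Unicode general categories that .NET documents for Char.IsLetterOrDigit, namely the letter categories plus decimal digits. The helper name is_letter_or_digit is made up for illustration. The full-width exclamation mark is included only as an assumption about what the original example may have contained.

```python
import unicodedata

# Categories treated as "letter or digit": Lu, Ll, Lt, Lm, Lo (letters)
# and Nd (decimal digits). This mirrors the documented behaviour of
# .NET's Char.IsLetterOrDigit; it is an approximation, not the real API.
LETTER_OR_DIGIT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}


def is_letter_or_digit(ch: str) -> bool:
    """Hypothetical stand-in for Char.IsLetterOrDigit on one character."""
    return unicodedata.category(ch) in LETTER_OR_DIGIT


# ASCII "!" and full-width "\uff01" are both punctuation (category Po).
for ch in ["你", "好", "!", "\uff01", "_", "-"]:
    print(repr(ch), unicodedata.category(ch), is_letter_or_digit(ch))
```

The two CJK characters fall in a letter category and pass, while every punctuation character is rejected unless it is separately whitelisted.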

However, taken in a wider context, I don't think this "allowed literal characters" limitation is really helping much, and I think I should just remove it. Like you say, the set of allowed characters differs per platform, and that's not something I really want to get into.

This originally evolved because I wanted to identify whether a character was a literal, and I thought that with a small subset / array of allowed characters I could check that pretty quickly. But actually the better approach seems to be to parse for literals last, after checking for other kinds of tokens, and then assume that since the character is not any other kind of token, it must be a literal. This way you only need to establish that the next character is not another token, rather than that it is in a list of known good literal characters. The two checks are roughly the same performance-wise.
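The "literals last" idea can be sketched roughly as below. This is an illustrative Python tokenizer, not DotNet.Glob's actual parser. The token names, the SPECIAL set and the tokenize function are all assumptions, and it only handles a simplified glob syntax (*, **, ?, [...] and path separators).

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); the kinds are hypothetical names

# Characters that start a non-literal token in this simplified syntax.
SPECIAL = "*?[/\\"


def tokenize(pattern: str) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            # Check the longer "**" token before the single "*".
            if pattern.startswith("**", i):
                tokens.append(("any_dirs", "**"))
                i += 2
            else:
                tokens.append(("wildcard", "*"))
                i += 1
        elif ch == "?":
            tokens.append(("any_char", "?"))
            i += 1
        elif ch in "/\\":
            tokens.append(("separator", ch))
            i += 1
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                raise ValueError("unterminated character class")
            tokens.append(("char_class", pattern[i:end + 1]))
            i = end + 1
        else:
            # Literals are handled last: nothing above claimed this
            # character, so it is a literal. No allow-list is consulted.
            start = i
            while i < len(pattern) and pattern[i] not in SPECIAL:
                i += 1
            tokens.append(("literal", pattern[start:i]))
    return tokens


print(tokenize("你好!*.txt"))
```

With this shape, any character becomes part of a literal whether or not it is valid on a given platform, so there is no per-platform list to maintain.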

So here is what I think I should do:

  1. Drop the set of allowed literal characters.
  2. Remove the option AllowInvalidPathCharacters.

The default behaviour will then be that a character is assumed to be a literal if it isn't parsed as any other token first, which is how AllowInvalidPathCharacters = true behaves today.

cocowalla commented 6 years ago

Yep, treating anything that isn't a special character as a literal makes sense to me.

dazinator commented 6 years ago

Thanks @cocowalla, this will be released in 2.1.0.