Make `ConvertFrom-StringData` more error tolerant

PowerShell / PowerShell

PowerShell for every system!

https://microsoft.com/PowerShell

MIT License

43.66k stars 7.08k forks source link

Make `ConvertFrom-StringData` more error tolerant #23790

Open CrendKing opened 2 weeks ago

CrendKing commented 2 weeks ago

Summary of the new feature / enhancement

Currently ConvertFrom-StringData immediately throws exception if it encounters an invalid data line or duplicate key. Is it a good idea to add an option so that ConvertFrom-StringData would ignore and skip the erroneous line and continue? This is different to something like -ErrorAction SilentlyContinue, which skips the whole command.

For example, if we use curl to send a HEAD request, we get headers back, which is in the ": " form. However, it also returns the HTTP response code line at the top:

> curl --silent --head https://www.example.com
HTTP/2 200
content-encoding: gzip
accept-ranges: bytes
age: 272952
...

Piping it to ConvertFrom-StringData -Delimiter ':' results "ConvertFrom-StringData: Data line 'HTTP/2 200' is not in 'name=value' format." Of course we can Select-Object -Skip 1 to workaround in this case, but such workaround is not always possible (e.g. invalid data appear in an undetermined line).

Implementation wise, just use the option to continue instead of throw at https://github.com/PowerShell/PowerShell/blob/0f3fd2077cfd63b298ce7f88babe041f4dc6849e/src/Microsoft.PowerShell.Commands.Utility/commands/utility/ConvertFrom-StringData.cs#L68 and https://github.com/PowerShell/PowerShell/blob/0f3fd2077cfd63b298ce7f88babe041f4dc6849e/src/Microsoft.PowerShell.Commands.Utility/commands/utility/ConvertFrom-StringData.cs#L78.

Proposed technical implementation details (optional)

One can write a function to simulate the feature:

function ConvertFrom-PossiblyBadStringData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)]
        $str
    )

    begin { $retValue = @{} }
    process {
        $str -split "`n" | ForEach-Object {
            try {
                $lineData = ConvertFrom-StringData $_
                foreach ($entry in $lineData.GetEnumerator()) {
                    if (!$retValue.ContainsKey($entry.Key)) {
                        $retValue.Add($entry.Key, $entry.Value)
                    }
                }
            } catch {
            }
        }
    }
    end { $retValue }
}

@'
a = 1
b = 2
ccccc
d = 3
a = 4
'@ | ConvertFrom-PossiblyBadStringData

Output:

Name                           Value
----                           -----
b                              2
a                              1
d                              3

rhubarb-geek-nz commented 2 weeks ago

It is interesting that you are using an HTTP response as your example, In HTTP protocol you can have multiple entries for the same header which would be incompatible with putting in a hash-table because you can't have duplicate keys.

(Invoke-WebRequest -Uri https://github.com/favicon.ico -Method HEAD).Headers

Key                 Value
---                 -----
Server              {GitHub.com}
Date                {Mon, 13 May 2024 01:49:48 GMT}
ETag                {W/"664165de-1976"}
Cache-Control       {max-age=315360000}
Vary                {Accept-Encoding, Accept, X-Requested-With}
X-Frame-Options     {DENY}
Set-Cookie          {_gh_sess=A3nCvWqXhb%2BiY8%2FYndCFdikMaBkl%2B6w1qj33QJ5f6nbd1q2dwHKyzjr7KXGsLjQZvC13xyHbpptRSW38GAb%2BcUfHCxsEBHUUSqSjxcBhgEDuYCO8XsQqDmJdtg%2BNaypPv…
Accept-Ranges       {bytes}
X-GitHub-Request-Id {E8AC:A2755:4F712F9:5686B67:66417331}
Content-Type        {image/x-icon}
Last-Modified       {Mon, 13 May 2024 00:59:10 GMT}
Expires             {Thu, 11 May 2034 01:49:48 GMT}
Content-Length      {6518}

The http response effectively treats header values as dictionary of list of strings.

CrendKing commented 2 weeks ago

It is just an example. The point is that there are data that mostly conforms to the <key> <delimiter> <value> format with slight deviations, and I want ConvertFrom-StringData to be able to return the "mostly conformant" parts instead of throwing outright.

mklement0 commented 2 weeks ago

@CrendKing:

Pragmatically speaking, you could simply filter out lines that aren't of interest along the lines of:

curl --silent --head https://www.example.com | 
  Select-String -Raw : |
  ConvertFrom-StringData -Delimiter :

However, note that the above creates a hashtable for each and every line (of interest) output by curl, and that, in general, stdout output from external (native) programs is streamed line by line to PowerShell's pipeline.

As a result, the above outputs multiple, single-entry hashtables rather than a single hashtable.

To create a single hashtable, you'd have to provide all input lines as a single string, which in the simplest case means piping to Out-String:

curl --silent --head https://www.example.com | 
  Select-String -Raw : |
  Out-String |
  ConvertFrom-StringData -Delimiter :

However, duplicate keys - which, as @rhubarb-geek-nz points out, are a possibility in HTTP headers - then would present a problem - justifiably.

However, you did uncover what to me appears to be a genuine bug (since reported in #23793):

With multiple, individual strings as input, ConvertFrom-StringData mistakenly aborts processing overall, via a statement-terminating error, once an incorrectly formatted string is encountered; the expected behavior would be to merely emit a non-terminating error in that event, and continue processing further inputs:

'foo=bar', 'broken', 'bar=baz' | ConvertFrom-StringData

Currently, the output is:

Name                           Value
----                           -----
foo                            bar
ConvertFrom-StringData: Data line 'broken' is not in 'name=value' format.

That is, overall processing was unexpectedly aborted when the broken input string was encountered.

The output should be:

Name                           Value
----                           -----
foo                            bar
ConvertFrom-StringData: Data line 'broken' is not in 'name=value' format.
bar                            baz

That is, processing should have continued after the incorrectly formatted input string (resulting in two hashtables overall).

Also, 2>$null or -ErrorAction SilentlyContinue / -ErrorAction Ignore should be effective in silencing / ignoring any input-string-specific error messages, which currently does not work, because the error is unexpectedly statement-terminating rather than non-terminating.

CrendKing commented 2 weeks ago

Thanks. Please be understanding that I only used curl and HTTP as an example because it is common. If it is causing confusion, I can replace the example with things like output from ffmpeg/ffprobe. Just let me know.

The 'foo=bar', 'broken', 'bar=baz' | ConvertFrom-StringData case is different to my one string use case, but the root cause seems to be the same. If some -ErrorAction would make it continue, that'd be great.

Pragmatically speaking, you could simply filter out lines that aren't of interest along the lines of:

Yes, I could. But why encourage the users to come up with case-by-case filtering that not only spend CPU time to take out data we don't need to begin with at all, while the proposed update to ConvertFrom-StringData achieves exact the same without sacrifice anything? What is the benefit of the current ConvertFrom-StringData behavior?

rhubarb-geek-nz commented 2 weeks ago

What is the benefit of the current ConvertFrom-StringData behavior?

Strictly enforcing data formats improves data quality throughout a system. If you have a relaxed system that accepts anything but ignores it all then you will have trouble later on finding out why none of your values made it though the system. A good recent example is ConvertFrom-JSON which previously accepted, without error, data that basically was not JSON, it allowed C++ comments, single quotes instead of double quotes etc, So while your parser may interpret the data you would have rude shock when you pass the data to a 3rd party RESTful service and it rejects it outright as not being JSON.

Stick to the "do one thing well" and have ConvertFrom-StringData do the name value conversions and leave filtering to other components that are better suited and can be customised.

CrendKing commented 2 weeks ago

I agree the default behavior should be strict. I'm talking about adding an option or -ErrorAction to relax it. It is a conscious action from user to use that option, they already know the potential of incomplete conversion.

Another example. Invoke-WebRequest by default throws exception if the response HTTP code is error (400-500). That leaves no room to check the header or content of the response. Fortunately there is -SkipHttpErrorCheck. And when we use that option, we know we should always check for the HTTP code. Imagine if such option never existed.

mklement0 commented 2 weeks ago

@CrendKing, note that even if the true bug gets fixed (which I've since reported separately, in #23793), it won't help your use case:

While you will then be able to use, say, -ErrorAction Ignore to ignore the errors from individual input strings, you'll still get a separate, single-entry output hashtable for each (well-formed) input string.
However, if you pass a single, multi-line string (e.g. curl --silent --head https://www.example.com | Out-String) as the input, that string as a whole will result in a non-terminating error, and no hashtable will be constructed.

So, to spell it out, what you're asking for would require modifying ConvertFrom-StringData to emit non-terminating errors for individual lines in a given, single, multi-line string and to still emit a hashtable if at least one well-formed line was found in the input string (I now see that you're already pointing to the relevant source-code location in your original post).

On a related note, the current behavior of emitting a separate hashtable for each input string is awkward, especially when piping an external program's output directly to ConvertFrom-StringData, which happens line by line (as in your example; with input from a file, Get-Content -Raw is the solution). It would be much more useful to collect all input strings first, and then emit a single hashtable for all of them (the way ConvertFrom-Json does, for instance).

Both changes discussed are technically breaking ones, though the former is more likely to fall into bucket 3, whereas I think the latter, if it gets implemented, would require opt-in in the form of a new parameter to avoid breaking existing code.

CrendKing commented 2 weeks ago

Thanks. Like you said, my original request is to make the function return non-empty Hashtable from single string input when some line can't be converted. I even showed a workaround that break the single string input into lines, try-catch convert each line and merge all single-entry Hashtables into one. It is doable but very awkward.

Also, if this new behavior is protected by a NEW OPTION, for instance ConvertFrom-StringData -SkipErrorLines, I don't think it is backward incompatible change. On the other hand, make ConvertFrom-StringData -ErrorAction Ignore ignore lines or input entry is backward incompatible, even though I believe the current behavior is strict up wrong, since it is not ignoring error at all.

mklement0 commented 2 weeks ago

ConvertFrom-StringData -ErrorAction Ignore ignore lines or input entry is backward incompatible

Well, it is backward-incompatible in the sense that any bug fix is technically a breaking change. The symptom is a side effect of the cmdlet (in effect) creating a statement-terminating error, to which -ErrorAction by (current) design does not apply (see #14819); fixing #23793 will make this symptom go away.

That said, even though it would clearly be a bug fix, it has the potential to break existing code, given that using a try statement would no longer be effective for catching an error - unless -ErrorAction Stop is also used.

a NEW OPTION

Good point, introducing a new switch is a way to avoid backward-compatibility problems, but perhaps the need for that can be avoided if the suggested change is deemed to be a bucket-3 change - but someone would have to do some digging (e.g. analyzing code here on GitHub) to ascertain that.

That said, if fixing #23793 is decided against, a new switch is unavoidable.

I even showed a workaround that break the single string input into lines

Yes, though note that your workaround wouldn't work with curl --silent --head https://www.example.com as input, as each output line would be received separately in the process block; I know it's just a sample command, but the point is that it the problem applies to any external-program call.

To avoid a need for Out-String or similar, you'd have to collect all input strings in the process block (e.g. by building up a multiline string), and process the collected lines in the end block. This relates to additional change I suggested: to make ConvertFrom-StringData perform this collecting itself, though that most likely requires a new switch.