[Suggestion] Finding pair pattern

scriptingstudio commented 1 year ago

Using a regex would make PSP files easier to parse. I have not tested complex files, but your example works perfectly.

$psp = @'
<%Param($first,$second)%>
<!doctype html>
<html>
    <head>
        <title>Powershell Server Pages (PSP) demo</title>
        <meta charset="utf-8">
    </head>
    <body>
        <H3>Powershell Server Pages (PSP) demo</H3>
        <hr />
        Hello <% = $first %>,
<%
Write-Output "<br />"
Write-Output $second
%>
        <hr />
    </body>
</html>
'@

[regex]::matches($psp,'<%([^%]+)%>','IgnorePatternWhitespace,Multiline')

Groups   : {0, 1}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 25
Value    : <%Param($first,$second)%>

Groups   : {0, 1}
Success  : True
Name     : 0
Captures : {0}
Index    : 224
Length   : 14
Value    : <% = $first %>

Groups   : {0, 1}
Success  : True
Name     : 0
Captures : {0}
Index    : 241
Length   : 51
Value    : <%
           Write-Output "<br />"
           Write-Output $second
           %>
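
The matches above include the `<%` and `%>` delimiters. As a minimal sketch of pulling out only the code between them, capture group 1 of each match can be read instead. The `$sample` string and the `$blocks` variable are made up for illustration; the pattern is the one used above, without the extra options.

```powershell
# Hypothetical one-line sample; single quotes keep $first/$second literal.
$sample = 'Hello <% = $first %>, <% Write-Output $second %>'

# Group 0 is the whole match including the delimiters,
# group 1 is only the text captured between <% and %>.
$blocks = [regex]::Matches($sample, '<%([^%]+)%>') |
    ForEach-Object { $_.Groups[1].Value.Trim() }

$blocks
```

The same limitation still applies: `[^%]+` stops at the first `%`, so a block that contains a percent sign is not captured.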

MScholtes commented 1 year ago

Hello @scriptingstudio,

I also looked into regular expressions first, but found no solution for cases such as an opening <% without a closing %>, or expressions like

<% Write-Output 5%2 %>

or

<% Write-Output "<% Hello %>"%>

So I decided to write a parser, and since I am parsing anyway, I parse everything. I know it is far less efficient than using regular expressions.
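
To make the two problem cases concrete, here is a rough sketch (not the parser used in the project) that first runs the negated-class pattern and a lazy alternative against them, then shows a small quote-aware scanner. The name `Find-PspBlock` and the lazy pattern are assumptions made for this illustration only.

```powershell
$cases = @(
    '<% Write-Output 5%2 %>'
    '<% Write-Output "<% Hello %>"%>'
)

# Compare the pattern from the first comment with a lazy variant (assumption).
foreach ($case in $cases) {
    [pscustomobject]@{
        Input   = $case
        Negated = [regex]::Match($case, '<%([^%]+)%>').Value
        Lazy    = [regex]::Match($case, '(?s)<%(.*?)%>').Value
    }
}

# Hypothetical helper: scan for <% ... %> and ignore %> inside quoted strings.
function Find-PspBlock {
    param([string]$Text)

    $pos = 0
    while (($start = $Text.IndexOf('<%', $pos)) -ge 0) {
        $quote = $null      # active quote character, $null when outside a string
        $end   = -1
        for ($i = $start + 2; $i -lt $Text.Length; $i++) {
            $c = $Text[$i]
            if ($null -ne $quote) {
                if (($c -eq '`') -and ($quote -eq '"')) { $i++ }   # skip backtick-escaped char
                elseif ($c -eq $quote) { $quote = $null }          # string closed
            }
            elseif (($c -eq '"') -or ($c -eq "'")) { $quote = $c } # string opened
            elseif (($c -eq '%') -and (($i + 1) -lt $Text.Length) -and ($Text[$i + 1] -eq '>')) {
                $end = $i
                break
            }
        }
        if ($end -lt 0) { throw "Opening <% at index $start has no closing %>" }
        $Text.Substring($start + 2, $end - $start - 2)   # emit the code between the tags
        $pos = $end + 2
    }
}

$cases | ForEach-Object { Find-PspBlock -Text $_ }
```

The negated-class pattern finds nothing in the first case and only the inner `<% Hello %>` in the second, while the lazy pattern handles the first case but cuts the second one off at the `%>` inside the string. The scanner returns the full code of both blocks and throws on an unclosed `<%`. It is still a simplification: an apostrophe inside a PowerShell comment, or a here-string, would confuse the quote tracking.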

If you have a good idea, I would appreciate it.

Greetings

Markus

scriptingstudio commented 1 year ago

I would use standard HTML tag notation, as PHP does. One of the "good" ideas could then be a function:

function Get-MarkupTag {
    [CmdletBinding()]
    param (
        [Parameter(ValueFromPipeline,ValueFromPipelineByPropertyName)]
        [alias('inputobject')]$tag,
        [string]$html,
        [alias('id')][string]$token # search pattern
    )
    begin {
        $replacements = @{
            '<BR>'='<BR />';  '<HR>'='<HR />'
            "&nbsp;"   = ' '; '&macr;'   = '¯'
            '&ETH;'    = 'Ð'; '&para;'   = '¶'
            '&yen;'    = '¥'; '&ordm;'   = 'º'
            '&sup1;'   = '¹'; '&ordf;'   = 'ª'
            '&shy;'    = '';  '&sup2;'   = '²'
            '&Ccedil;' = 'Ç'; '&Icirc;'  = 'Î'
            '&curren;' = '¤'; '&frac12;' = '½'
            '&sect;'   = '§'; '&Acirc;'  = 'Â'
            '&Ucirc;'  = 'Û'; '&plusmn;' = '±'
            '&reg;'    = '®'; '&acute;'  = '´'
            '&Otilde;' = 'Õ'; '&brvbar;' = '¦'
            '&pound;'  = '£'; '&Iacute;' = 'Í'
            '&middot;' = '·'; '&Ocirc;'  = 'Ô'
            '&frac14;' = '¼'; '&uml;'    = '¨'
            '&Oacute;' = 'Ó'; '&deg;'    = '°'
            '&Yacute;' = 'Ý'; '&Agrave;' = 'À'
            '&Ouml;'   = 'Ö'; '&quot;'   = '"'
            '&Atilde;' = 'Ã'; '&THORN;'  = 'Þ'
            '&frac34;' = '¾'; '&iquest;' = '¿'
            '&times;'  = '×'; '&Oslash;' = 'Ø'
            '&divide;' = '÷'; '&iexcl;'  = '¡'
            '&sup3;'   = '³'; '&Iuml;'   = 'Ï'
            '&cent;'   = '¢'; '&copy;'   = '©'
            '&Auml;'   = 'Ä'; '&Ograve;' = 'Ò'
            '&Aring;'  = 'Å'; '&Egrave;' = 'È'
            '&Uuml;'   = 'Ü'; '&Aacute;' = 'Á'
            '&Igrave;' = 'Ì'; '&Ntilde;' = 'Ñ'
            '&Ecirc;'  = 'Ê'; '&cedil;'  = '¸'
            '&Ugrave;' = 'Ù'; '&szlig;'  = 'ß'
            '&raquo;'  = '»'; '&euml;'   = 'ë'
            '&Eacute;' = 'É'; '&micro;'  = 'µ'
            '&not;'    = '¬'; '&Uacute;' = 'Ú'
            '&AElig;'  = 'Æ'; '&euro;'   = "€"        
        }       
        foreach ($r in $replacements.GetEnumerator()) {
            $l = 0 
            do {
                $l = $html.IndexOf($r.Key, $l, [StringComparison]'CurrentCultureIgnoreCase')
                if ($l -ne -1) {
                    $html = $html.Remove($l, $r.Key.Length)
                    $html = $html.Insert($l, $r.Value)
                }
            } while ($l -ne -1)         
        }
    }
    process {   
        $r = [Regex]::new("</$tag>", 'Singleline,IgnoreCase')
        $endTags = @($r.Matches($html))
        $r = [Regex]::new("<$tag[^>]*>", 'Singleline,IgnoreCase')
        $startTags = @($r.Matches($html))
        $tagText = [System.Collections.Generic.List[object]]::new()
        if ($startTags.Count -eq $endTags.Count) {
            $allTags   = $startTags + $endTags | Sort-Object Index   
            $startTags = [Collections.Stack]::new()
            foreach ($t in $allTags) {
                if (-not $t) {continue} 
                if ($t.Value -like "<$tag*") {
                    $startTags.Push($t)
                } else {
                    $start = $startTags.Pop()
                    $tagText.add($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))
                }
            }
        } else {
            # Unbalanced document, use start tags only and make sure that the tag is self-enclosed
            $startTags.Foreach{
                $t = "$($_.Value)"
                if ($t -notlike '*/>') {
                    $t = $t.Insert($t.Length - 1, '/')
                }
                $tagText.add($t)
            } 
        }
        foreach ($t in $tagText) {
            if (-not $t) {continue}
            # Correct HTML which doesn't quote the attributes so it can be coerced into XML
            $inTag = $false
            for ($i = 0; $i -lt $t.Length; $i++) {
                if ($t[$i] -eq '<') {
                    $inTag = $true
                } else {
                    if ($t[$i] -eq '>') {$inTag = $false}
                }
                if ($inTag -and ($t[$i] -eq '=')) {
                    if ($t[$i + 1] -notmatch '[''"]') {
                        $endQuoteSpot = $t.IndexOfAny(' >', $i + 1)
                        # Find the end of the attribute, then quote
                        $t = $t.Insert($i + 1, "'")
                        $t = $t.Insert($endQuoteSpot + 1, "'")                    
                    } else {
                        # Make sure the quotes are correctly formatted, otherwise,
                        # end the quotes manually
                        $whichQuote   = "$($Matches.Values)"
                        $endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)
                    }
                    $i = $endQuoteSpot
                }
            }        
            if ($token) {if ($t -match $token) {$t -replace "<$tag>|</$tag>"}} 
            else {$t -replace "<$tag>|</$tag>"}
        } # end $tagText
    } # end process
} # END Get-MarkupTag

$psp = @'
<psp>param($first,$second)</psp>
<!doctype html>
<html>
    <head>
        <title>Powershell Server Pages (PSP) demo</title>
        <meta charset="utf-8">
    </head>
    <body>
        <H3>Powershell Server Pages (PSP) demo</H3>
        <hr />
        Hello <psp> = $first </psp>,
<psp>
Write-Output "<br />"
Write-Output $second
</psp>
        <hr />
    </body>
</html>
'@

Get-MarkupTag -tag 'psp' -html $psp

# capture 1
param($first,$second)
# capture 2
 = $first

# capture 3

Write-Output "<br />"
Write-Output $second

It is not as elegant as a regex, but it works.

MScholtes commented 1 year ago

Hello @scriptingstudio,

I also thought of a function at first. But I'm not a fan of defining a function for an algorithm that is then only called once.

The reason to use the <% and %> tags is that this is a Windows web server. Therefore, the notation is based on IIS, which uses ASP as its script language with the same tags. So PSP (Powershell Server Pages) is a non-serious reference to ASP (Active Server Pages).

But many thanks for your ideas.

Greetings

Markus

scriptingstudio commented 1 year ago

Offtopic: admin permissions are also required for default port 80

MScholtes commented 1 year ago

Offtopic too: all ports below 1024 are protected and require admin permissions to use.

MScholtes / WebServer

[Suggestion] Finding pair pattern #6