Closed by scriptingstudio 1 year ago
Hello @scriptingstudio,
I looked into regular expressions first too, but found no solution for cases such as an opening <% without a closing %>, or expressions like
<% Write-Output 5%2 %>
or
<% Write-Output "<% Hello %>"%>
So I decided to write a parser. And since I'm parsing anyway, I parse everything. I know this is far less efficient than using regular expressions.
If you have a good idea, I would appreciate it.
Greetings
Markus
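To illustrate the cases above, here is a minimal sketch of a scanner that tolerates a bare % and a %> inside a quoted string, and that reports an unclosed <%. The function name Get-PspBlock is hypothetical and this is not the project's actual parser. It assumes only single- and double-quoted strings matter and ignores backtick escapes, here-strings and comments.

```powershell
# Hypothetical helper for illustration only; not the project's real parser.
function Get-PspBlock {
    param([string]$Text)

    $pos = 0
    while (($open = $Text.IndexOf('<%', $pos, [StringComparison]::Ordinal)) -ge 0) {
        $i = $open + 2
        $quote = $null   # quote character of the string literal we are inside, if any
        $close = -1
        while ($i -lt $Text.Length) {
            $c = $Text[$i]
            if ($null -ne $quote) {
                # Inside a string literal: only the matching quote ends it
                if ($c -eq $quote) { $quote = $null }
            } elseif (($c -eq '"') -or ($c -eq "'")) {
                $quote = $c
            } elseif (($c -eq '%') -and (($i + 1) -lt $Text.Length) -and ($Text[$i + 1] -eq '>')) {
                # A real closing tag, outside any string literal
                $close = $i
                break
            }
            $i++
        }
        if ($close -lt 0) {
            throw "Opening <% at offset $open has no closing %>"
        }
        # Emit the code between the tags and continue after the closing tag
        $Text.Substring($open + 2, $close - $open - 2).Trim()
        $pos = $close + 2
    }
}

Get-PspBlock 'a <% Write-Output 5%2 %> b'           # the bare % in 5%2 is not a closing tag
Get-PspBlock '<% Write-Output "<% Hello %>"%>'      # tags inside the string are skipped
```

An input such as '<% oops' makes the function throw instead of silently returning nothing, which is the behaviour a plain regex match cannot easily provide.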
I would use a standard notation for HTML tags, like PHP does. One of the "good" ideas could then be a function:
function Get-MarkupTag {
[CmdletBinding()]
param (
[Parameter(ValueFromPipeline,ValueFromPipelineByPropertyName)]
[alias('inputobject')]$tag,
[string]$html,
[alias('id')][string]$token # search pattern
)
begin {
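# HTML entities mapped to literal characters so the markup can be coerced into XML
$replacements = @{
'<BR>'='<BR />'; '<HR>'='<HR />'
'&nbsp;' = ' '; '&macr;' = '¯'
'&ETH;' = 'Ð'; '&para;' = '¶'
'&yen;' = '¥'; '&ordm;' = 'º'
'&sup1;' = '¹'; '&ordf;' = 'ª'
'&shy;' = ''; '&sup2;' = '²'
'&Ccedil;' = 'Ç'; '&Icirc;' = 'Î'
'&curren;' = '¤'; '&frac12;' = '½'
'&sect;' = '§'; '&Acirc;' = 'â'
'&Ucirc;' = 'Û'; '&plusmn;' = '±'
'&reg;' = '®'; '&acute;' = '´'
'&Otilde;' = 'Õ'; '&brvbar;' = '¦'
'&pound;' = '£'; '&Iacute;' = 'Í'
'&middot;' = '·'; '&Ocirc;' = 'Ô'
'&frac14;' = '¼'; '&uml;' = '¨'
'&Oacute;' = 'Ó'; '&deg;' = '°'
'&Yacute;' = 'Ý'; '&Agrave;' = 'À'
'&Ouml;' = 'Ö'; '&quot;' = '"'
'&Atilde;' = 'Ã'; '&THORN;' = 'Þ'
'&frac34;' = '¾'; '&iquest;' = '¿'
'&times;' = '×'; '&Oslash;' = 'Ø'
'&divide;' = '÷'; '&iexcl;' = '¡'
'&sup3;' = '³'; '&Iuml;' = 'Ï'
'&cent;' = '¢'; '&copy;' = '©'
'&Auml;' = 'Ä'; '&Ograve;' = 'Ò'
'&Aring;' = 'Å'; '&Egrave;' = 'È'
'&Uuml;' = 'Ü'; '&Aacute;' = 'Á'
'&Igrave;' = 'Ì'; '&Ntilde;' = 'Ñ'
'&Ecirc;' = 'Ê'; '&cedil;' = '¸'
'&Ugrave;' = 'Ù'; '&szlig;' = 'ß'
'&raquo;' = '»'; '&euml;' = 'ë'
'&Eacute;' = 'É'; '&micro;' = 'µ'
'&not;' = '¬'; '&Uacute;' = 'Ú'
'&AElig;' = 'Æ'; '&euro;' = "€"
}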
foreach ($r in $replacements.GetEnumerator()) {
$l = 0
do {
$l = $html.IndexOf($r.Key, $l, [StringComparison]'CurrentCultureIgnoreCase')
if ($l -ne -1) {
$html = $html.Remove($l, $r.Key.Length)
$html = $html.Insert($l, $r.Value)
# Continue searching after the inserted text instead of rescanning it
$l += $r.Value.Length
}
} while ($l -ne -1)
}
}
process {
$r = [Regex]::new("</$tag>", 'Singleline,IgnoreCase')
$endTags = @($r.Matches($html))
$r = [Regex]::new("<$tag[^>]*>", 'Singleline,IgnoreCase')
$startTags = @($r.Matches($html))
$tagText = [System.Collections.Generic.List[object]]::new()
if ($startTags.Count -eq $endTags.Count) {
$allTags = $startTags + $endTags | Sort-Object Index
$startTags = [Collections.Stack]::new()
foreach ($t in $allTags) {
if (-not $t) {continue}
if ($t.Value -like "<$tag*") {
$startTags.Push($t)
} else {
$start = $startTags.Pop()
$tagText.add($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))
}
}
} else {
# Unbalanced document, use start tags only and make sure that the tag is self-enclosed
$startTags.Foreach{
$t = "$($_.Value)"
if ($t -notlike '*/>') {
$t = $t.Insert($t.Length - 1, '/')
}
$tagText.add($t)
}
}
foreach ($t in $tagText) {
if (-not $t) {continue}
# Correct HTML which doesn't quote the attributes so it can be coerced into XML
$inTag = $false
for ($i = 0; $i -lt $t.Length; $i++) {
if ($t[$i] -eq '<') {
$inTag = $true
} else {
if ($t[$i] -eq '>') {$inTag = $false}
}
if ($inTag -and ($t[$i] -eq '=')) {
if ($t[$i + 1] -notmatch '[''"]') {
# Unquoted value: find the end of the attribute, then quote it
$endQuoteSpot = $t.IndexOfAny(' >', $i + 1)
if ($endQuoteSpot -lt 0) {break} # malformed tag, stop scanning
$t = $t.Insert($i + 1, "'")
$t = $t.Insert($endQuoteSpot + 1, "'")
} else {
# Quoted value: skip ahead to the matching closing quote
$whichQuote = "$($Matches.Values)"
$endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)
if ($endQuoteSpot -lt 0) {break} # unterminated quote, stop scanning
}
$i = $endQuoteSpot
}
}
if ($token) {if ($t -match $token) {$t -replace "<$tag>|</$tag>"}}
else {$t -replace "<$tag>|</$tag>"}
} # end $tagText
} # end process
} # END Get-MarkupTag
$psp = @'
<psp>param($first,$second)</psp>
<!doctype html>
<html>
<head>
<title>Powershell Server Pages (PSP) demo</title>
<meta charset="utf-8">
</head>
<body>
<H3>Powershell Server Pages (PSP) demo</H3>
<hr />
Hello <psp> = $first </psp>,
<psp>
Write-Output "<br />"
Write-Output $second
</psp>
<hr />
</body>
</html>
'@
Get-MarkupTag -tag 'psp' -html $psp
# capture 1
param($first,$second)
# capture 2
= $first
# capture 3
Write-Output "<br />"
Write-Output $second
It is not as elegant as a regex, but it is functional.
Hello @scriptingstudio,
I also thought of a function at first. But I'm not a fan of defining a function for an algorithm that is then only called once.
The reason to use the <% and %> tags is that this is a Windows web server. Therefore, the notation is based on IIS, which uses ASP as its script language with the same tags. So PSP (Powershell Server Pages) is a non-serious reference to ASP (Active Server Pages).
But many thanks for your ideas.
Greetings
Markus
Off-topic: admin permissions are also required for the default port 80.
Off-topic too: all ports up to port 1024 are protected and require admin permissions to use.
Using a regex would make parsing PSP files easier. I have not tested complex files, but your example works perfectly.
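For comparison, a regex-based extraction of the <psp> blocks could look roughly like the sketch below. It assumes well-formed, non-nested <psp>...</psp> pairs without attributes; the variable names $page, $pattern and $blocks are only illustrative. A literal </psp> inside a quoted string would end a block early, and an unclosed <psp> is silently ignored, which is the weakness discussed earlier in this thread.

```powershell
# Shortened sample page using the same <psp> notation as the example above
$page = @'
<psp>param($first,$second)</psp>
<body>
Hello <psp> = $first </psp>,
<psp>
Write-Output "<br />"
Write-Output $second
</psp>
</body>
'@

# (?i) ignores case, (?s) lets '.' match newlines,
# and the lazy '.*?' stops each match at the nearest </psp>
$pattern = '(?is)<psp>(.*?)</psp>'

$blocks = foreach ($m in [regex]::Matches($page, $pattern)) {
    # Group 1 holds the code between the tags
    $m.Groups[1].Value.Trim()
}

$blocks   # three strings, one per <psp> block, in document order
```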