Open Stephanevg opened 5 years ago
Hi, I think it's better to use something really dedicated to HTML. System.Xml is XML focused. I tried with the small function I created in #218, and it fails when some "special" HTML syntaxes are used ( atom stuff .. ).
I tried with the HtmlAgilityPack and ... well, it's HTML oriented, and the code is almost the same. It also works on PowerShell Core (6.2).
It's available here: https://html-agility-pack.net (download the NuGet package and unzip it somewhere).
[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")
$html = New-Object -TypeName HtmlAgilityPack.HtmlDocument
$html.LoadHtml($a)
$html.DocumentNode
Here is a working example with HtmlAgilityPack and the core PSHTML classes, as in #218.
First, load HtmlAgilityPack:
[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")
Then get the HTML code from your favorite page and copy/paste it into an HTML file.
Fetch the content (-Raw returns the file as a single string instead of an array of lines):
$a = Get-Content .\yourhtmlpage.html -Raw
And voilà:
PS C:\Users\Lx> $x = get-pshtmldocument -html $a
PS C:\Users\Lx> $x
TagName id Class Children
------- -- ----- --------
{$null}
#comment {}
html {, }
PS C:\Users\Lx> $x[2]
TagName id Class Children
PS C:\Users\Lx> $x[2].children[1].children
TagName id Class Children
------- -- ----- --------
script {}
script {var config = { autoCapture: { lineage: true }...
noscript {}
div headerArea uhf {headerRegion}
link {}
link {}
script {}
div page hfeed site {single-wrapper, wrapper-footer}
div a2a_kit a2a_kit_size_32 a2a_floating_style a2a_default_style {, , }
script {var CrayonSyntaxSettings = {"version":"_2.7.2_beta","is_admin":"0...
script {(function (undefined) {var _targetWindow ="prefer-popup"; window....
script {/*{literal}*/window.lightningjs||function(c){function g(b,d){d&&(...
div footerArea uhf {footerRegion}
link {}
link {}
script {}
script {//fix calendar hide when change month var string = window....
script {}
script {window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","...
PS C:\Users\Lx>
The function itself:
function get-pshtmldocument {
    param (
        $html
    )
    begin {
        # Recursively converts an HtmlAgilityPack node into the PSHTML class from #218.
        function HtmlToPSHTMLClass {
            param(
                $node
            )
            If ( $node.nodetype -ne 'Text' ) {
                $plop = [htmlParentElement]::New()
                $plop.SetTagName($node.Name)
                $plop.Id = $node.Attributes.where({$_.name -eq 'id'}).Value
                $plop.Class = $node.Attributes.where({$_.name -eq 'class'}).Value
                If ( $node.hasChildNodes ) {
                    foreach ( $n in $node.childnodes ) {
                        # Whitespace-only text nodes (the indentation between tags) are skipped.
                        If ( $n.nodetype -eq 'Text' -and $n.InnerText.trim() -ne '' ) {
                            $child = $n.InnerText
                            $plop.AddChild( $child )
                        } elseif ( $n.nodetype -ne 'Text') {
                            $child = HtmlToPSHTMLClass -node $n
                            $plop.AddChild( $child )
                        }
                    }
                }
                # Emit the element only when one was built, so a top-level text node returns nothing.
                $plop
            }
        }
    }
    process {
        $document = New-Object -TypeName HtmlAgilityPack.HtmlDocument
        $document.LoadHtml($html)
        # Convert every top-level node of the parsed document.
        Foreach( $node in $document.DocumentNode.ChildNodes ) {
            HtmlToPSHTMLClass -node $node
        }
    }
    end {
    }
}
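The function above relies on the [htmlParentElement] class from #218, which is not shown in this thread. To try the function in isolation, a stand-in like the following might do. It is only a sketch: the member names (TagName, Id, Class, Children, SetTagName, AddChild) are taken from how the function and the table output use them, and everything else, including the property types, is an assumption that may not match the real PSHTML class.

```powershell
# Hypothetical stand-in for the class from #218; the real PSHTML class may differ.
class htmlParentElement {
    [string] $TagName
    [string] $Id
    [string] $Class
    # Children holds either plain strings (text) or nested htmlParentElement objects.
    [System.Collections.Generic.List[object]] $Children = [System.Collections.Generic.List[object]]::new()

    [void] SetTagName([string] $name) {
        $this.TagName = $name
    }

    [void] AddChild([object] $child) {
        $this.Children.Add($child)
    }
}

# Quick check that the members used by get-pshtmldocument exist and behave.
$div = [htmlParentElement]::new()
$div.SetTagName('div')
$div.Id = 'headerArea'
$div.Class = 'uhf'
$div.AddChild('some text')
$div
```

With this class defined and HtmlAgilityPack loaded, get-pshtmldocument should run as posted, since it only touches these members.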
A side note: The HTML Agility Pack (HAP) is MIT licenced. So we could strongly consider it...
Another side note: It looks like Justin Grote already wrote a PowerShell implementation of the Agility Pack: PowerHTML (under MIT as well).
It would be nice to have a function which could read an HTML page and send an object back, which could then be developed further, or even converted to a PSHTML PowerShell file (is that utopian?).
1) The parsing
For that, we will need the ability to parse a HTML document.
This snippet might be an option to do so:
2) Create a PSHTML.Document object
Once it is parsed (or while parsing), we could create the corresponding PSHTML object for each HTML element. This assumes that this issue is closed and implemented first -> https://github.com/Stephanevg/PSHTML/issues/218
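As for the "converted to a PSHTML PowerShell file" idea, here is a rough sketch of what that last step could look like once an object tree exists. ConvertTo-PSHTMLSource is a hypothetical name, not an existing PSHTML function. The input is assumed to be objects with TagName, Id, Class and Children members (plain strings for text), and the emitted `tag -Id '...' -Class '...' { ... }` shape is an assumption about PSHTML syntax that would need checking against the module. Nodes such as #comment are not handled.

```powershell
# Hypothetical helper: turns a node tree into PSHTML-style source lines.
function ConvertTo-PSHTMLSource {
    param(
        $Node,
        [int] $Depth = 0
    )
    $indent = '    ' * $Depth
    if ($Node -is [string]) {
        # Text children become single-quoted literals; embedded quotes are doubled.
        "{0}'{1}'" -f $indent, $Node.Replace("'", "''")
        return
    }
    $header = $indent + $Node.TagName
    if ($Node.Id)    { $header += " -Id '$($Node.Id)'" }
    if ($Node.Class) { $header += " -Class '$($Node.Class)'" }
    "$header {"
    foreach ($child in $Node.Children) {
        # Recurse one indentation level deeper for each child.
        ConvertTo-PSHTMLSource -Node $child -Depth ($Depth + 1)
    }
    "$indent}"
}

# Made-up sample tree, just to exercise the function.
$tree = [pscustomobject]@{
    TagName  = 'div'
    Id       = 'headerArea'
    Class    = 'uhf'
    Children = @(
        [pscustomobject]@{ TagName = 'p'; Id = $null; Class = $null; Children = @('Hello') }
    )
}
$source = ConvertTo-PSHTMLSource -Node $tree
$source -join "`n"
```

The result is a list of text lines, so it could be written to a .ps1 file with Set-Content. Whether the generated file renders the same HTML again depends on how closely PSHTML's functions follow the assumed shape.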