Stephanevg / PSHTML

Cross platform Powershell module to generate HTML markup language
https://pshtml.readthedocs.io/en/latest/

Consider Get-PSHTMLDocument #250

Open Stephanevg opened 5 years ago

Stephanevg commented 5 years ago

It would be nice to have a function that could read an HTML page and return an object, which could then be developed further or even converted to a PSHTML PowerShell file (is that utopian?).

1) The parsing

For that, we will need the ability to parse an HTML document.

This snippet might be an option to do so:

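Add-Type -AssemblyName System.Xml.Linq
# Read the whole file as one string (parsing only succeeds for well-formed XHTML)
$txt = [IO.File]::ReadAllText("c:\myhtml.html")
$xml = [System.Xml.Linq.XDocument]::Parse($txt)
$ns = 'http://www.w3.org/1999/xhtml'
# Select every <td> element in the XHTML namespace
$cells = $xml.Descendants("{$ns}td")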

2) Create a PSHTML.Document object

Once it is parsed (or while parsing), we could create the corresponding PSHTML object for each HTML element. This assumes that the following issue is closed and implemented first: https://github.com/Stephanevg/PSHTML/issues/218
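As a rough illustration of step 2, here is a minimal sketch that walks a parsed document and builds one object per element. It relies on System.Xml.Linq as in the snippet above, so it only works on well-formed input. `ConvertTo-ElementTree` is a hypothetical helper name, and the `PSCustomObject` with `TagName`, `Id`, `Class` and `Children` is only a stand-in for whatever class #218 ends up providing.

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper (not part of PSHTML): converts an XElement into a plain object tree.
function ConvertTo-ElementTree {
    param([System.Xml.Linq.XElement]$Element)

    # Collect attributes by local name so 'id' and 'class' are easy to look up.
    $attr = @{}
    foreach ($a in $Element.Attributes()) { $attr[$a.Name.LocalName] = $a.Value }

    # Recurse into child elements; keep text nodes that are not whitespace-only.
    $children = @(
        foreach ($node in $Element.Nodes()) {
            if ($node -is [System.Xml.Linq.XElement]) {
                ConvertTo-ElementTree -Element $node
            }
            elseif ($node -is [System.Xml.Linq.XText] -and $node.Value.Trim()) {
                $node.Value.Trim()
            }
        }
    )

    # Stand-in for a PSHTML element object; the real class may look different.
    [pscustomobject]@{
        TagName  = $Element.Name.LocalName
        Id       = $attr['id']
        Class    = $attr['class']
        Children = $children
    }
}

# Made-up sample markup for the demo.
$sample = '<div id="page" class="site"><p>Hello</p><p class="note">World</p></div>'
$tree   = ConvertTo-ElementTree -Element ([System.Xml.Linq.XDocument]::Parse($sample).Root)
$tree.Children | ForEach-Object { '{0}: {1}' -f $_.TagName, ($_.Children -join ' ') }
```

Text content is stored as plain strings next to child elements, which keeps the tree easy to inspect at the prompt.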

LxLeChat commented 5 years ago

Hi, I think it's better to use something dedicated to HTML; System.Xml is XML-focused. I tried with the small function I created in #218, and it fails when some "special" HTML syntax is used (Atom stuff...).
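To make the limitation concrete, here is a minimal sketch showing that an XML parser accepts well-formed XHTML but rejects ordinary HTML as soon as a void element such as `<br>` is left unclosed. Both sample strings and the `Test-XmlParsable` helper are made up for this demo.

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical sample inputs, not taken from a real page.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr><td>a</td><td>b</td></tr></table></body></html>'
$html  = '<html><body><p>line one<br>line two</p></body></html>'

# Hypothetical helper: reports whether the text is well-formed XML.
function Test-XmlParsable {
    param([string]$Text)
    try {
        [void][System.Xml.Linq.XDocument]::Parse($Text)
        return $true
    }
    catch {
        # Parse throws for markup that is not well-formed XML.
        return $false
    }
}

$xhtmlOk = Test-XmlParsable -Text $xhtml
$htmlOk  = Test-XmlParsable -Text $html
"XHTML parses: $xhtmlOk"
"HTML parses:  $htmlOk"
```

The first check should succeed and the second should fail, because the unclosed `<br>` no longer matches the closing `</p>`.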

I tried with HtmlAgilityPack and... well, it's HTML-oriented, and using it is almost the same. It also works on PowerShell Core (6.2).

It's available here: https://html-agility-pack.net (download the NuGet package and unzip it somewhere).

[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")
$html = New-Object -TypeName HtmlAgilityPack.HtmlDocument
# $a holds the HTML source as a string
$html.LoadHtml($a)
$html.DocumentNode

LxLeChat commented 5 years ago

Here is a working example with HtmlAgilityPack and the core PSHTML classes from #218.

First, load HtmlAgilityPack:

[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")

Then get the HTML code from your favorite page, copy/paste it into an HTML file, and fetch the content as a single string:

$a = Get-Content -Raw .\yourhtmlpage.html

and voila:

PS C:\Users\Lx> $x = get-pshtmldocument -html $a
PS C:\Users\Lx> $x

TagName  id Class Children
-------  -- ----- --------
                  {$null}
#comment          {}
html              {, }    

PS C:\Users\Lx> $x[2]

TagName id Class Children

PS C:\Users\Lx> $x[2].children[1].children

TagName  id         Class                                                        Children
-------  --         -----                                                        --------
script                                                                           {}
script                                                                           {var config = {     autoCapture: {             lineage: true     }... 
noscript                                                                         {}
div      headerArea uhf                                                          {headerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
div      page       hfeed site                                                   {single-wrapper, wrapper-footer}
div                 a2a_kit a2a_kit_size_32 a2a_floating_style a2a_default_style {, , }
script                                                                           {var CrayonSyntaxSettings = {"version":"_2.7.2_beta","is_admin":"0... 
script                                                                           {(function (undefined) {var _targetWindow ="prefer-popup"; window.... 
script                                                                           {/*{literal}*/window.lightningjs||function(c){function g(b,d){d&&(... 
div      footerArea uhf                                                          {footerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
script                                                                           {//fix calendar hide when change month        var string = window.... 
script                                                                           {}
script                                                                           {window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","... 

PS C:\Users\Lx>

the function itself:

function get-pshtmldocument {
    param (
        $html
    )

    begin {

       function HtmlToPSHTMLClass {
            param(
                $node
            )

            # Text nodes are handled by the parent call, so only build an element for everything else
            If ( $node.nodetype -ne 'Text' ) {

                $plop = [htmlParentElement]::New()
                $plop.SetTagName($node.Name)
                $plop.Id = $node.Attributes.where({$_.name -eq 'id'}).Value
                $plop.Class = $node.Attributes.where({$_.name -eq 'class'}).Value

                If ( $node.hasChildNodes ) { 
                    foreach ( $n in $node.childnodes ) {
                        # Skip 'empty' text nodes (typically the whitespace between tags)
                        If ( $n.nodetype -eq 'Text' -and $n.InnerText.trim() -ne '' ) {
                            $child = $n.InnerText
                            $plop.AddChild( $child )
                        } elseif ( $n.nodetype -ne 'Text') {
                            $child = HtmlToPSHTMLClass -node $n
                            $plop.AddChild( $child )
                        }
                    }
                }

                # Emit the element only when one was built
                $plop
            }
        } 

    }

    process {

        $document = New-Object -TypeName HtmlAgilityPack.HtmlDocument
        $document.LoadHtml($html)

        Foreach( $node in $document.DocumentNode.ChildNodes ) {
            HtmlToPSHTMLClass -node $node
        }

    }

    end {

    }

}
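Going one step further towards the "convert to a PSHTML file" idea, here is a minimal sketch that turns such a tree back into PSHTML-style script text. `ConvertTo-PSHTMLScript` is a hypothetical name, the `PSCustomObject` nodes are stand-ins for the classes from #218 (same `TagName`, `Id`, `Class`, `Children` shape as the output above), and the emitted `tag -Id '...' -Class '...' { }` form is my assumption about how the PSHTML functions would be called. Other attributes, comments and script bodies are not handled.

```powershell
# Hypothetical helper: renders an object tree as PSHTML-style script text.
function ConvertTo-PSHTMLScript {
    param(
        $Node,
        [int]$Depth = 0
    )

    $indent = '    ' * $Depth

    # Text children are plain strings; emit them as single-quoted literals.
    if ($Node -is [string]) {
        return "$indent'" + $Node.Replace("'", "''") + "'"
    }

    $header = $indent + $Node.TagName
    if ($Node.Id)    { $header += " -Id '$($Node.Id)'" }
    if ($Node.Class) { $header += " -Class '$($Node.Class)'" }

    # One line for the opening brace, one block per child, one line for the closing brace.
    $lines = @("$header {")
    foreach ($child in $Node.Children) {
        $lines += ConvertTo-PSHTMLScript -Node $child -Depth ($Depth + 1)
    }
    $lines += "$indent}"
    $lines -join "`n"
}

# Made-up sample tree for the demo.
$tree = [pscustomobject]@{
    TagName  = 'div'
    Id       = 'page'
    Class    = 'hfeed site'
    Children = @(
        [pscustomobject]@{ TagName = 'p'; Id = $null; Class = $null; Children = @('Hello') }
    )
}

$pshtmlText = ConvertTo-PSHTMLScript -Node $tree
$pshtmlText
```

Generating text rather than invoking the PSHTML functions directly means the result can be saved to a .ps1 file and edited by hand afterwards.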

Stephanevg commented 4 years ago

A side note: The HTML Agility Pack (HAP) is MIT licenced. So we could strongly consider it...

Stephanevg commented 6 months ago

Another side note: it looks like Justin Grote has already written a PowerShell implementation of the Agility Pack: PowerHTML (also under MIT).