Simply-Static / simply-static

Convert your WordPress site into a static one with the free WordPress static site generator plugin, Simply Static.
https://simplystatic.com
GNU General Public License v2.0
200 stars 49 forks source link

Add a hook leveraging SimpleHtmlDom to detect invalid html #133

Closed pixeline closed 1 year ago

pixeline commented 1 year ago

We had an issue today where invalid html (a closing div tag was missing) was pushed to our sinatra-based website. It uses Nokogiri, which is very fault intolerant and, not seeing the missing div, removed the faulty container, thus creating a visual havoc on our public frontend.

I am looking at a potential way to spot this earlier in the chain and am wondering whether Simply Static could perhaps do it, since if I understand correctly, it renders the html using SimpleHtmlDom.

Is there a hook I could use to ask Simply Static to stop whenever an invalid html is detected, and issue a message with the url containing the invalid html code ?

Patrick by email told me:

not exactly a filter or action, but you may use the SimpleHtmlDom package to your advantage here. Simply Static has a method called extract_urls_and_update_tag() (src/class-ss-url-extractor.php) which handles the extraction of URLs from HTML tags. The tag itself is an object that contains all the information about the tag that got parsed with SimpleHTMLDom. So if there is an error here, that's most likely because of invalid HTML. That was also one of the reasons we use $tag->innerhtmlKeep instead of $tag->innerhtml internally now (the second one throws an expectation on non-valid HTML).

I made a simple function that checks whether the html is well formed (all open tags are closed).

 /**
  * We need to make sure the static html is valid html.
  * This hack makes sure the generation stops if the html is invalid.
  * inspired by https://stackoverflow.com/a/3167315/53960
  * example use:
  * $test = is_html_valid('<html><body><h1>Test</h1><p>yollo</p></body></html>');
  * @return bool
  */
if (!function_exists('is_html_valid')) {
    function is_html_valid($html)
    {
        $start = strpos($html, '<');
        $end = strrpos($html, '>', $start);

        $len = strlen($html);

        if ($end !== false) {
            $html = substr($html, $start);
        } else {
            $html = substr($html, $start, $len - $start);
        }
        libxml_use_internal_errors(true);
        libxml_clear_errors();
        $xml = simplexml_load_string($html);
        return count(libxml_get_errors()) == 0;
    }
}

Would it be okay for you to add a filter so that I can hook into it, so that I could do something like "interrupt the static process and add a final message with a description of the issue (url of the invalid html) in the SSP Console ?

When this works, I can definitely contribute this as a gist so you can showcase it on the plugin website. I feel it would solve a real business need if the Simply Static static generation process is able to stop when potential website defacing is pushed. Especially in the age of CI/CD, in a WordPress ecosystem full of plugins of various quality...

Thank you for your support!

igorbenic commented 1 year ago

@pixeline check PR: https://github.com/Simply-Static/simply-static/pull/140

That should be enough.

Currently, we already have the action there, but not the methods set there to access protected data.

pixeline commented 1 year ago

@igorbenic Thanks! I just tested it and it works super well!

I just noticed something a bit strange: the log shows the error message, but the final log does not show it.

ssp-message-1

ssp-message-2

The code I used:


add_action( 'ss_after_extract_and_replace_urls_in_html', 'cancel_if_invalid_html', 20, 2 );

function cancel_if_invalid_html( $dom, $extractor ) {
  if ( !is_html_valid( $dom->innerHtml )) {
    $message = 'Invalid HTML on URL ' . $extractor->get_static_page()->url;
    // Abort the job if the HTML is invalid.
     \Simply_Static\Plugin::instance()->get_archive_creation_job()->cancel( $message );
  }
}

if (!function_exists('is_html_valid')) {
    function is_html_valid($html)
    {
        $start = strpos($html, '<');
        $end = strrpos($html, '>', $start);

        $len = strlen($html);

        if ($end !== false) {
            $html = substr($html, $start);
        } else {
            $html = substr($html, $start, $len - $start);
        }
        libxml_use_internal_errors(true);
        libxml_clear_errors();
        $xml = simplexml_load_string($html);
        return count(libxml_get_errors()) == 0;
    }
}

Not sure what is going on there. I guess the "parent" job took over the log entry slot and overwrote the message ?