Don't stop reading prematurely when looking ahead

Filling our reader buffers is not predictable, because an InputStream may deliver less data than requested. As a result, the buffer sometimes looked as if it had reached the end of all data when it was merely not fully filled.

I have not yet found a test case for this, nor a way to write one. Everything has been tested with real websites and XLT. The markup that failed most often, though only when the buffer reached its "pending" state, is:

<div id="primary" class="primary-content">

<!-- CQuotient Activity Tracking (viewProduct-cquotient.js) -->
<script type="text/javascript">//<!--
/* <![CDATA[ */
(function(){
    try {
        if(window.CQuotient) {
            var cq_params = {};
            cq_params.product = {
                    id: '016304000',
                    sku: '',
                    type: '',
                    alt_id: ''
                };
            cq_params.realm = "ABCD";
            cq_params.siteId = "ACMECorp";
            cq_params.instanceType = "prd";
            window.CQuotient.trackViewProduct(cq_params);
        }
    } catch(err) {}
})();
/* ]]> */
// -->
</script>

<div id="pdpMain" class="pdp-main pdp-main-container" itemscope itemtype="https://schema.org/Product">
...

We did not identify the </script> correctly because we saw only / and not /script. So this markup parses just fine as long as the buffer still holds enough data, and fails only when we have underread the input stream. The effect was that all HTML after that point ended up in the comment or script content rather than in the DOM tree.
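The failure mode can be reproduced in isolation with a reader that delivers less than requested. The following is a minimal sketch, not the parser's actual code: `LookaheadDemo`, `oneCharAtATime`, `naiveMatch`, and `patientMatch` are hypothetical names made up for illustration, and the comparison is case-sensitive for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LookaheadDemo {
    /** Hypothetical source that hands out at most one char per read call. */
    static Reader oneCharAtATime(String s) {
        final StringReader in = new StringReader(s);
        return new Reader() {
            @Override
            public int read(char[] b, int off, int len) throws IOException {
                return in.read(b, off, Math.min(len, 1));
            }

            @Override
            public void close() {
                in.close();
            }
        };
    }

    /** Naive lookahead: a single read, then judge by whatever arrived. */
    static boolean naiveMatch(Reader reader, String expected) throws IOException {
        char[] buf = new char[expected.length()];
        int n = reader.read(buf, 0, buf.length);
        // A short read is mistaken for "no match" here.
        return n == expected.length() && new String(buf, 0, n).equals(expected);
    }

    /** Patient lookahead: keep loading until enough chars arrived or the reader returns -1. */
    static boolean patientMatch(Reader reader, String expected) throws IOException {
        char[] buf = new char[expected.length()];
        int filled = 0;
        while (filled < buf.length) {
            int n = reader.read(buf, filled, buf.length - filled);
            if (n == -1) {
                return false; // real end of data, not just a partially filled buffer
            }
            filled += n;
        }
        return new String(buf).equals(expected);
    }
}
```

With the input `/script>` delivered one character at a time, `naiveMatch` sees only `/` and reports no match, while `patientMatch` waits for the remaining characters and matches. Only a return value of -1 is treated as the end of the data.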

I suggest we consider a new stream and buffer design that takes the manual evaluation of state and load behavior out of the parser code. Instead, the stream gets suitable methods such as mark/rewind/readAhead and more. I will follow up on that idea soon.
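To make the idea more concrete, here is a rough sketch of what such a stream could look like. It is an assumption about one possible shape, not the planned design: the class name `LookaheadReader`, the method signatures, and the simplistic grow-only buffer are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

/** Hypothetical wrapper that hides buffer loading from the parser. */
public class LookaheadReader {
    private final Reader in;
    private char[] buf = new char[16];
    private int pos;        // index of the next char to hand out
    private int limit;      // number of valid chars in buf
    private int mark = -1;  // position remembered by mark(), -1 if unset
    private boolean eof;    // true once the source returned -1

    public LookaheadReader(Reader in) {
        this.in = in;
    }

    /** Loads until n unread chars are buffered or the source ended; returns how many of the n are available. */
    private int ensure(int n) throws IOException {
        while (limit - pos < n && !eof) {
            if (limit == buf.length) {
                // Sketch only: the buffer grows and is never compacted.
                buf = Arrays.copyOf(buf, buf.length * 2);
            }
            int r = in.read(buf, limit, buf.length - limit);
            if (r == -1) {
                eof = true;
            } else {
                limit += r;
            }
        }
        return Math.min(n, limit - pos);
    }

    /** Returns the next char, or -1 at the real end of input. */
    public int read() throws IOException {
        if (ensure(1) == 0) {
            return -1;
        }
        return buf[pos++];
    }

    /** Remembers the current position. */
    public void mark() {
        mark = pos;
    }

    /** Jumps back to the position remembered by mark(). */
    public void rewind() {
        if (mark < 0) {
            throw new IllegalStateException("no mark set");
        }
        pos = mark;
    }

    /** Returns up to n upcoming chars without consuming them; shorter only at the real end of input. */
    public String readAhead(int n) throws IOException {
        int available = ensure(n);
        return new String(buf, pos, available);
    }
}
```

With something like this, the parser could consume `<` and then ask `readAhead(7)` to decide whether `/script` follows, without checking how full the buffer currently is.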

HtmlUnit / htmlunit-neko

Don't stop reading prematurely when looking ahead #98