No9 / harmon

middleware for node-http-proxy to modify the remote website response with trumpet
Other
424 stars 62 forks source link

Dealing with bad html #32

Open kouk opened 9 years ago

kouk commented 9 years ago

Depending on the source markup there are times when the trumpet pipeline will not finish processing even though the response has been written entirely (see below). It would be great if there was a way to set a timeout and cancel the pipeline somehow. For example:

var http = require('http'),
    connect = require('connect'),
    through = require('through'),
    httpProxy = require('http-proxy');

var selects = [];
var simpleselect = {};

simpleselect.query = 'body';
simpleselect.func = function (node) {
    var ts = node.createStream();
    ts.pipe(through(null, function(){
        console.log("Injecting");
        this.queue("FOOBAR");
        this.queue(null);
    })).pipe(ts);

    ts.on('end', function() {
     console.log("BODY");
    });
}

selects.push(simpleselect);

var app = connect();

var proxy = httpProxy.createProxyServer({
    target: 'http://localhost:8002'
})

app.use(require('harmon')([], selects, true));

app.use(
  function (req, res) {
    proxy.web(req, res);
  }
);

http.createServer(app).listen(8001);

http.createServer(function(req, res) {
    res.setHeader('Content-Type', 'text/html; charset=UTF-8');
    res.setHeader('Transfer-Encoding', 'chunked');
    res.write('<html><head><title>foo</title></head><body><h1>Foo</h1>');
    /* With this line the request hangs
    res.write('<div><svg><defs><circle /></defs><circle /></svg></div>');
    */
    res.end('</body></html>');
}).listen(8002);

If you run the example and curl http://localhost:8001 it works fine. But if you uncomment the commented line and try again it hangs. I've tracked down the specific underlying reason for this in the html-select module, and I'll raise this specific issue with them. But what I'm interested in is shielding myself from these kinds of bugs (which may be triggered by broken proxied web pages) by doing setTimeout and aborting the trumpet pipeline somehow. I've tried various combinations of res.end() and tr.end() but the stream from tr.createStream just never seems to die. I would appreciate if you could show me a way, perhaps even a totally different approach.

kouk commented 9 years ago

There's also a discussion here which might be related: https://github.com/substack/html-tokenize/issues/13

But all this isn't really future proof, it would be great if somehow we can call ts.end() or node.abort() and have it kill all the streams that might be blocking.

mspl13 commented 9 years ago

I'm running in the same issue using this gist. If you uncomment line 45 and comment 44 out, it won't stop loading and you can't porceed with anything. Couldn't find any solution to this point and can't really think of any way to stop it from loading...

No9 commented 9 years ago

From the notes here I agree this feels like an upstream fix. But thanks for the steps to reproduce.

mspl13 commented 9 years ago

Sorry for seeming impatient (I'm not, just curious) but do you already know what you gonna do about it? Reporting this as a bug to the actual module that is affected or fix it by yourself? Since you added the 'help-wanted' tag, do you wait for somebody you know to help you out with this?

No9 commented 9 years ago

Hi @mspl13 I am not actively looking at this at the moment. Hence the help wanted tag. I would love somebody to help with this if they have the bandwidth.

chandransuresh commented 6 years ago

Hi @No9 I faced issue with bad HTML as well ( "end" event does not get fired for bad HTML) , I ended up adding this code to fix it: Would you let me create a PR with a change similar to the following? (need to get the Max timeout from configuration). Thanks.

let endEmittedInterval;

let endEmittedInterval;
tr.on('finish', () => {
            let intervalCount = 0;
            endEmittedInterval = setInterval(() => {
                try {
                    intervalCount += 1;
                    // make sure read and write are completed after finish
                    if (tr._writableState.ended === true && tr._tokenize._readableState.endEmitted === true) {
                        clearInterval(endEmittedInterval);
                        _end.call(res);
                    } else if (intervalCount >= 6000) {
                        // clear the runway interval after 60 seconds
                        // TODO: change 60 seconds to server's MAXIMUM_TIMEOUT value
                        clearInterval(endEmittedInterval);
                        _end.call(res);
                    }
                } catch (err) {
                    clearInterval(endEmittedInterval);
                    _end.call(res);
                }
            }, 10);
        });
No9 commented 6 years ago

Hi @chandransuresh Thanks for your message.

Do you know if this patch to html-select fix your issue? https://github.com/substack/html-select/pull/23

If it does we might want to fork that along with trumpet and publishing them in there own namespace?

If not then we can consider your workaround above and putting it into the harmon constructor something like

var opts = {};
opts.timeout = 6000;
app.use(require('../')([], selects, opts));

Let me know how you get on

chandransuresh commented 6 years ago

Hi @No9 , Thanks for the response. I tested with the change in kouk:master branch from substack/html-select#23. No, it does not fix the issue of broken HTML. In my sample page, there are bunch of tags that does not have any matching close tags.

chandransuresh commented 6 years ago

Hi @No9 , Sorry for the delay in getting back. Could you give me access to create a PR for malformed HTML fix?. The changed file is in the following clone. https://github.com/chandransuresh/harmon-malformed-html-fix/blob/fix-for-malformed-html/index.js

Thanks, Suresh