danmactough / node-feedparser

Robust RSS, Atom, and RDF feed parsing in Node.js
Other
1.97k stars 192 forks source link

Feedparser "end" event called twice for bad feeds #186

Closed cdellinger closed 7 years ago

cdellinger commented 7 years ago

I have an app that crawls a variety of RSS feeds. Every so often a feed that previously worked will return a "Not a feed" error, usually because the site has exceeded its bandwidth and its returning some html saying the site is not available which is obviously not in the form of valid rss. My problem in those situations is the feedparser "end" event appears to be fired twice, once with the error and once without an error. I've distilled down the code to the following snippit where it tries to call against a site that returns html but not in a valid rss format (http://www.google.com). When I run it, I get a call to feedparser.on('end') once with the error, quickly followed up with another call where there is no error.

var FeedParser = require('feedparser');
var request = require('request');

var req = request('http://www.google.com');
var feedparser = new FeedParser();

req.on('error', function (error) {
    console.log('Error in request');
    feedparser.emit('end', error);
});
req.on('response', function (res) {
    console.log('Inside request response...');
    var stream = this;
    if (res.statusCode != 200) return this.emit('error', new Error('Bad status code (' + res.statusCode + ')'));
    stream.pipe(feedparser);
});

feedparser.on('end', function(err){
    console.log('Inside feedparser end...');
    if (err){
        console.log('Error was passed to feedparser end event');
        console.log('----------');
        console.log(err);
        console.log('----------');
    }
    else{
        console.log('No error was passed to feedparser end event...');
    }
});
feedparser.on('error', function(error) {
    console.log('Inside error for feedparser');
    //console.log(error);
    this.emit('end', error);
});
feedparser.on('readable', function() {
    console.log('Inside feedparser readable event...');
});
fiznool commented 7 years ago

I think the fact the end event is being fired twice is because, from the code snippet above, there is a

this.emit('end', error);

inside the error handler. If this is removed, the error handler is called, followed by the end handler (although no error is passed to end). Perhaps this is expected behaviour?

cdellinger commented 7 years ago

Thanks @fiznool for the response. I had that line there to force an end, but not sure if I truly need it. Unfortunately for this issue, it doesn't matter, I get the same behavior if I comment that line out. The issue I'm facing doesn't happen due to an error in the request, in my example the call to http://www.google.com is valid and the request object executes the request successfully. The error is raised inside feedparser since the contents of google.com are not valid rss (or atom). I expect the first call to the feedparser "end" event where the error (Not a Feed) is passed, I just don't understand why its executed a second time without an error being passed.

danmactough commented 7 years ago

@cdellinger Which node version and feedparser version?

Here's the output I see (node/4.x, feedparser/HEAD):

Inside request response...
Inside error for feedparser
Inside feedparser end...
Error was passed to feedparser end event
----------
[Error: Not a feed]
----------
Inside feedparser readable event...
cdellinger commented 7 years ago

Node: 6.9.1 Feedparser: 1.1.5

Output:

Inside request response...
Inside error for feedparser
Inside feedparser end...
Error was passed to feedparser end event
----------
Error: Not a feed
    at FeedParser.handleEnd (/Users/cdellinger/node_modules/feedparser/main.js:123:13)
    at emitNone (events.js:86:13)
    at SAXStream.emit (events.js:185:7)
    at Object.SAXStream._parser.onend (/Users/cdellinger/node_modules/sax/lib/sax.js:184:8)
    at emit (/Users/cdellinger/node_modules/sax/lib/sax.js:615:33)
    at end (/Users/cdellinger/node_modules/sax/lib/sax.js:654:3)
    at Object.end (/Users/cdellinger/node_modules/sax/lib/sax.js:149:24)
    at SAXStream.end (/Users/cdellinger/node_modules/sax/lib/sax.js:234:16)
    at FeedParser._flush (/Users/cdellinger/node_modules/feedparser/main.js:1066:17)
    at FeedParser.<anonymous> (/Users/cdellinger/node_modules/readable-stream/lib/_stream_transform.js:135:12)
----------
Inside feedparser end...
No error was passed to feedparser end event...
danmactough commented 7 years ago

Ok, this took a lot of sleuthing ... because I had a more recent readable-stream version installed for testing, and -- surprise! -- there's a difference. 😄

Bottom line: what should happen with a bad feed (if you're expecting the current stream API) is (1) error event is emitted and (2) end event is not emitted. But the version of readable-stream we're currently using is the Streams 2 (a/k/a Node 0.10.x) API. In other words, this behavior is as-designed, but out-of-step with the current stream API. Time for a major version bump for feedparser.

cdellinger commented 7 years ago

Hmm, so let me make sure I understand this, the behavior we should see for this scenario is an error event is emitted and an end event is not emitted, is that correct? I see that with the code example from above with the following output:

Inside request response...
Inside error for feedparser
Inside feedparser readable event...

The weird thing is if put any logic to process the stream in the readable event, I see the end event be emitted. So if I make the readable event as follows I will get the following output:

feedparser.on('readable', function() {
    console.log('Inside feedparser readable event...');
    var post;
    while (post = this.read()) {
      console.log(post);
    }
});

Output

Inside request response...
Inside error for feedparser
Inside feedparser readable event...
Inside feedparser end...
No error was passed to feedparser end event...

Is that behavior expected? Should the readable event check to see if an error occurred and not process if that was the case?