danmactough / node-feedparser

Robust RSS, Atom, and RDF feed parsing in Node.js
Other
1.97k stars 192 forks source link

Unhandled incorrect header/body check #277

Closed synzen closed 5 years ago

synzen commented 5 years ago

Here I used node-fetch to with Accept-Encoding: gzip,deflate (the default) for a feed that sends a erroneously sends back a response with content-encoding: gzip when the body is not actually compressed. Unfortunately it seems that feedparser doesn't detect such a problem here. node-fetch doesn't throw any errors for this unless you call res.text(), res.json() (see https://github.com/bitinn/node-fetch/issues/45), and instead I immediately pipe it to feedparser rather than letting node-fetch handle it

In my example below feedparser doesn't emit error and doesn't emit end.

feedparser v2.2.9 Node v12.7.0

Feed: https://eliteesp.es/category/galnet/feed/

const FeedParser = require('feedparser')
const fetch = require('node-fetch')
const feedparser = new FeedParser()

fetch('https://eliteesp.es/category/galnet/feed/').then(res => {
  if (res.status !== 200) throw new Error('Non-200 status code')
  res.body.pipe(feedparser)
})

feedparser.on('error', function (error) {
  console.log('fp err', error)
  // always handle errors
});

feedparser.on('readable', function () {
  // This is where the action is!
  var stream = this; // `this` is `feedparser`, which is a stream
  var meta = this.meta; // **NOTE** the "meta" is always available in the context of the feedparser instance
  var item;

  while (item = stream.read()) {
    console.log(item);
  }
})

feedparser.on('end', function () {
  console.log('end')
})

There is no output when running this. Not sure if feedparser is supposed to handle this case though

rdbcci commented 5 years ago

Not sure what your asking, but if fetch returns encoded (gzip) you need to decode it before piping to feedparser. When I do this using a mikeal request, this feed parses just fine. Note, this feed is not encoded.

synzen commented 5 years ago

node-fetch doesn't auto-decode the body unless prompted by the user (which is why it's lightweight and faster from my own tests - like the browser's fetch API) - instead, I let feedparser take on the body. node-fetch works over 99.999% of the time (it works reliably, and is lighter than other libs!) until I encountered this issue, where feedparser failed to detect an invalid feed for a feed like this (had me scratching me head for days as to what was happening).

You can see the link above has invalid content-encoding headers when you open it in chrome and you see ERR_CONTENT_DECODING_FAILED

image

and the explanation for it: https://stackoverflow.com/questions/14039804/error-330-neterr-content-decoding-failed. TLDR; response says its gzipped, and it's actually not - browser is confused. I assume feedparser is confused here as well, but instead of giving an error, feedparser just hangs indefinitely

rdbcci commented 5 years ago

Not a node-fetch user, but parses fine as a mikeal request (feed header in this case indicates that is not encoded).

synzen commented 5 years ago

I get that request works, as you've said before, but I'm saying the problem is that feedparser isn't emitting any errors here, as I mentioned in my last sentence

rdbcci commented 5 years ago

Yea, just catching up to subtlety of your problem.

rdbcci commented 5 years ago

I just set myself up to use node-fetch. When I run it with your feed and {compress:false} option, it works.

synzen commented 5 years ago

Yes it will work in that case because then their server will send an uncompressed body, with no gzip in content-encoding (the main problem with their feed)

but the point here is that feedparser is not emitting an error here when I really think it should (if it's possible) - the issue is basically being swallowed up by the parser. I'm not trying to find a way to work with the feed, just trying to suggest that feedparser should emit an error so

  1. Someone else doesn't spend hours scratching their heads why a feed may never resolve through feedparser
  2. To handle it appropriately in error handling
rdbcci commented 5 years ago

I think Feedparser is just waiting for input. I think node-fetch maybe hung, not handling the decoding correctly in your offending feeds case. If that's the case Feedparser may not be the place for the fix.

danmactough commented 5 years ago

@synzen This doesn't appear to be related to feedparser.

Make the following change to your example script:

- res.body.pipe(feedparser)
+ res.body.pipe(process.stdout)

You'll still see the pathological behavior -- no output, no errors. 🤷‍♂