danmactough / node-feedparser

Robust RSS, Atom, and RDF feed parsing in Node.js

Parser crashes out (SAX error?) on this valid RSS feed #215

Closed dsl101 closed 7 years ago

dsl101 commented 7 years ago

Simple test case (I'm using feedparser-promised as a wrapper, but that shouldn't matter):

FeedParser.parse({
  uri: 'http://www.arctic-council.org/index.php/en/?format=feed&type=rss',
  timeout: 5000,
  addmeta: false,
  dateAsString: true
}).then(() => {
  console.log('OK')
}).catch(e => {
  console.log('error:', e)
})

returns this rather odd error:

error: Error: Unexpected end
Line: 7
Column: 173
Char:
    at error (/node/lambda/feedfilter/node_modules/sax/lib/sax.js:667:10)
    at end (/node/lambda/feedfilter/node_modules/sax/lib/sax.js:678:7)
    at Object.end (/node/lambda/feedfilter/node_modules/sax/lib/sax.js:154:24)
    at SAXStream.end (/node/lambda/feedfilter/node_modules/sax/lib/sax.js:248:18)
    at FeedParser._flush (/node/lambda/feedfilter/node_modules/feedparser/lib/feedparser/index.js:1087:17)
    at FeedParser.<anonymous> (/node/lambda/feedfilter/node_modules/feedparser/node_modules/readable-stream/lib/_stream_transform.js:115:49)
    at FeedParser.g (events.js:291:16)
    at emitNone (events.js:86:13)
    at FeedParser.emit (events.js:185:7)
    at prefinish (/node/lambda/feedfilter/node_modules/feedparser/node_modules/readable-stream/lib/_stream_writable.js:494:12)

That feed passes the W3C validator, so I'm not sure what the problem is.

danmactough commented 7 years ago

@dsl101 I don't know what the feedparser-promised wrapper is doing, but it is not unzipping the response. When I curl that url, I can see that it is being served gzipped. It needs to be unzipped.

curl -i "http://www.arctic-council.org/index.php/en/?format=feed&type=rss"

HTTP/1.1 200 OK
Server: nginx
Date: Fri, 28 Apr 2017 18:30:09 GMT
Content-Type: application/rss+xml; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Encoding: gzip
Expires: Wed, 17 Aug 2005 00:00:00 GMT
ETag: http://www.arctic-council.org/index.php/en/?format=feed&type=rss
Cache-Control: no-cache
Pragma: no-cache
Set-Cookie: 9b25b2ceec56fcde5bd7ce6373ce6f76=bp2mueorpmsfdn3kmvucgkqof4; path=/; HttpOnly
Last-Modified: Fri, 28 Apr 2017 18:27:09 GMT
Host-Header: 192fc2e7e50945beb8231a492d6a8024
X-Proxy-Cache: MISS

# ...
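
The same thing can be confirmed from Node without curl by looking at the first two bytes of the body: a gzip stream starts with the magic number 0x1f 0x8b, whereas an XML feed normally starts with `<`. A minimal sketch, using only Node's built-in zlib; `looksGzipped` is a hypothetical helper name and not part of feedparser or the wrapper:

```javascript
const zlib = require('zlib');

// Hypothetical helper: true if the buffer starts with the gzip magic number (0x1f 0x8b).
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Stand-in data instead of a real HTTP response body.
const xml = Buffer.from('<?xml version="1.0"?><rss version="2.0"></rss>');
const gzipped = zlib.gzipSync(xml);

console.log(looksGzipped(xml));     // false
console.log(looksGzipped(gzipped)); // true
```

If the bytes handed to the parser look gzipped, the XML parser is being fed compressed data, which would be consistent with the SAX "Unexpected end" error above.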

dsl101 commented 7 years ago

All the wrapper does is wrap the callback in a promise. It passes the URL on to feedparser unchanged.

Does that mean feedparser doesn't handle zipped feeds?

danmactough commented 7 years ago

> All the wrapper does is wrap the callback in a promise. It passes the URL on to feedparser unchanged.

feedparser only parses feeds. It doesn't do anything else. So, that wrapper must be fetching the URL. Before you pass the data to feedparser, you just need to unzip it. You should be able to adapt the example: https://github.com/danmactough/node-feedparser/blob/master/examples/compressed.js
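
For illustration, the unzip step might look roughly like the sketch below. It uses only Node's built-in zlib and stream modules; `decodeBody` is a hypothetical helper name, and the "response" is a fake in-memory stream rather than a real HTTP request, so none of this is feedparser's or the wrapper's own API.

```javascript
const zlib = require('zlib');
const { PassThrough } = require('stream');

// Hypothetical helper: given a response body stream and the value of its
// Content-Encoding header, return a stream of decoded bytes.
function decodeBody(body, contentEncoding) {
  const encoding = (contentEncoding || '').trim().toLowerCase();
  if (encoding === 'gzip' || encoding === 'x-gzip') {
    return body.pipe(zlib.createGunzip());
  }
  if (encoding === 'deflate') {
    return body.pipe(zlib.createInflate()); // assumes zlib-wrapped deflate
  }
  return body; // no (or unknown) encoding: hand the stream back untouched
}

// Demo with a fake gzipped "response" instead of a real HTTP request.
const fakeResponse = new PassThrough();
fakeResponse.end(
  zlib.gzipSync('<rss version="2.0"><channel><title>demo</title></channel></rss>')
);

const chunks = [];
decodeBody(fakeResponse, 'gzip')
  .on('data', (chunk) => chunks.push(chunk))
  .on('end', () => console.log(Buffer.concat(chunks).toString('utf8')));
```

With a real HTTP response `res`, the decoded stream would then be piped into a FeedParser instance, along the lines of `decodeBody(res, res.headers['content-encoding']).pipe(feedparser)`. Note that `pipe` does not forward errors, so error handlers are needed on each stream.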

dsl101 commented 7 years ago

Ah, ok. I've not looked under the hood. Will have a proper look next week.

Many thanks,