evanderkoogh / node-sitemap-stream-parser

A streaming parser for sitemap files. It can handle deeply nested sitemaps containing 100+ million URLs.
Apache License 2.0

Parser does not finish for booking.com #10

Closed YarnSeemannsgarn closed 6 years ago

YarnSeemannsgarn commented 6 years ago

I tried the parser for booking.com with the following code:

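var sitemaps = require('sitemap-stream-parser');

// Collect the sitemap URLs listed in robots.txt, then parse each of them.
sitemaps.sitemapsInRobots('http://booking.com/robots.txt', function(err, urls) {
    if (err || !urls || urls.length === 0)
        return;
    // console.log runs for every URL found; the final callback runs once at the end.
    sitemaps.parseSitemaps(urls, console.log, function(err, parsedSitemaps) {
        console.log(parsedSitemaps);
    });
});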

The parser runs for a while, but then stops with the following error:

internal/streams/legacy.js:57
throw er; // Unhandled stream error in pipe.
^

Error: ESOCKETTIMEDOUT
at ClientRequest.<anonymous> (.../node_modules/request/request.js:812:19)
at Object.onceWrapper (events.js:275:13)
at ClientRequest.emit (events.js:182:13)
at TLSSocket.emitTimeout (_http_client.js:694:34)
at Object.onceWrapper (events.js:275:13)
at TLSSocket.emit (events.js:182:13)
at TLSSocket.Socket._onTimeout (net.js:447:8)
at ontimeout (timers.js:427:11)
at tryOnTimeout (timers.js:289:5)
at listOnTimeout (timers.js:252:5)

knoxcard commented 6 years ago

I think this is the issue.

https://github.com/request/request/issues/2047

knoxcard commented 6 years ago

https://github.com/evanderkoogh/node-sitemap-stream-parser/pull/12

YarnSeemannsgarn commented 6 years ago

Actually, my code worked after a few tries. The error appears whenever the internet connection is interrupted. It cannot be caught from my code and needs to be handled in this repository:

https://github.com/evanderkoogh/node-sitemap-stream-parser/blob/33ba4d9d958783e6f4598ab64e6ad0644da3d22f/index.coffee#L17

evanderkoogh commented 6 years ago

That is a good point, @YarnSeemannsgarn. I was only listening for errors on the parser, not on the network stream. I just pushed a fix for this. Network errors are now passed to the regular callback.
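For readers wondering why the error was uncatchable: `.pipe()` does not forward errors from the source stream to the destination, so a listener on the parser alone never sees a network failure. The sketch below illustrates the general pattern using only Node's built-in `stream` module. It is not the code from the actual fix; `pipeWithErrors`, `fakeNetwork`, and `fakeParser` are hypothetical names standing in for the request stream and the XML parser.

```javascript
const { PassThrough, Writable } = require('stream');

// Hypothetical helper: pipes `source` (standing in for the network response)
// into `sink` (standing in for the XML parser) and reports the first error
// from either stream through a single Node-style callback.
function pipeWithErrors(source, sink, done) {
  let finished = false;
  const finish = (err) => {
    if (finished) return; // make sure `done` is called at most once
    finished = true;
    done(err || null);
  };

  // .pipe() does not propagate source errors to the sink, so each stream
  // needs its own 'error' listener.
  source.on('error', finish);
  sink.on('error', finish);
  sink.on('finish', () => finish());

  source.pipe(sink);
}

// Simulate a connection that drops mid-transfer.
const fakeNetwork = new PassThrough();
const fakeParser = new Writable({
  write(chunk, encoding, callback) { callback(); } // discard the data
});

pipeWithErrors(fakeNetwork, fakeParser, (err) => {
  console.log(err ? 'failed: ' + err.message : 'done');
});

fakeNetwork.write('<urlset>');
fakeNetwork.destroy(new Error('ESOCKETTIMEDOUT'));
```

Without the `source.on('error', ...)` line, the simulated timeout would surface as an unhandled stream error and crash the process, much like the stack trace in the original report.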

Thanks for reporting this!

https://github.com/evanderkoogh/node-sitemap-stream-parser/tree/7d1a0491ad2d7f48bc5a1c5c84e1bd44f175e262

Just published version 1.4.0 to npm.

knoxcard commented 6 years ago

"In my case I was able to workaround the problem by setting the Connection: keep-alive header." https://github.com/request/request/issues/2047
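As a rough illustration of the quoted workaround, the sketch below sends a `Connection: keep-alive` header and turns an idle socket into a regular callback error. It uses Node's built-in `https` module rather than the `request` library discussed in the linked issue, and `keepAliveOptions`, `fetchSitemap`, and the 30-second timeout are assumptions made for this example, not part of this package.

```javascript
const https = require('https');

// Hypothetical helper: builds options for https.get() that include the
// Connection: keep-alive header from the quoted workaround.
function keepAliveOptions(sitemapUrl, timeoutMs) {
  const { hostname, pathname, search } = new URL(sitemapUrl);
  return {
    hostname,
    path: pathname + search,
    headers: { Connection: 'keep-alive' },
    timeout: timeoutMs // socket idle timeout in milliseconds
  };
}

// Hypothetical helper: fetches the URL and passes the body, or the first
// network error, to `done` exactly once.
function fetchSitemap(sitemapUrl, done) {
  let called = false;
  const once = (err, body) => {
    if (called) return;
    called = true;
    done(err, body);
  };

  const req = https.get(keepAliveOptions(sitemapUrl, 30000), (res) => {
    const chunks = [];
    res.on('data', (chunk) => chunks.push(chunk));
    res.on('end', () => once(null, Buffer.concat(chunks).toString('utf8')));
    res.on('error', once);
  });

  // 'timeout' only signals an idle socket; the request has to be destroyed
  // explicitly, which then triggers the 'error' listener below.
  req.on('timeout', () => req.destroy(new Error('ESOCKETTIMEDOUT')));
  req.on('error', once);
}

// Example call (performs a real network request, so it is left commented out):
// fetchSitemap('https://example.com/sitemap.xml', (err, xml) => {
//   if (err) return console.error('network error:', err.message);
//   console.log(xml.length + ' characters received');
// });
```

Whether the header actually avoids the timeout depends on the server and the Node version, so treat it as something to try rather than a guaranteed fix.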

There is some more good stuff!

knoxcard commented 6 years ago

I see that header is already integrated... nice.