ekalinin / robots.js

Parser for robots.txt for node.js
MIT License

Always consume response data in order for connection to close. #9

Closed bsander closed 11 years ago

bsander commented 11 years ago

First I'd like to thank you for this nice library, it's saved me quite some time in my project!

I found a small bug though, which I've addressed. Consider the following command-line app built with node:

#!/usr/bin/env node

var RobotsParser = require('robots').RobotsParser;

new RobotsParser('http://localhost/robots.txt', '', function (parser, success) {
    if (success) {
        console.log('Robots.txt parsed');
    } else {
        console.log('Failed to parse robots.txt');
    }
});

Assuming localhost does not have a /robots.txt file and thus sends a 404 header, the parser object will be set to an allowAll state and the specified callback method is invoked. However, if you run this from the command line you'll notice the script doesn't actually exit for a while. This is because the request's response data is never consumed, so the response never emits its 'end' event and the underlying connection stays open.
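To see the hang in isolation, here's a minimal sketch (plain node's http module, not robots.js itself, and a placeholder URL) of a request whose response is never consumed, so the process lingers instead of exiting:

#!/usr/bin/env node

var http = require('http');

// Fetch a robots.txt that doesn't exist; we only care about the status code.
http.get('http://localhost/robots.txt', function (res) {
    console.log('Status: ' + res.statusCode);
    // The body is never read here (no 'data' handler, no res.resume()),
    // so the response never emits 'end' and the script doesn't exit promptly.
});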

According to the node.js docs:

If no 'response' handler is added, then the response will be entirely discarded. However, if you add a 'response' event handler, then you must consume the data from the response object, either by calling response.read() whenever there is a 'readable' event, or by adding a 'data' handler, or by calling the .resume() method. Until the data is consumed, the 'end' event will not fire.

This pull request calls the readable.resume() method in the appropriate places so that the connection can be closed gracefully once it has served its purpose.
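For illustration only (the actual patch touches the fetch code inside robots.js), the fix amounts to draining the response whenever its body isn't needed, along these lines:

var http = require('http');

http.get('http://localhost/robots.txt', function (res) {
    if (res.statusCode === 404) {
        // No robots.txt: the body isn't needed, but it still has to be drained
        // so the response emits 'end' and the connection can close.
        res.resume();
        console.log('No robots.txt found, allowing all');
        return;
    }
    res.setEncoding('utf8');
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        console.log('Fetched robots.txt (' + body.length + ' chars)');
    });
});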

ekalinin commented 11 years ago

First I'd like to thank you for this nice library, it's saved me quite some time in my project!

Cool! I'm very glad it was useful.

I found a small bug though, which I've addressed.

Patched! Thanks!