amoilanen / js-crawler

Web crawler for Node.JS
MIT License

What exactly is the content when the requested resource is binary, e.g., images or PDF files? #18

Closed: tibetty closed this issue 8 years ago

tibetty commented 8 years ago

When I try to write the content into a file directly using a snippet like the one below:

fs.writeFileSync(filepath, page.content, 'binary');

it always comes out as a corrupted file. Can you help look into this issue?

Many thanks!
tibetty
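
For context, a minimal sketch of how this call is presumably being used, assuming js-crawler's configure/crawl API; the start URL and the file-naming scheme here are only illustrative and not taken from the report:

var Crawler = require('js-crawler');
var fs = require('fs');

// crawl pages and dump each crawled page's content straight to disk
new Crawler().configure({depth: 2})
  .crawl('http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf', function onSuccess(page) {
    var filepath = encodeURIComponent(page.url) + '.bin'; // illustrative file name
    fs.writeFileSync(filepath, page.content, 'binary');
  });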

tibetty commented 8 years ago

There might be something wrong with the binary resource handling. You can find this out easily by crawling the following URL: http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf. The actual (and correct) size of this PDF is 4,948,741 bytes, while the length of page.content is only 4,761,100 bytes.

amoilanen commented 8 years ago

Hi, thank you for reporting the issue. From the description it looks like the response is not fully read or the stream is closed prematurely. It might be an issue with the request module or with how it is used; I will take a look.

amoilanen commented 8 years ago

I created a small local test environment where I am trying to reproduce this problem. For some reason the content length I get for the PDF file differs between the browser (Firefox) and the crawler using the request module, which is probably why the file contents are corrupted:

Firefox: Content-Length = 4948741
request: content length = 4761100

To be further investigated.

amoilanen commented 8 years ago

The following snippet, which uses the request module in the same way as js-crawler does, reproduces the problem:

var request = require('request');
var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';

request({url: url}, function(error, response, body) {
  if (!error && response.statusCode == 200) {
    // with no encoding option, request decodes the body as a UTF-8 string
    console.log('body.length = ', body.length);
  } else {
    console.log('error = ', error);
  }
});

This looks like a bug in the request module. What is quite strange is that the returned content length is always the same: 4,761,100 bytes, while in reality it should be 4,948,741 bytes.

Maybe there is some alternative API that can be used. I will investigate this issue more; there might be a workaround, and we may also report a bug against request.
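
As a cross-check (not part of the original investigation), one could fetch the same URL with Node's built-in http module and count raw bytes; if the full Content-Length arrives, the bytes are being lost during decoding rather than on the wire:

var http = require('http');

var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';

http.get(url, function(response) {
  var received = 0;
  console.log('Content-Length =', response.headers['content-length']);
  response.on('data', function(chunk) {
    received += chunk.length; // chunk is a Buffer, so length counts bytes
  });
  response.on('end', function() {
    console.log('received bytes =', received);
  });
});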

tibetty commented 8 years ago

Thanks for your timely response, @antivanov! I did some searching on the internet, and it seems that request fetches resources using UTF-8 encoding and streams everything received into body as a UTF-8 string; a binary resource, however, simply ignores the encoding meta and gives back binary data directly. So there might be a workaround here:

  1. intercept the "response" event and separate binary resources from text ones;
  2. push binary resources back to a binary crawling queue and handle them in a different way (use a Buffer rather than a string to store the data, concatenating the chunks piece by piece), etc.

Let me see if I can do something for you.

Regards,
tibetty
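
A minimal sketch of the workaround idea described in this comment, assuming the request module's streaming interface (the URL is the one from the report; this is not the actual crawler change):

var request = require('request');

var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';
var chunks = [];

request(url)
  .on('response', function(response) {
    // the Content-Type header could be inspected here to separate binary
    // resources (application/pdf, image/*, ...) from text/html ones
    console.log('content-type =', response.headers['content-type']);
  })
  .on('data', function(chunk) {
    chunks.push(chunk); // raw Buffer chunks, no string decoding applied
  })
  .on('end', function() {
    var body = Buffer.concat(chunks);
    console.log('body.length =', body.length); // should match Content-Length
  });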

tibetty commented 8 years ago

Had this issue addressed by making some minor changes, including:

  1. when requesting any URL, add encoding = null to the options, so all received content will be a Buffer;
  2. before passing the body on to further link analysis, ensure it is HTML and stringify it with the correct encoding.

Please kindly let me know how to share the revised crawler.js with you. BTW, right now the regexp-based link analysis is agile but a bit too simple, which might lead to the errors below:

  1. an a-href within comments will be taken as a valid URL;
  2. other hrefs like CSS/JS may also need to be crawled; especially for JS files, further link analysis may be needed as well.

I like the regexp solution, but you might enhance it a bit in a future release.
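
A minimal sketch of the encoding = null idea described above (not the actual patch submitted in the PR): with encoding: null the request module hands back the body as a Buffer, which is only stringified when the response is HTML:

var request = require('request');

var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';

request({url: url, encoding: null}, function(error, response, body) {
  if (error || response.statusCode !== 200) {
    return console.log('error = ', error);
  }
  console.log('body.length = ', body.length); // full byte count of the Buffer

  var contentType = response.headers['content-type'] || '';
  if (contentType.indexOf('text/html') >= 0) {
    var html = body.toString('utf8'); // stringify only HTML before link analysis
    console.log('html length = ', html.length);
  }
});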

amoilanen commented 8 years ago

> Had this issue addressed by making some minor changes, including: when requesting any URL, add encoding = null to the options, so all received content will be a Buffer; before passing the body on to further link analysis, ensure it is HTML and stringify it with the correct encoding. Please kindly let me know how to share the revised crawler.js with you.

Sounds like this can be a proper solution. Can you please create a PR? I will then quickly review and merge it to the main branch.

> BTW, right now the regexp-based link analysis is agile but a bit too simple, which might lead to the errors below:
>
>   1. an a-href within comments will be taken as a valid URL;
>   2. other hrefs like CSS/JS may also need to be crawled; especially for JS files, further link analysis may be needed as well.

If there is an idea of how to improve the regex handling, that can be a separate PR to be merged to the main branch. The scenarios that you describe make sense.

In the coming weeks I plan to work more on the crawler, add tests, refactor it a bit, and publish an updated version after that. Any feedback and improvements can be included.
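
An illustrative sketch (not the crawler's actual implementation) of the first scenario mentioned above: strip HTML comments before running the href regexp, so that links inside <!-- ... --> are not treated as crawlable URLs:

function extractHrefs(html) {
  // drop HTML comments, then collect href values from anchor tags
  var withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
  var hrefRegex = /<a[^>]+href=["']([^"']+)["']/gi;
  var urls = [];
  var match;
  while ((match = hrefRegex.exec(withoutComments)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

console.log(extractHrefs('<!-- <a href="http://skip.me"></a> --><a href="http://keep.me">x</a>'));
// [ 'http://keep.me' ]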

tibetty commented 8 years ago

New PR created; please spare some time to review the changes I made to crawler.js. Actually, I have already given one improvement suggestion above.

amoilanen commented 8 years ago

Closing the issue as the corresponding PR has been merged: https://github.com/antivanov/js-crawler/pull/19. @tibetty, thank you for the contribution!