There might be something wrong with the binary resource handling. You can find out easily by crawling the following url: http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf. The actual (and correct) size of this pdf is 4,948,741 bytes, while the length of page.content is only 4,761,100 bytes.
Hi, thank you for reporting the issue. From the description it looks like the response is not fully read or the stream is closed prematurely; it might be an issue with the request module or with how it is used. I will take a look.
I created a small local test environment where I am trying to reproduce this problem, and for some reason the content length I get for the PDF file differs between the browser (Firefox) and the crawler using the request module, which is probably why the file contents are corrupted:

Firefox: Content-Length = 4948741
request: content length = 4761100

To be further investigated.
The following snippet, which uses the request module in the same way as js-crawler does, reproduces the problem:
var request = require('request');
var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';
request({url: url}, function(error, response, body) {
  if (!error && response.statusCode == 200) {
    // body arrives as a utf-8 decoded string, so its length does not match the file size on disk
    console.log('body.length = ', body.length);
  } else {
    console.log('error = ', error);
  }
});
This looks like a bug in the request module. What is quite strange is that the returned content length is always the same, 4,761,100 bytes, while in reality it should be 4,948,741 bytes. Maybe there is some alternative API that can be used. I will try to investigate this issue more; there might be some workaround, and we may also report a bug against the request module.
Thanks for your timely response, @antivanov! I did some searching around the web, and it seems that request fetches resources using utf-8 encoding and streams everything it receives into body as a utf-8 string; binary resources, however, should ignore the encoding metadata and be given back as raw binary data. So there might be a workaround here:
1. intercept the "response" event and separate binary resources from text ones;
2. push binary resources back to a binary crawling queue and handle them in a different way (use a Buffer rather than a string to store the data, concatenating the chunks piece by piece), etc.

Let me see if I can do something for you (a rough sketch of this idea is below). Regards, tibetty
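For what it's worth, a minimal sketch of the workaround idea above, using the request module's streaming interface (request(url) returns a stream that emits 'response', 'data', and 'end'); the content-type check stands in for the binary/text separation step and is not js-crawler's actual code:

var request = require('request');
var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';
var chunks = [];
request(url)
  .on('response', function(response) {
    // the content-type header can be used to separate binary resources from text ones
    console.log('content-type = ', response.headers['content-type']);
  })
  .on('data', function(chunk) {
    // each chunk arrives as a raw Buffer when no encoding is set on the stream
    chunks.push(chunk);
  })
  .on('end', function() {
    var body = Buffer.concat(chunks);
    // body.length now matches the true file size
    console.log('body.length = ', body.length);
  });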
Had this issue addressed by making some minor changes, including:
- when requesting any url, add encoding = null to the options, so all received content will be a Buffer;
- before passing the body on to further link analysis, ensure it is html and stringify it with the correct encoding (see the sketch after this comment).

Please kindly let me know how to share the revised crawler.js with you. BTW, right now the regexp-based link analysis is agile but a bit too simple, which might lead to the errors below:
- an a-href within comments will be taken as a valid url;
- other hrefs like css/js may also need to be crawled; esp. for js files, further link analysis may be needed as well.

I like the regexp solution, but you might enhance it a bit in a future release.
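A minimal sketch of the change described above (not the exact crawler.js diff), showing the encoding = null option and the html check before stringifying:

var request = require('request');
var url = 'http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf';
request({url: url, encoding: null}, function(error, response, body) {
  if (error || response.statusCode != 200) {
    console.log('error = ', error);
    return;
  }
  // with encoding = null the body is a Buffer, so its length matches the true file size
  console.log('body.length = ', body.length);
  var contentType = response.headers['content-type'] || '';
  if (contentType.indexOf('text/html') >= 0) {
    // only html is stringified for link analysis; utf-8 is assumed here,
    // the real change would respect the encoding declared by the response
    var html = body.toString('utf8');
  }
});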
> Had this issue addressed by making some minor changes, including: when requesting any url, add encoding = null to the options, so all received content will be a Buffer; before passing the body on to further link analysis, ensure it is html and stringify it with the correct encoding. Please kindly let me know how to share the revised crawler.js with you.
Sounds like this can be a proper solution. Can you please create a PR? I will then quickly review and merge it into the main branch.
> BTW, right now the regexp-based link analysis is agile but a bit too simple, which might lead to the errors below:
> - an a-href within comments will be taken as a valid url;
> - other hrefs like css/js may also need to be crawled; esp. for js files, further link analysis may be needed as well.
If there is an idea of how to improve the regex handling, that can be a separate PR to be merged into the main branch. The scenarios that you describe make sense.
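As one possible starting point for the comments scenario (a sketch only, not js-crawler's actual regex), html comments could be stripped before the href extraction runs:

var html = '<a href="a.html">ok</a><!-- <a href="b.html">hidden</a> -->';
// remove html comments first so links inside them are not picked up
var withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
var hrefRegex = /<a[^>]+href=["']([^"']+)["']/gi;
var links = [];
var match;
while ((match = hrefRegex.exec(withoutComments)) !== null) {
  links.push(match[1]);
}
console.log(links); // prints [ 'a.html' ]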
In the coming weeks I plan to work more on the crawler, add tests, refactor it a bit, and publish an updated version after that. Any feedback and improvements can be included.
New PR created, please spare some time to review the changes I made to crawler.js. Actually, I already gave one improvement suggestion above:
Closing the issue as the corresponding PR has been merged: https://github.com/antivanov/js-crawler/pull/19. @tibetty, thank you for the contribution!
When I try to write the content into a file directly using a code snippet like the one below:

fs.writeFileSync(filepath, page.content, 'binary');

it always comes out as a corrupted file. Can you help to look into this issue?
Many thanks! tibetty
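For completeness, a sketch of the intended end-to-end usage once page.content is a Buffer after the fix; the js-crawler API shape here is assumed from its README, and the output filename is hypothetical:

var fs = require('fs');
var Crawler = require('js-crawler');

new Crawler().configure({depth: 1})
  .crawl('http://www.jnto.go.jp/eng/pdf/regional/tokyo/taito_ku.pdf', function onSuccess(page) {
    // page.content is a Buffer, so no 'binary' encoding argument is needed
    fs.writeFileSync('taito_ku.pdf', page.content);
  });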