nvaken opened 8 years ago
Bumping for the same issue. I'm crawling a SharePoint site with tons of links.
Same issue for me. It happened when the site being tested added a large (9 MB) video, so in my case I don't think it's the number of resources, but their size.
If there is no fix/workaround, I'm gonna have to stop using the link checker.
The config options are documented here: https://github.com/cgiffard/node-simplecrawler#configuration But none of them let me fix or work around my issue.
Seems like one of these should do it, but I can't get them to work:

crawler.maxResourceSize=16777216 - The maximum resource size that will be downloaded, in bytes. Defaults to 16MB.

I tried maxResourceSize values from 2 MB to 32 MB and saw no difference in behavior. Similarly, downloadUnsupported: false has no effect.
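For reference, here is how I'd expect those options to be applied to a bare simplecrawler instance (a minimal sketch assuming the simplecrawler 1.x constructor; example.com is a placeholder, and this bypasses the grunt wiring entirely):

// Minimal sketch; assumes the simplecrawler 1.x API and a placeholder URL.
var Crawler = require("simplecrawler");

var crawler = new Crawler("http://example.com/");

// Cap downloaded resources at 2 MB instead of the 16 MB default.
crawler.maxResourceSize = 2 * 1024 * 1024;

// Skip downloading bodies whose MIME types the crawler doesn't support.
crawler.downloadUnsupported = false;

crawler.on("fetchcomplete", function (queueItem) {
    console.log("fetched", queueItem.url);
});

crawler.start();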
There doesn't seem to be a config option to ignore certain file types, unless fetch conditions can be used for that. It's not clear whether that's possible.
Probably going to stop using this :(
I was able to fix this by using fetch conditions to ignore the movie that was causing the problem. My Gruntfile (CoffeeScript) looks like this:
linkChecker:
  build:
    site: 'localhost',
    options:
      initialPath: '/site-dir.html'
      maxConcurrency: 20
      initialPort: 8000
      supportedMimeTypes: [/text\/html/, /text\/css/]
      callback: (crawler) =>
        crawler.addFetchCondition((url) =>
          return !url.path.match(/\.mp4$/i)
        )
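Outside of grunt, the same exclusion would look roughly like this on the crawler object itself (a sketch mirroring the callback above, assuming the synchronous addFetchCondition form where the callback receives a parsed URL object):

// Sketch mirroring the grunt callback above; assumes the synchronous
// addFetchCondition form where the callback gets a parsed URL object.
crawler.addFetchCondition(function (url) {
    // Never fetch .mp4 resources, so large videos are skipped entirely.
    return !url.path.match(/\.mp4$/i);
});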
Not sure, as I can't check this as we speak, but I'm pretty sure my original error isn't caused by one big resource. The projects I'm checking shouldn't have resources bigger than, say, ~5 MB, and even that would be an anomaly. Your fix seems to do the job for specific big resources (which is good to have in here! 👍), but it probably won't fix my original issue.
So, that being said, I'm still looking for answers. 😊
Probably because the site that I'm crawling has a pretty high number of resources. Still, I wonder whether this is preventable. Am I overlooking an option here?