Open lanzer opened 9 years ago
lots of good stuff in here! thanks
The fix was actually for thresher and not quickscrape. I pushed the changes and it seems to have merged them with my last pull request for another bug. I'm a total noob, so I might have gotten the procedure wrong. Please let me know if I need to make any changes on my end.
Thanks for this @lanzer and sorry for the slow reply - I've been away at various events. I will be incorporating these fixes in new releases in the next few days.
I'm going to take over having a look at this in the next few days; I also wrote a patch to fix this because I didn't realise there had been one in the pipeline for a while.
When a status code other than "200 OK" is received, the process halts. This can be caused by a "404 Not Found" or by a server-side problem such as exceeded bandwidth or a permission error. It's a problem for me because I am working with a big list of URLs, some of which are potentially outdated.
I noticed that the basic renderer (there is a headless renderer, but it isn't called even with the -h parameter) doesn't listen for status codes other than 200:
basic.js (14)
Also, scraper.js does not have a listener for abnormal status codes:
scraper.js (252)
I've added a few lines to make things work for me:
basic.js (14)
scraper.js (252)
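As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch of checking the status code and carrying on past failures instead of halting. It is not the thresher code: `checkStatus`, `processResponses`, and the `rendered` / `scrapeError` event names are hypothetical, and the responses are hard-coded rather than fetched.

```javascript
// Hypothetical sketch: none of these names come from thresher or quickscrape.
const { EventEmitter } = require('events');

// Classify a response by its status code instead of assuming 200.
function checkStatus(url, statusCode) {
  if (statusCode === 200) {
    return { ok: true, url };
  }
  return {
    ok: false,
    url,
    error: new Error(`Unexpected status ${statusCode} for ${url}`),
  };
}

// Walk a list of { url, statusCode } results, emitting one event per URL
// and continuing past failures rather than stopping at the first one.
function processResponses(responses, emitter) {
  const summary = { succeeded: 0, failed: 0 };
  for (const { url, statusCode } of responses) {
    const result = checkStatus(url, statusCode);
    if (result.ok) {
      summary.succeeded += 1;
      emitter.emit('rendered', url);
    } else {
      summary.failed += 1;
      emitter.emit('scrapeError', result.error);
    }
  }
  return summary;
}

const emitter = new EventEmitter();
const errors = [];
// A listener for the abnormal-status event collects messages for reporting.
emitter.on('scrapeError', (err) => errors.push(err.message));

// Assumed sample data: one good URL and two failing ones.
const summary = processResponses(
  [
    { url: 'http://example.org/a', statusCode: 200 },
    { url: 'http://example.org/missing', statusCode: 404 },
    { url: 'http://example.org/busy', statusCode: 509 },
  ],
  emitter
);
console.log(summary, errors);
```

Using a dedicated event name rather than `error` is a deliberate choice in this sketch: Node's `EventEmitter` throws if `error` is emitted with no listener attached, which would reproduce the halting behaviour.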
Quickscrape does not read the result as an error and reports "0/0 elements captured (0 capture failed)", when it should read "0/2 elements" or whatever number is configured in the JSON. I haven't looked into how reporting is handled.
For the time being, I noticed something in thresher.js:
thresher.js (75)
That should probably be a comparison operator.
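For readers unfamiliar with this class of bug, the sketch below shows how an assignment used where a comparison was intended makes a condition always pass. It is a generic, hypothetical illustration and does not reproduce the actual line in thresher.js; `isOkBuggy` and `isOkFixed` are made-up names.

```javascript
// Hypothetical illustration; not the actual line from thresher.js.
function isOkBuggy(statusCode) {
  let code = statusCode;
  // Bug: `=` assigns 200 to `code`, and the assigned value (200) is truthy,
  // so this branch runs for every input.
  if (code = 200) {
    return true;
  }
  return false;
}

function isOkFixed(statusCode) {
  // `===` compares without assigning or coercing types.
  return statusCode === 200;
}

console.log(isOkBuggy(404)); // true
console.log(isOkFixed(404)); // false
```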
Hope this helps!