Closed naorye closed 5 years ago
Hi @naorye. Are you sure that this was your first request? A 429 sounds like you made some prior requests from that IP. Could you also please post the response headers for further investigation?
Hi @codemanki, I am sorry, I wasn't clear. I meant that I tried cloudscraper for the first time and it threw an error. I performed 100 GET requests, the 7th request got Cloudflare's page, and that error was thrown. ... Thinking about it, there is a possibility that previous requests also got Cloudflare's page and it did work.
What causes this exception?
@naorye There was a very slim chance that the Cloudflare page could be returned. I raised that issue in #162 and @codemanki closed/fixed it with #163.
The status code error happens whenever the status code is not in the 2xx range (e.g. 200 OK). If that should not be an error, you may pass the simple: false option to cloudscraper.
cloudscraper.get({ uri, simple: false }).then(console.log);
For more information on the simple option, see request-promise's docs: https://github.com/request/request-promise#migration-from-v2-to-v3
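As a rough sketch of what "not an error" means in practice: with simple: false the promise no longer rejects on non-2xx status codes, so the caller has to inspect the status itself. The classifyResponse helper and its labels below are hypothetical names used for illustration and are not part of cloudscraper or request-promise.

```javascript
// Hypothetical helper: decide what to do with a response once `simple: false`
// stops non-2xx status codes from rejecting the promise.
function classifyResponse(response) {
  const status = response.statusCode;
  if (status >= 200 && status < 300) return 'ok';
  if (status === 429) return 'rate-limited';
  if (status === 503) return 'challenge-or-unavailable';
  return 'error';
}

// Possible usage (not executed here; assumes cloudscraper is installed and
// that `uri` is defined). `resolveWithFullResponse: true` is the
// request-promise option that resolves with the response object.
// const cloudscraper = require('cloudscraper');
// cloudscraper
//   .get({ uri, simple: false, resolveWithFullResponse: true })
//   .then(response => console.log(classifyResponse(response)));
```

Keeping the classification in a plain function means it can be checked without touching the network.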
Here is a gist of the HTML from the status code error above: https://gist.github.com/pro-src/54893b5153bcb9afcd1d3205b1bb3db2
@pro-src Is there a chance that Cloudflare returns a 429 code along with its error page?
@naorye Yes but the challenge is always sent with 503 status so that error is mysterious. I ran the following test for a count of 50 and it didn't error at all. I can't reproduce this on the master branch so maybe we've already fixed it.
set -xe
for num in {1..50}
do
  node test.js
done
// test.js
var cloudscraper = require('.');
cloudscraper.get('http://coinmarketcap.com').then(console.log, error => {
  console.error(error);
  process.exit(1);
});
@naorye What version of cloudscraper are you using? Can you test with the master branch of this repository?
rm -rf node_modules/cloudscraper
npm install --save 'https://github.com/codemanki/cloudscraper'
@pro-src I tried with simple: false, but then when I got Cloudflare's page, it didn't manage to get to the page behind the protection. I am using "cloudscraper": "^3.3.0".
I'll try to install directly from github.
Edit: Installing directly from GitHub with npm gives the same result.
@naorye btw, thanks for reporting this. Can you please provide a minimal test case?
Be sure to remove cloudscraper from node_modules before installing directly from GitHub, and maybe remove the "^3.3.0" from package.json. I still can't reproduce this using the master branch.
I'll try to create a reproduction.
Running the following threw an error on the 15th scrape. It might take longer.
const cloudscraper = require('cloudscraper');
const urls = [
'https://coinmarketcap.com/currencies/bitcoin/',
'https://coinmarketcap.com/currencies/ethereum/',
'https://coinmarketcap.com/currencies/ripple/',
'https://coinmarketcap.com/currencies/litecoin/',
'https://coinmarketcap.com/currencies/eos/',
'https://coinmarketcap.com/currencies/bitcoin-cash/',
'https://coinmarketcap.com/currencies/binance-coin/',
'https://coinmarketcap.com/currencies/tether/',
'https://coinmarketcap.com/currencies/stellar/',
'https://coinmarketcap.com/currencies/cardano/',
'https://coinmarketcap.com/currencies/tron/',
'https://coinmarketcap.com/currencies/bitcoin-sv/',
'https://coinmarketcap.com/currencies/monero/',
'https://coinmarketcap.com/currencies/iota/',
'https://coinmarketcap.com/currencies/dash/',
'https://coinmarketcap.com/currencies/maker/',
'https://coinmarketcap.com/currencies/neo/',
'https://coinmarketcap.com/currencies/ontology/',
'https://coinmarketcap.com/currencies/ethereum-classic/',
'https://coinmarketcap.com/currencies/tezos/',
'https://coinmarketcap.com/currencies/nem/',
'https://coinmarketcap.com/currencies/zcash/',
'https://coinmarketcap.com/currencies/vechain/',
'https://coinmarketcap.com/currencies/basic-attention-token/',
'https://coinmarketcap.com/currencies/waves/',
'https://coinmarketcap.com/currencies/usd-coin/',
'https://coinmarketcap.com/currencies/dogecoin/',
'https://coinmarketcap.com/currencies/omisego/',
'https://coinmarketcap.com/currencies/qtum/',
'https://coinmarketcap.com/currencies/crypto-com-chain/',
'https://coinmarketcap.com/currencies/bitcoin-gold/',
'https://coinmarketcap.com/currencies/trueusd/',
'https://coinmarketcap.com/currencies/decred/',
'https://coinmarketcap.com/currencies/lisk/',
'https://coinmarketcap.com/currencies/0x/',
'https://coinmarketcap.com/currencies/augur/',
'https://coinmarketcap.com/currencies/chainlink/',
'https://coinmarketcap.com/currencies/zilliqa/',
'https://coinmarketcap.com/currencies/bitshares/',
'https://coinmarketcap.com/currencies/maximine-coin/',
'https://coinmarketcap.com/currencies/ravencoin/',
'https://coinmarketcap.com/currencies/icon/',
'https://coinmarketcap.com/currencies/holo/',
'https://coinmarketcap.com/currencies/digibyte/',
'https://coinmarketcap.com/currencies/bytecoin-bcn/',
'https://coinmarketcap.com/currencies/steem/',
'https://coinmarketcap.com/currencies/bittorrent/',
'https://coinmarketcap.com/currencies/nano/',
'https://coinmarketcap.com/currencies/bitcoin-diamond/',
'https://coinmarketcap.com/currencies/enjin-coin/',
'https://coinmarketcap.com/currencies/huobi-token/',
'https://coinmarketcap.com/currencies/paxos-standard-token/',
'https://coinmarketcap.com/currencies/kucoin-shares/',
'https://coinmarketcap.com/currencies/aeternity/',
'https://coinmarketcap.com/currencies/verge/',
'https://coinmarketcap.com/currencies/komodo/',
'https://coinmarketcap.com/currencies/pundi-x/',
'https://coinmarketcap.com/currencies/bytom/',
'https://coinmarketcap.com/currencies/siacoin/',
'https://coinmarketcap.com/currencies/iostoken/',
'https://coinmarketcap.com/currencies/aurora/',
'https://coinmarketcap.com/currencies/theta/',
'https://coinmarketcap.com/currencies/abbc-coin/',
'https://coinmarketcap.com/currencies/stratis/',
'https://coinmarketcap.com/currencies/dai/',
'https://coinmarketcap.com/currencies/insight-chain/',
'https://coinmarketcap.com/currencies/golem-network-tokens/',
'https://coinmarketcap.com/currencies/status/',
'https://coinmarketcap.com/currencies/populous/',
'https://coinmarketcap.com/currencies/ardor/',
'https://coinmarketcap.com/currencies/project-pai/',
'https://coinmarketcap.com/currencies/ark/',
'https://coinmarketcap.com/currencies/revain/',
'https://coinmarketcap.com/currencies/mixin/',
'https://coinmarketcap.com/currencies/cryptonex/',
'https://coinmarketcap.com/currencies/gemini-dollar/',
'https://coinmarketcap.com/currencies/gxchain/',
'https://coinmarketcap.com/currencies/hypercash/',
'https://coinmarketcap.com/currencies/digitex-futures/',
'https://coinmarketcap.com/currencies/factom/',
'https://coinmarketcap.com/currencies/maidsafecoin/',
'https://coinmarketcap.com/currencies/electroneum/',
'https://coinmarketcap.com/currencies/wax/',
'https://coinmarketcap.com/currencies/decentraland/',
'https://coinmarketcap.com/currencies/loom-network/',
'https://coinmarketcap.com/currencies/waltonchain/',
'https://coinmarketcap.com/currencies/crypto-com/',
'https://coinmarketcap.com/currencies/qash/',
'https://coinmarketcap.com/currencies/loopring/',
'https://coinmarketcap.com/currencies/pivx/',
'https://coinmarketcap.com/currencies/zcoin/',
'https://coinmarketcap.com/currencies/aelf/',
'https://coinmarketcap.com/currencies/waykichain/',
'https://coinmarketcap.com/currencies/thorecoin/',
'https://coinmarketcap.com/currencies/qubitica/',
'https://coinmarketcap.com/currencies/moac/',
'https://coinmarketcap.com/currencies/repo/',
'https://coinmarketcap.com/currencies/power-ledger/',
'https://coinmarketcap.com/currencies/kyber-network/',
'https://coinmarketcap.com/currencies/wanchain/',
];
function scrapePage(url) {
  return new Promise((resolve, reject) => {
    cloudscraper.get(url, (err, resp, html) => {
      if (!err) {
        resolve(html);
      } else {
        reject(err);
      }
    });
  });
}
urls.reduce(async (promise, url, index) => {
  await promise;
  await scrapePage(url);
  console.log(index);
}, Promise.resolve());
@naorye I was able to reproduce this and we'll have a fix for this soon.
@codemanki I think we should handle the case of 429 Too Many Requests by respecting the Retry-After header, and partially revert #163 to re-include the isChallengePresent check irrespective of statusCode, but only if isCloudflare === true. This may have been the one and only exception, but to be safe...
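To illustrate what respecting Retry-After could look like on the caller's side, here is a minimal sketch that turns the header value into a delay. It is not the code from the fix discussed here; retryAfterToMs is a hypothetical helper, and it assumes the header holds either a number of seconds or an HTTP-date, the two forms the HTTP specification allows.

```javascript
// Hypothetical helper: convert a Retry-After header value into a delay in
// milliseconds. Returns 0 when the header is missing or cannot be parsed.
function retryAfterToMs(headerValue, now = Date.now()) {
  if (headerValue === undefined || headerValue === null) return 0;
  const text = String(headerValue).trim();

  // Form 1: a non-negative integer number of seconds.
  if (/^\d+$/.test(text)) return Number(text) * 1000;

  // Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(text);
  if (Number.isNaN(date)) return 0;

  // Never return a negative delay if the date is already in the past.
  return Math.max(0, date - now);
}
```

A caller that receives a 429 could wait for retryAfterToMs(response.headers['retry-after']) milliseconds before repeating the request, ideally with a cap on the number of retries.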
As a side note: Anorov/cloudflare-scrape has this bug too
@naorye The fix in #165 successfully scrapes all 100 of them. Can you confirm?
npm install git://github.com/pro-src/cloudscraper.git#164_too_many_requests
Actually I have 2000 :) I'll check it in two hours and will update.
Thanks!
@naorye I just thought to mention that your test case can be simplified by replacing:
function scrapePage(url) {
  return new Promise((resolve, reject) => {
    cloudscraper.get(url, (err, resp, html) => {
      if (!err) {
        resolve(html);
      } else {
        reject(err);
      }
    });
  });
}
with its equivalent:
function scrapePage(url) {
  return cloudscraper.get(url);
}
Edit: And if by chance you need the response object, try the resolveWithFullResponse: true option; more info is in request-promise's docs.
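Since cloudscraper.get already returns a promise, the reduce-based loop in the test case could also be written as a plain async loop. The following is only a sketch: scrapeSequentially is a hypothetical helper, and the get function is passed in as a parameter so the loop can be tried out without network access.

```javascript
// Hypothetical helper: fetch URLs one at a time with any promise-returning
// `get` function (for example cloudscraper.get) and collect the results.
async function scrapeSequentially(urls, get) {
  const pages = [];
  for (const url of urls) {
    // Awaiting inside the loop keeps the requests strictly sequential.
    pages.push(await get(url));
  }
  return pages;
}

// Possible usage (not executed here; assumes cloudscraper is installed):
// const cloudscraper = require('cloudscraper');
// scrapeSequentially(urls, url => cloudscraper.get(url))
//   .then(pages => console.log(pages.length), console.error);
```

The loop stops at the first rejected request, which matches the behavior of the original reduce chain.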
Scraping 2128/2128 [============================================================] 100%
Done successfully! Thanks!!
When do you plan to merge it to master branch?
Awesome! Once the PR is reviewed and if everything is good. (Probably less than 24H until a new NPM release) Thanks for your contribution!!!
You are awesome! 🥇
Thank you guys for taking care of this. I will look through the PR tomorrow morning and will release a new version :)
Done! 3.4.0 has just been published. Thank you @pro-src :)
I am scraping 2000 urls and some of them got Cloudflare's page. I started my scraper and once I got Cloudflare's page, an error occurred:
Any idea what's wrong?