Closed hannibal1986 closed 4 years ago
Hi,
Yes, Arenavision just put the site behind Cloudflare, and now it detects non-human interactions with the site. I have to research how to bypass this. The easiest (and costliest) way would be scraping it with Puppeteer or something similar, but that consumes a lot of resources because it launches a Google Chrome instance on your machine (and I run this on a $5 VPS, so that's not an option).
This weekend I will look into what can be done; it looks like Arenavision is protecting itself a little bit from us!
And thank you for your comment!
Hi again,
I tested the Puppeteer approach today and it works, but Puppeteer launches a whole instance of Google Chrome to do its job, so doing the exact same thing takes a while.
So, I think I will modify the library and publish the changes, but I am also thinking of doing the information extraction myself and providing an API with the already extracted data for everyone. That way anyone can query it from any language, and I already store the information in a DB, so maybe that's the solution.
Anyway, we will see this weekend.
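For reference, the Puppeteer approach mentioned above could be sketched roughly as follows. This is only an illustration: the helper name `fetchRenderedHtml` is hypothetical and not part of the library, `puppeteer` has to be installed separately, and nothing here guarantees that the Cloudflare check is passed.

```javascript
// Hypothetical helper: load a page in headless Chrome and return its HTML.
// Assumes `npm install puppeteer`; the module is required lazily so that
// merely loading this file does not need a browser.
async function fetchRenderedHtml(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(); // starts a full Chrome instance
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before reading the DOM.
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // serialized HTML after scripts have run
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

Starting and closing Chrome on every call is what makes this slow and memory-hungry on a small VPS.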
thanks, great job
Today the guide is back to how it was in the past, with no img 🤣
Yeah, maybe they figured we were going to get the info anyway, so they just switched to Cloudflare to protect themselves from this.
At least it is fun, researching how to bypass all the protections they put in place.
I have uploaded a new version after the holidays.
Now it works, but I think it depends on your IP. It works on my laptop, but on neither my server nor TravisCI. Maybe they have banned some source IPs, I don't know. I will try to deploy it on a Now.sh server, just to see if I can access the site from there.
@hannibal1986, could you install the new version and run `npm run test` locally, just to check whether it works anywhere other than my laptop? The new version is 1.0.31.
Thank you.
The new version works fine, thanks for all this work
It worked fine for 2 days, and today this:
UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'attribs' of undefined
    at fetch.then.then.res (arenavision-scraper/src/getGuide.js:132:33)
    at processTicksAndRejections (internal/process/task_queues.js:86:5)
(node:29456) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:29456) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Yep, just fixed in version 1.0.32.
How long will it last? I don't know.
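The error above comes from reading `attribs` on an element that was not found, which is what tends to happen when the page markup changes. A minimal sketch of a defensive guard, assuming cheerio-style nodes (objects carrying an `attribs` map), could look like this. The helper name `getAttr` is hypothetical, and this is not necessarily the fix shipped in 1.0.32.

```javascript
// Hypothetical helper: read an attribute from a cheerio-style node,
// returning a fallback instead of throwing when the node is missing.
function getAttr(node, name, fallback = null) {
  if (!node || !node.attribs) return fallback;
  const value = node.attribs[name];
  return value === undefined ? fallback : value;
}

// Usage with plain objects standing in for parsed nodes.
const found = { attribs: { href: '/guide' } };
const missing = undefined; // what an empty selection yields at index 0
console.log(getAttr(found, 'href'));   // prints /guide
console.log(getAttr(missing, 'href')); // prints null
```

With a guard like this the scraper can skip or report a missing element instead of ending in an unhandled promise rejection.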
Thanks very much for everything
I have uploaded a new version, 1.0.33, which just removes everything regarding the image-processing library, as it was causing some problems for the library to be deployed in Firebase Functions.
With that version of the library I have scheduled a function in Firebase Functions to extract everything, and it is working. It takes 5 minutes or so because there is only 256 MB of RAM available in a Firebase Function, but it does the trick and removes the need for a server and, maybe, they don't block Google IPs.
Is it possible to run arenavision-scraper with a User-Agent so as not to be banned by Cloudflare?
That was the change I introduced, with the Referer in the header, to make it work, but I can try, of course. I have been banned on 3 different servers to date.
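As an illustration of the header approach discussed here, a request with explicit User-Agent and Referer headers could be set up along these lines. The helper name `buildRequestOptions` and the example values are assumptions, not the library's actual code, and there is no guarantee that any particular values avoid a ban.

```javascript
// Hypothetical helper: build request options with explicit User-Agent and
// Referer headers. The values passed in are placeholders, not known-good ones.
function buildRequestOptions(userAgent, referer) {
  return {
    headers: {
      'User-Agent': userAgent,
      'Referer': referer,
    },
  };
}

// Usage: pass the result as the second argument to a fetch-compatible
// client such as node-fetch, e.g. fetch(url, options).
const options = buildRequestOptions('my-agent/1.0', 'https://example.com/');
console.log(options.headers['User-Agent']); // prints my-agent/1.0
```

Keeping the headers in one place makes it easy to try different values without touching the scraping code.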
I cannot test removing the User-Agent anywhere, because every place where I have it deployed has been banned, so I think I will give up for now.
I will disable every cron job and, if I can fix it within a month, I will, but after two years struggling with the Arenavision guys (who are doing the right thing) I am giving up for the moment.
Thanks, thank you very much for this great job
Closing this issue as the library is going to be deprecated. Please refer to #6
Thanks for your work