PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.87k stars, 217 forks

[DISCUSSION] Captcha #142

Closed by PaulMcInnis 3 years ago

PaulMcInnis commented 3 years ago

Hey everyone,

It seems that Indeed and other sites have caught on to scraping and have taken action to stop it.

We can integrate web-driven scraping, but this is not easily automated or tested.

I think this may be a serious problem for this tool in general. The regexes we have built still work, but captcha is catching the scrapers very easily, after fewer than a hundred jobs or so.

Does anyone have any ideas to help with this issue?

PaulMcInnis commented 3 years ago

One option is to go the route of a web-driven scraper. Perhaps this tool could be made into some kind of browser extension?

PaulMcInnis commented 3 years ago

Another option is to forgo scraping detailed job information entirely, but this will significantly degrade the matching and data quality.

Nllii commented 3 years ago

I tried using this code from geohot a couple of years back, but I never got it to work. It's not practical code, just a doodle.

https://github.com/geohot/lolrecaptcha

https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

PaulMcInnis commented 3 years ago

Well, one aspect of this is that I don't want to automate the captcha dodging, since I think that is ethically dubious, but I think we may have other options for the workflow.

One data point that I'm having a bit of trouble collecting is how many jobs, on average, one can scrape before getting captcha'd (None error on detail scrape).
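
For anyone who wants to report that number, here is a rough sketch of how it could be counted. It assumes a detail scrape returns None once the captcha kicks in, as described above. `count_until_blocked` and `fake_scrape_detail` are made-up names for illustration, not JobFunnel functions.

```python
from typing import Callable, Iterable, Optional


def count_until_blocked(
    urls: Iterable[str],
    scrape_detail: Callable[[str], Optional[str]],
) -> int:
    """Count detail pages scraped before the first None result.

    A None result is treated as "we got captcha'd" (an assumption
    based on the None error mentioned in this thread).
    """
    scraped = 0
    for url in urls:
        if scrape_detail(url) is None:
            break  # stop at the first blocked request
        scraped += 1
    return scraped


def fake_scrape_detail(url: str) -> Optional[str]:
    """Hypothetical stand-in for a real detail scrape, for demonstration only."""
    return None if "captcha" in url else f"details for {url}"


if __name__ == "__main__":
    demo_urls = ["job/1", "job/2", "job/3", "captcha/4", "job/5"]
    print(count_until_blocked(demo_urls, fake_scrape_detail))
```

Passing the real detail-scrape function in place of `fake_scrape_detail` would give one sample per run; averaging a few runs would give the number asked for above.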

aseams commented 3 years ago

Maybe it could pick from a list of proxies? That would probably get rid of the captcha altogether. Edit: Also, I'd like to add that at exactly 200 jobs, I got the captcha treatment.

PaulMcInnis commented 3 years ago

Yeah, I get dinged pretty quick nowadays. I figure I'm on their $hit list :laughing:

Not a bad idea about the proxies; that would be an interesting feature. I'll create a little feature-stub for this.

Nllii commented 3 years ago

For proxies I have used https://github.com/TheSpeedX/PROXY-List, mainly because mega.io limits uploads and downloads.
https://github.com/tonikelope/megabasterd.git (MBD) has a feature where it switches to the next proxy in the list once it gets throttled. I haven't used proxies on JobFunnel yet. Can't wait to try it out if I get blocked.
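
The "switch to the next proxy when throttled" idea could look roughly like the sketch below. This is not how MBD or JobFunnel implements it; `ProxyRotator` and `fetch_with_rotation` are hypothetical names, and the `fetch` callable is assumed to return None when a request is throttled or captcha'd.

```python
from typing import Callable, Optional, Sequence


class ProxyRotator:
    """Cycles through a fixed list of proxy addresses (illustrative only)."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def advance(self) -> str:
        """Move to the next proxy, wrapping around at the end of the list."""
        self._index = (self._index + 1) % len(self._proxies)
        return self.current


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str], Optional[str]],
    rotator: ProxyRotator,
    max_attempts: int,
) -> Optional[str]:
    """Try the current proxy; on a None result, rotate and retry.

    `fetch(url, proxy)` is an assumed callable that returns None when
    the request was throttled or blocked.
    """
    for _ in range(max_attempts):
        result = fetch(url, rotator.current)
        if result is not None:
            return result
        rotator.advance()  # this proxy looks blocked, try the next one
    return None
```

With the `requests` library, `fetch` could wrap `requests.get(url, proxies={"http": proxy, "https": proxy})`; what counts as "throttled" for a given site is left as an assumption here.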

P.S. YouTube still gives me a captcha, but only once a week now. It had been every 4 hours since December 2020. I think they are outsourcing machine learning labels to me.

PaulMcInnis commented 3 years ago

Based on this discussion, we will move forward with #145.