PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.78k stars 210 forks

[PROPOSAL] Decouple The Web Engine #145

Closed thebigG closed 2 years ago

thebigG commented 3 years ago

Hi there

Hope you are all doing well!

Is your feature request related to a problem? Please describe. This relates to several problems that have appeared recently (CAPTCHA), but also to issues we have had in the past (dynamically loaded websites such as Glassdoor). See issues #144 and #142.

Describe the solution you'd like I think the CAPTCHA-related problems could be solved by taking the approach suggested in #142, using https://github.com/pgaref/HTTP_Request_Randomizer. However, I'm thinking the best way to approach this would be to make the web engine (using Selenium) a factory. Instead of having the web engine be part of the Job class, it could be decoupled altogether behind a function that looks something like:

def get_web_engine(headless: bool, *args, **kwargs):
    proxy = get_random_proxy()
    engine = init_web_engine(headless, proxy, *args, **kwargs)
    ...
    return engine

This way, if we get a CAPTCHA at any step of scraping (whether while getting the description, the number of job pages, etc.), we can just request a new web engine from the function above, which will come with a new proxy.

As you can see, this also implies switching to Selenium, which I guess I'm proposing here as well. The reason is that if we switch to Selenium, we support both static and dynamic sites. And it looks like the web drivers do have headless support, the lack of which was one of the main reasons we didn't use Selenium in the past.
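A minimal sketch of what that factory could look like. Note that `WebEngine`, `PROXY_POOL`, and `get_random_proxy` are hypothetical stand-ins: in a real implementation `WebEngine` would be a Selenium webdriver configured via its options, and the proxy list would come from something like HTTP_Request_Randomizer rather than a static list.

```python
import random
from dataclasses import dataclass

# Hypothetical proxy pool; in practice this would come from a
# source like HTTP_Request_Randomizer rather than a static list.
PROXY_POOL = ["10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.3:8000"]


@dataclass
class WebEngine:
    """Stand-in for a Selenium webdriver configured with a proxy."""
    headless: bool
    proxy: str


def get_random_proxy() -> str:
    return random.choice(PROXY_POOL)


def get_web_engine(headless: bool = True) -> WebEngine:
    # With real Selenium, this is where webdriver options would be
    # set (headless mode, proxy, user agent, ...) before returning
    # a freshly constructed driver instance.
    return WebEngine(headless=headless, proxy=get_random_proxy())
```

Because every call picks a proxy independently, a scraper that hits a CAPTCHA can simply discard its engine and call `get_web_engine()` again.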

Describe alternatives you've considered So far this is the only approach I can think of at the moment. If anyone else has other ideas, please don't hesitate to provide feedback!

Additional context

Hope these ideas make sense. Cheers, Lorenzo

PaulMcInnis commented 3 years ago

It may also be worth looking into what other web scraping services do, as there are commercial offerings that provide capabilities similar to jobfunnel's.

Other stopgaps are falling back to Selenium on scrape failure, or more configurability for VPNs (i.e. switch VPNs after N scrapes / on scrape failure).

We can fairly easily detect the "I am human" page. In the short term I think we should provide a better error for Indeed specifically, built around detecting this page.
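A sketch of that short-term fix. The marker strings here are guesses, not Indeed's actual page content; the real CAPTCHA page would need to be inspected to pick reliable strings.

```python
class CaptchaEncountered(Exception):
    """Raised when a scrape lands on the 'I am human' page."""


# Hypothetical markers; inspect the real Indeed CAPTCHA page
# to choose strings that reliably identify it.
CAPTCHA_MARKERS = ("i am human", "captcha", "verify you are a human")


def check_for_captcha(html: str) -> None:
    """Raise a descriptive error instead of silently returning no jobs."""
    text = html.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        raise CaptchaEncountered(
            "Indeed returned a CAPTCHA page; try again later or from a new IP."
        )
```

Raising a dedicated exception lets the scraper (or the user-facing CLI) report the actual cause of the failure rather than an empty result.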

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

thebigG commented 3 years ago

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

Right. I noticed this too a couple of weeks back, and it's exactly why I thought the factory pattern for Selenium might be a good fit. If a scrape fails (and, like you said, we should have better mechanisms for detecting when a CAPTCHA shows up), we just resend the request through a random proxy.
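To sketch how the two ideas could fit together (the engine factory plus CAPTCHA detection), the stubs below stand in for a real Selenium driver factory and a real CAPTCHA-page check; only the retry loop itself is the point here.

```python
import random

# Hypothetical stubs: in practice these would be the Selenium
# driver factory and a real check against Indeed's CAPTCHA page.
def get_web_engine():
    return {"proxy": random.choice(["10.0.0.1:8080", "10.0.0.2:3128"])}


def looks_like_captcha(html):
    return "i am human" in html.lower()


def scrape_with_retries(url, fetch, max_retries=3):
    """Retry a scrape through a fresh proxy each time a CAPTCHA appears."""
    for _ in range(max_retries):
        engine = get_web_engine()  # new engine => new random proxy
        html = fetch(engine, url)
        if not looks_like_captcha(html):
            return html
    raise RuntimeError(f"CAPTCHA persisted after {max_retries} proxies")
```

Here `fetch` is whatever function actually drives the engine against the URL; passing it in keeps the retry logic independent of the scraping details.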