PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

Issues scraping large cities #123

Closed oosokoya closed 3 years ago

oosokoya commented 3 years ago

Description

Over the last few weeks I've had trouble using JobFunnel to scrape jobs in large cities (e.g. New York, Atlanta). Smaller cities such as Oklahoma City seem to be OK (under 5 pages and under 300 jobs to scrape). Larger cities often have over 27 pages and 1300+ jobs to scrape, which seems to cause a problem: after the scrape finishes, error messages are displayed (shown in the Actual behavior section) and no Excel file is created.

Note that I installed JobFunnel on a new machine and encountered exactly the same problem.

Steps to Reproduce

Example search

locale: USA_ENGLISH
State: NY
City: New York
Radius: 30 miles
Key Words: Project Manager

All other settings are default

Expected behavior

The Indeed and Monster sites are scraped and an Excel file with the results is created.

Actual behavior

The scraping process completes, but the following error message is generated and no Excel file is created:

  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 196, in scrape
    job_soups = self.get_job_soups_from_search_result_listings()
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 224, in get_job_soups_from_search_result_listings
    __get_job_soups_by_key_id(next_listings_page_soup)
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 206, in __get_job_soups_by_key_id
    return {
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 207, in <dictcomp>
    self.get(JobField.KEY_ID, job_soup): job_soup
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 109, in get
    return soup.find('h2', attrs={'class': 'title'}).find('a').get(
AttributeError: 'NoneType' object has no attribute 'find'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\Scripts\funnel-script.py", line 33, in <module>
    sys.exit(load_entry_point('JobFunnel==3.0.1', 'console_scripts', 'funnel')())
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\__main__.py", line 28, in main
    job_funnel.run()
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 114, in run
    scraped_jobs_dict = self.scrape()
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 236, in scrape
    incoming_jobs_dict = scraper.scrape()
  File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 198, in scrape
    raise ValueError(
ValueError: Unable to extract jobs from initial search result page: 'NoneType' object has no attribute 'find'
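
The innermost frame fails because something in the chained lookup was None (either the soup itself or the result of looking up the h2 with class title), so the next .find call raised AttributeError. A minimal sketch of a guarded lookup follows, assuming a BeautifulSoup-style object with find and get methods. The helper name get_title_link_attr and its attr parameter are hypothetical (the attribute actually fetched is cut off in the traceback), and this is not JobFunnel's code:

```python
from typing import Any, Optional


def get_title_link_attr(soup: Any, attr: str) -> Optional[str]:
    """Return `attr` of the <a> inside <h2 class="title">, or None.

    `soup` is assumed to behave like a BeautifulSoup tag. Each step is
    checked so a missing element yields None instead of AttributeError.
    """
    if soup is None:
        return None
    title = soup.find('h2', attrs={'class': 'title'})
    if title is None:
        # No title heading: the page may not be a normal listing page.
        return None
    link = title.find('a')
    if link is None:
        return None
    return link.get(attr)
```

A caller could then skip listings for which the helper returns None rather than aborting the whole scrape.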

Possible issues

On one of the scrapes, when I pasted the link into a browser, a CAPTCHA page came up asking me to verify that I wasn't a robot. Could it be that larger scrapes trigger the CAPTCHA, causing the scrape to error out and the whole process to fail?
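
One way to test this hypothesis would be to check the fetched HTML for CAPTCHA wording before parsing it. The sketch below is only a heuristic: the marker phrases and the function name looks_like_captcha are assumptions, not anything taken from JobFunnel, Indeed, or Monster, and a normal page that merely embeds a CAPTCHA script would also match.

```python
# Hypothetical marker phrases; real CAPTCHA pages may be worded differently.
CAPTCHA_MARKERS = ("captcha", "not a robot", "verify you are human")


def looks_like_captcha(html: str) -> bool:
    """Return True if the page text contains any known CAPTCHA marker."""
    text = html.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)
```

If this returned True for a results page, the scrape could log a clear "blocked by CAPTCHA" message instead of failing later with a NoneType error.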

Let me know if you require more information.

Environment

Windows 10 machine

thebigG commented 3 years ago

Interesting. Will try to reproduce.

Yes, we have had issues with CAPTCHA in the past, and we have even managed to work around it. But it's really tricky, because basically we have to use Selenium, which literally opens a browser window so that the user can solve a CAPTCHA if one comes up. The problem with this approach is that it is very slow compared to static scraping (which is what is done currently) and is not as smooth an experience for users as what we have now.

Like I said, will try to reproduce and will give you more feedback on your issue.

thebigG commented 3 years ago

Quick question: which site gave you the CAPTCHA? Or was it both of them?

oosokoya commented 3 years ago

I noticed the captcha on the Indeed site. I'll do some troubleshooting and see if anything occurs on the monster site.

thebigG commented 3 years ago

Thanks for the quick response! That's a new one. I'm currently running an instance of JobFunnel with your keywords/args and it has not crashed so far. Will keep you posted.

thebigG commented 3 years ago

I was able to reproduce! I highly suspect you got the same error as me. Do you mind checking the log generated by JobFunnel? It should be in a folder with a name similar to ...search_results, and the log file should be called log.log. If you can find it, check whether it contains an error along the lines of share duplicate key_id:.

This looks like an issue with the following snippet of code in base.py:

                if job:
                    # Handle inter-scraped data duplicates by key.
                    # TODO: move this functionality into duplicates filter
                    if job.key_id in jobs_dict:
                        self.logger.error(
                            "Job %s and %s share duplicate key_id: %s",
                            job.title, jobs_dict[job.key_id].title, job.key_id
                        )
                    else:
                        jobs_dict[job.key_id] = job

Don't have any more time tonight to investigate this further because it's getting kind of late :sweat_smile:, but if I had to guess it looks like for some reason there is a key conflict in the job dictionary.
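
For reference, the snippet above logs the conflict and keeps the job that was stored first. A self-contained sketch of that behavior, where the Job dataclass and the merge_jobs function are hypothetical stand-ins rather than JobFunnel's real classes:

```python
import logging
from dataclasses import dataclass
from typing import Dict, Iterable

logger = logging.getLogger(__name__)


@dataclass
class Job:
    # Hypothetical stand-in for the scraped job object.
    key_id: str
    title: str


def merge_jobs(jobs: Iterable[Job]) -> Dict[str, Job]:
    """Build a dict keyed by key_id, keeping the first job per key."""
    jobs_dict: Dict[str, Job] = {}
    for job in jobs:
        existing = jobs_dict.get(job.key_id)
        if existing is not None:
            # Same message format as the snippet above; the later job is dropped.
            logger.error(
                "Job %s and %s share duplicate key_id: %s",
                job.title, existing.title, job.key_id,
            )
            continue
        jobs_dict[job.key_id] = job
    return jobs_dict
```

With this shape a duplicate key produces a log entry and a skipped job, not an exception.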

Will investigate further when I get more time tomorrow evening.

Thanks so much for bringing this to our attention!

Cheers!

oosokoya commented 3 years ago

I have just checked the logs and can see the same thing "share duplicate key_id:"

thebigG commented 3 years ago

I haven't had time to look at this issue in depth; I've had my hands full with JobFunnel testing at the moment. For now, as a quick fix, you can comment out one of the providers in your settings file like so:

    # - INDEED

and scrape using one scraper at a time.

Hope this helps!