PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

Germany_German Support #132

Closed lucky7xz closed 3 years ago

lucky7xz commented 3 years ago

Duplicated the changes made for previous locale support expansions to enable scraping Indeed and Monster on the ".de" domain.


There are some problems though:

  1. IP ban? After a certain number of listings has been processed (somewhere between 25 and 75), any further query yields the output below. (Note that enabling a VPN allows further queries.)

[2021-01-28 13:55:51,824] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperGEGer', 'MonsterScraperGEGer']
[2021-01-28 13:55:53,348] [INFO] IndeedScraperGEGer: Found 2 pages of search results for query=python
[2021-01-28 13:55:53,731] [INFO] IndeedScraperGEGer: Scraped 0 job listings from search results pages
[2021-01-28 13:55:53,735] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperGEGer
[2021-01-28 13:55:53,737] [INFO] MonsterScraperGEGer: No get() or set() will be done for Job attrs: ['REMOTENESS']
[2021-01-28 13:55:54,605] [ERROR] JobFunnel: Failed to scrape jobs for MonsterScraperGEGer
[2021-01-28 13:55:54,605] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2021-01-28 13:55:54,625] [WARNING] JobFunnel: No new jobs were added to CSV.

  2. After the Indeed scraper finishes, the error below appears. I've tried multiple province/city configurations.

C:\Users\Lucky\Scripts>funnel load -s my_settings.yaml
[2021-01-28 13:02:27,031] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperGEGer', 'MonsterScraperGEGer']
[2021-01-28 13:02:28,423] [INFO] IndeedScraperGEGer: Found 1 pages of search results for query=python
[2021-01-28 13:02:28,987] [INFO] IndeedScraperGEGer: Scraped 23 job listings from search results pages
100%|##################################################################################| 23/23 [00:30<00:00, 1.33s/it]
[2021-01-28 13:02:59,601] [INFO] MonsterScraperGEGer: No get() or set() will be done for Job attrs: ['REMOTENESS']
[2021-01-28 13:03:00,363] [ERROR] JobFunnel: Failed to scrape jobs for MonsterScraperGEGer
Traceback (most recent call last):
  File "C:\Users\Lucky\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel==3.0.1', 'console_scripts', 'funnel')()
  File "C:\Users\Lucky\AppData\Roaming\Python\Python38\site-packages\jobfunnel\__main__.py", line 28, in main
    job_funnel.run()
  File "C:\Users\Lucky\AppData\Roaming\Python\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 114, in run
    scraped_jobs_dict = self.scrape()
  File "C:\Users\Lucky\AppData\Roaming\Python\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 244, in scrape
    self._check_for_inter_scraper_validity(
  File "C:\Users\Lucky\AppData\Roaming\Python\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 220, in _check_for_inter_scraper_validity
    raise ValueError(
ValueError: Inter-scraper key-id duplicate! 22e7f67c9200c7ce
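
For readers unfamiliar with this kind of error, here is a minimal sketch of how a cross-scraper duplicate check could raise it. This is not JobFunnel's actual `_check_for_inter_scraper_validity`; the function name `check_no_shared_key_ids`, the dictionary layout, and the ids are all hypothetical and only illustrate the idea that two different scrapers returning the same key-id triggers a `ValueError`.

```python
from typing import Dict, List


def check_no_shared_key_ids(key_ids_by_scraper: Dict[str, List[str]]) -> None:
    """Raise ValueError if two different scrapers report the same key-id.

    Hypothetical helper for illustration only, not JobFunnel code.
    """
    first_owner: Dict[str, str] = {}  # key-id -> scraper that produced it first
    for scraper_name, key_ids in key_ids_by_scraper.items():
        for key_id in key_ids:
            owner = first_owner.get(key_id)
            if owner is not None and owner != scraper_name:
                raise ValueError(f"Inter-scraper key-id duplicate! {key_id}")
            first_owner[key_id] = scraper_name


# Made-up ids: both scrapers return a listing with the id "id-002".
scraped = {
    "IndeedScraperGEGer": ["id-001", "id-002"],
    "MonsterScraperGEGer": ["id-003", "id-002"],
}

message = ""
try:
    check_no_shared_key_ids(scraped)
except ValueError as err:
    message = str(err)

print(message)
```

Under this assumption, the error would mean that a Monster listing ended up with the same id as a listing already scraped from Indeed, which may point at how ids are derived rather than at the province/city settings.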

thebigG commented 3 years ago

Thanks so much for German support!

That issue looks like #123. We are currently working on it, but I haven't gotten around to fixing it yet. I'll try my best to fix it in the coming days so you can merge the changes, and hopefully then we'll have fully functional German support.

Really excited about this new addition 👍

codecov-io commented 3 years ago

Codecov Report

Merging #132 (8f536ae) into master (e509ef4) will decrease coverage by 0.22%. The diff coverage is 58.33%.


@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   36.17%   35.95%   -0.23%     
==========================================
  Files          22       22              
  Lines        1454     1488      +34     
==========================================
+ Hits          526      535       +9     
- Misses        928      953      +25     
| Impacted Files | Coverage Δ | |
|---|---|---|
| jobfunnel/backend/scrapers/base.py | 39.87% <ø> (+0.88%) | :arrow_up: |
| jobfunnel/backend/scrapers/indeed.py | 25.40% <ø> (-1.59%) | :arrow_down: |
| jobfunnel/backend/scrapers/registry.py | 100.00% <ø> (ø) | |
| jobfunnel/resources/defaults.py | 100.00% <ø> (ø) | |
| jobfunnel/backend/scrapers/monster.py | 27.10% <28.57%> (+0.06%) | :arrow_up: |
| jobfunnel/backend/tools/tools.py | 29.87% <100.00%> (ø) | |
| jobfunnel/resources/enums.py | 100.00% <100.00%> (ø) | |
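
The headline percentages follow from the Hits and Lines rows of the coverage diff. The sketch below recomputes them. The helper names are hypothetical, and rounding down to two decimals is an assumption that happens to reproduce the figures shown in the diff; Codecov's actual rounding rule is not stated in this report.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines that were hit."""
    return hits / lines * 100


def floor_2dp(value: float) -> float:
    """Round down to two decimals (assumed rounding, not confirmed)."""
    return math.floor(value * 100) / 100


before = coverage_percent(526, 1454)  # master (e509ef4)
after = coverage_percent(535, 1488)   # #132 (8f536ae)

print(floor_2dp(before))          # 36.17
print(floor_2dp(after))           # 35.95
print(floor_2dp(after - before))  # -0.23
print(1454 - 526, 1488 - 535)     # misses: 928 953
```

The unrounded change is about -0.222 percentage points, which would explain why the summary sentence says 0.22% while the diff shows -0.23%.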


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e509ef4...8f536ae.

lucky7xz commented 3 years ago

> Thanks so much for German support!
>
> That issue looks like #123. We are currently working on it, but I haven't gotten around to fixing it yet. I'll try my best to fix it in the coming days so you can merge the changes, and hopefully then we'll have fully functional German support.
>
> Really excited about this new addition 👍

My pleasure, truly. I want to say that I'm new to GitHub. This is actually my first commit, so I don't really know what I'm doing tbh. Nonetheless, I'm interested in what exactly the problems are and how they can be (or are being) fixed. Luxembourg support would be really cool as well. The official languages are ENG, GER & FR over there, so if this works out, adding it should be no problem, I think. And maybe I can hack that one on my own :)

Would really appreciate it if you could notify me with what you've done when you have it working. 🚀

PaulMcInnis commented 3 years ago

Closing this since the feature is being provided by PR #136.