PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

This setup won't work for the urls which have prefix as domain (like Indeed for New Zealand) #60

Closed: riyaagrahari closed this issue 4 years ago

riyaagrahari commented 4 years ago

Description

Searching for jobs in New Zealand does not work, because the site for that country puts the country code in front of the domain as a prefix, and the current URL formation does not handle that.

URL formation is implemented in indeed.py. It would be great if it supported URLs that carry the domain as a prefix.

markkvdb commented 4 years ago

JobFunnel does not support countries other than Canada and the US at the moment. Other countries might work by accident, but most likely will not. Adding more countries is one of our highest priorities, but it has proved to be time-consuming work. The problem is that every country has its own little problems.

riyaagrahari commented 4 years ago

For now there are three files for scraping: indeed.py, glassdoor.py, and monster.py. What if we add an additional file that also tries the URL formed by adding the domain name as a prefix? It would check the candidate URLs, and whichever URL gets a hit would be used for the scrape.
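As a rough sketch of that idea, and not JobFunnel's actual code: the helper below builds both URL forms for a site and returns the first one that a caller-supplied check accepts. The names `candidate_urls`, `first_hit`, and `is_reachable` are hypothetical, and the check is passed in as a function so the example runs without any network access.

```python
from typing import Callable, List, Optional


def candidate_urls(site: str, country: str) -> List[str]:
    """Build both URL forms for a site (hypothetical helper)."""
    return [
        f"https://www.{site}.{country}",  # country code as a suffix
        f"https://{country}.{site}.com",  # country code as a prefix
    ]


def first_hit(urls: List[str], is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL the check accepts, or None if none do."""
    for url in urls:
        if is_reachable(url):
            return url
    return None


# Stand-in check for illustration only; a real one would issue an HTTP request.
def fake_check(url: str) -> bool:
    return url.startswith("https://nz.")


chosen = first_hit(candidate_urls("indeed", "nz"), fake_check)
```

A real reachability check would cost one extra request per candidate, and it would still miss sites whose name differs per country.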

markkvdb commented 4 years ago

Good suggestion. I have been thinking for a while about the best way to support more countries, since it seems that every country needs small specialisations (especially non-English-speaking countries). Therefore, I was thinking of adding support for other countries one by one. One of the adaptations is indeed the base URL. I was personally thinking of creating a dictionary with domain names as keys and base URLs as values.

Your suggested approach might work for New Zealand, but the Netherlands, for example, has a completely different URL for Monster: monsterboard.nl instead of monster.nl.
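A minimal sketch of such a dictionary, assuming one table per provider: the names `MONSTER_BASE_URLS` and `base_url` are hypothetical, and the entries are illustrative rather than a verified list of supported domains.

```python
# Hypothetical lookup table for one provider (Monster): keys are domain
# names, values are base URLs. Entries are illustrative assumptions.
MONSTER_BASE_URLS = {
    "com": "https://www.monster.com",
    "ca": "https://www.monster.ca",
    "nl": "https://www.monsterboard.nl",  # not monster.nl, as noted above
}


def base_url(domain: str) -> str:
    """Look up the base URL for a domain, failing loudly if unsupported."""
    try:
        return MONSTER_BASE_URLS[domain]
    except KeyError:
        raise ValueError(f"unsupported domain: {domain!r}") from None
```

An explicit table handles irregular cases like the Dutch one, which a prefix or suffix rule cannot, and an unsupported domain raises an error instead of producing a URL that only works by accident.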

riyaagrahari commented 4 years ago

Yes, this was only one scenario, for one country. I researched it and found the same thing. We can keep a dictionary mapping each domain name to its URL, but that is a very time-consuming task. I would really need help adding this functionality.

markkvdb commented 4 years ago

The problem is not so much adding a dictionary with URLs for the supported domains, but rather all the nitty-gritty changes necessary to support different countries. It requires a systematic review of all facets of the scraper. This would let us see the similarities between all domains as well as the specific differences. With this, we could adapt the code base to add these other domains in as abstract a manner as possible.

In short, this is probably the hardest functionality to add to JobFunnel, but it could be a lot of fun to work on together with a few contributors. We must, however, decide how we want to approach this. If you ask me, I would do a "theoretical" analysis first, before even starting to change code.