More info on making a web crawler look natural: https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-headers/
Advice on avoiding crawler bans: https://docs.scrapy.org/en/latest/topics/practices.html#bans
The most comprehensive overview of measures sites use to protect themselves from scraping: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md
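As a rough illustration of what those recommendations mean in practice, here is a minimal Scrapy settings sketch (the concrete values and the User-Agent string are my assumptions, not tuned numbers):

```python
# settings.py -- sketch: throttling plus browser-like headers,
# along the lines of the Scrapy practices doc and the Scrapfly article above.

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Slow down and spread out requests so the crawler looks less like a bot.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30

# Retry transient failures instead of hammering the site.
RETRY_ENABLED = True
RETRY_TIMES = 2
```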
A proxy might be required to access some sources, e.g. stepstone.de. However, even after adding a proxy I still could not process this site; it requires more investigation.
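For the record, attaching a proxy to a Scrapy request goes through the request meta and is picked up by the built-in HttpProxyMiddleware. The spider below only sketches that wiring with placeholder credentials, since stepstone.de still failed for me even with a proxy:

```python
import scrapy


class StepstoneSpider(scrapy.Spider):
    # Hypothetical spider; shows the proxy wiring, not a working crawl.
    name = "stepstone"
    start_urls = ["https://www.stepstone.de/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # Placeholder proxy address -- replace with a real endpoint.
                meta={"proxy": "http://user:pass@proxy.example.com:8080"},
            )

    def parse(self, response):
        self.logger.info("Fetched %s (%d bytes)", response.url, len(response.body))
```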
Unfortunately, I have to stop work on this: I discovered that <site>/robots.txt on every target platform denies custom crawlers. If I want to respect the platform owners' will, I have to make my crawlers obey these rules.
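Scrapy can enforce this on its own with ROBOTSTXT_OBEY = True in settings.py; a quick standalone check with the standard library looks like this (the crawler name is a placeholder):

```python
# Check robots.txt before committing to a crawl.
# Inside Scrapy, ROBOTSTXT_OBEY = True in settings.py does this automatically;
# the snippet below is a standalone stdlib check.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://hh.ru/robots.txt")  # example target
rp.read()

# "MyJobsCrawler" is a placeholder user-agent name for a custom crawler.
print(rp.can_fetch("MyJobsCrawler", "https://hh.ru/search/vacancy?text=Android"))
```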
If I ever have money to invest in the project, crowdsourcing this data might be an option.
User story
As a user, I want to receive all vacancies from hh.ru posted over the past month that match the keywords Android and Software Engineer.
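A minimal spider sketch for this user story could look like the following. The query parameters (e.g. search_period=30 for the past month) and the CSS selectors are assumptions about hh.ru's markup and would need to be verified against the live site:

```python
import scrapy


class HhVacanciesSpider(scrapy.Spider):
    # Sketch of the user story above; URL parameters and selectors are
    # assumptions, not verified against hh.ru's current pages.
    name = "hh_vacancies"
    start_urls = [
        # "search_period=30" is assumed to limit results to the past month.
        "https://hh.ru/search/vacancy?text=Android+Software+Engineer&search_period=30"
    ]

    def parse(self, response):
        # Hypothetical selector for a vacancy card link.
        for href in response.css("a.serp-item__title::attr(href)").getall():
            yield {"url": response.urljoin(href)}

        # Follow pagination so more than the first page is collected
        # (possibly relevant to the 220-vs-560 discrepancy noted below).
        next_page = response.css("a[data-qa='pager-next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```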
Tech notes
R&D turned up several interesting solutions already on the market.
At this point, a basic scraper is ready.
The current issue is incorrect encoding. Another issue is wrong parsing of data in the format `ООО <!----> Компания` (an HTML comment embedded in the company name).
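A sketch of two possible fixes, assuming the scraper is Scrapy-based: force UTF-8 in exported feeds, and strip the embedded HTML comments with w3lib (which ships with Scrapy):

```python
# 1) If Cyrillic text comes out escaped or garbled in exported feeds,
#    forcing UTF-8 output in settings.py usually helps:
# FEED_EXPORT_ENCODING = "utf-8"

# 2) Company names like "ООО <!----> Компания" contain HTML comments;
#    stripping them and collapsing whitespace normalizes the value.
from w3lib.html import remove_comments

raw = "ООО <!----> Компания"
clean = " ".join(remove_comments(raw).split())
print(clean)  # -> "ООО Компания"
```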
Extending this scraper would require extracting the job description from the next page and saving it into a NoSQL database. Future job board support: Indeed, SEEK, LinkedIn, StepStone, etc.
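If MongoDB ends up being the NoSQL store (an assumption on my part), a minimal item pipeline in the spirit of the Scrapy docs could look like this; the connection string, database and collection names are placeholders:

```python
import pymongo


class VacancyMongoPipeline:
    """Sketch: store each scraped vacancy in MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "jobs"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each item becomes one document in the "vacancies" collection.
        self.db["vacancies"].insert_one(dict(item))
        return item
```

It would be enabled through ITEM_PIPELINES in settings.py.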
An important point: the scraper currently parses 220 items while the hh page shows 560 items. The root cause is not clear yet.
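One way to start narrowing this down is to log how many items actually get scraped and whether requests are being dropped by the duplicate filter; a small sketch (the method goes inside the spider class) is below:

```python
# settings.py -- log every request dropped by the duplicate filter:
# DUPEFILTER_DEBUG = True

# Inside the spider -- report the final item count when the crawl ends:
def closed(self, reason):
    count = self.crawler.stats.get_value("item_scraped_count", 0)
    self.logger.info("Crawl finished (%s), items scraped: %s", reason, count)
```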
In general, this is a subtask of the bigger goal of building a platform that scrapes interesting job boards and provides vacancies sorted by a score calculated from my previous feedback.