Gelassen / web-crawlers


[feature] basic crawler #1

Closed Gelassen closed 1 year ago

Gelassen commented 1 year ago

User story

As a user, I want to receive all vacancies from hh.ru posted in the past month that match the keywords Android and Software Engineer.

Tech notes

R&D has turned up several interesting solutions on the market:

A basic scraper is ready at this point. One open issue is incorrect text encoding. Another is incorrect parsing of data in the format ООО <!----> Компания, where an empty HTML comment sits in the middle of the text.
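A minimal sketch of one way to handle both issues, using only the Python standard library. It assumes the page bytes are decoded with an explicitly chosen charset (UTF-8 here is an assumption; the real value should come from the response's Content-Type header or the page's meta tag) and that comments such as `<!---->` should simply be dropped. `TextExtractor` and `extract_text` are hypothetical names, not part of the existing scraper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; comments such as <!----> are ignored by default."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags and comments, never for the comments themselves.
        self.parts.append(data)


def extract_text(raw: bytes, encoding: str = "utf-8") -> str:
    # Decode explicitly instead of relying on a platform default charset.
    html = raw.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where the comment used to be.
    return " ".join("".join(parser.parts).split())


sample = "<span>ООО <!----> Компания</span>".encode("utf-8")
print(extract_text(sample))  # ООО Компания
```

If the output still looks garbled, the bytes were most likely decoded with the wrong charset; passing the correct one as `encoding` (for example `cp1251`) is the first thing to try.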

Extending this scraper would require extracting the job description from the next page and saving it to a NoSQL database. Job boards to support in the future: Indeed, SEEK, LinkedIn, StepStone, etc.

An important point: the scraper currently parses 220 items while the hh page shows 560. The root cause is not clear yet.

In general, this is a subtask of a bigger task: building a platform that scrapes interesting job boards and provides vacancies sorted by a score calculated from my previous feedback.

Gelassen commented 1 year ago

More info on making a web crawler look natural: https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-headers/
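A small sketch of the idea behind that article: send the same kind of headers a browser would, rather than the HTTP library's defaults. The header values and the `build_request` helper below are illustrative assumptions, not values taken from the article or from the existing scraper; real values are best copied from your own browser's network tab. No request is actually sent here.

```python
import urllib.request

# Illustrative values only; these are assumptions, not a recommended set.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ru-RU,ru;q=0.9,en;q=0.8",
}


def build_request(url: str) -> urllib.request.Request:
    # Attach the headers up front so every request built this way is consistent.
    return urllib.request.Request(url, headers=BROWSER_LIKE_HEADERS)


request = build_request("https://example.com/search?text=Android")
for name, value in request.header_items():
    print(f"{name}: {value}")
```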

Gelassen commented 1 year ago

Advice on keeping a crawler from getting banned: https://docs.scrapy.org/en/latest/topics/practices.html#bans

Gelassen commented 1 year ago

The most comprehensive overview of protective measures against scraping: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md

Gelassen commented 1 year ago

A proxy might be required to access some sources, e.g. stepstone.de. However, even after adding a proxy I had trouble processing this site; it requires more investigation.

Unfortunately, I have to stop work on this: I discovered that <site>/robots.txt denies custom crawlers everywhere. If I want to respect the platform owners' wishes, I have to make my crawlers obey these rules.
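For reference, a minimal sketch of how a crawler could check these rules before fetching a page, using `urllib.robotparser` from the Python standard library. The robots.txt content, the `my-crawler` user agent, and the `is_allowed` helper are all made up for illustration; they are not copied from any real site. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one well-known bot is allowed, everyone else is denied.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

url = "https://example.com/vacancies"
print(is_allowed(SAMPLE_ROBOTS, "my-crawler", url))  # False
print(is_allowed(SAMPLE_ROBOTS, "Googlebot", url))   # True
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample string.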

If I have some money to invest in the project, crowdsourcing this data might be an option.