B-Open / jobbuzz

Brunei job search database and alert notification
https://jobbuzz.org
MIT License

Changing scraper logic to be more resilient #31

Closed dsychin closed 2 years ago

dsychin commented 2 years ago

Currently, with go-colly, if a single page fails to load, the whole scrape has to be treated as corrupted, because we need the complete set of data to determine which job listings are active and which are inactive.

go-colly has no built-in retry policy such as attempt limits or backoff, and its error handling is not very useful.

I think it would be better to fetch the HTML as a string (so we can apply our own retry logic) and then use an HTML parser to process the data.

This will be more similar to the logic of the scraper in the .NET version.

Select HTML nodes in Go with CSS selectors: https://github.com/PuerkitoBio/goquery

Retry: https://github.com/avast/retry-go

dsychin commented 2 years ago

@syahnur197 FYI