DocSearch Scraper, originally developed by Algolia and forked by Typesense after Algolia took their code private, has served us well. However, it has many limitations, and addressing them in such an aging code base looks to be non-trivial.
In particular, we would like support for incremental index builds, custom markdown selectors, multiple start URLs, and a few other minor improvements.
### Tasks
- [x] Initial demonstration of Playwright powered crawler
- [x] Support custom include and exclude URL lists
- [x] Simplify Typesense schema for docs
- [x] Support incremental index builds
- [x] Support custom locator definitions per URL
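
As a rough illustration of two of the tasks above (include/exclude URL lists and incremental index builds), here is a minimal sketch. It is not the code in this PR: the function names `should_crawl` and `needs_reindex`, the use of regex patterns, the exclude-wins precedence, and the content-hash approach to incremental builds are all assumptions made for the example.

```python
import hashlib
import re


def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a URL should be crawled (hypothetical helper).

    Assumed semantics: exclude patterns always win; an empty include
    list means every non-excluded URL is allowed.
    """
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return not include or any(re.search(pattern, url) for pattern in include)


def needs_reindex(url: str, text: str, seen: dict[str, str]) -> bool:
    """Return True if the page content changed since the last build.

    `seen` maps URL -> SHA-256 digest from the previous run (assumed to be
    persisted by the caller). The digest is updated in place on change.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen.get(url) == digest:
        return False  # unchanged page, skip re-indexing
    seen[url] = digest
    return True
```

With this shape, a crawler loop would call `should_crawl` before visiting a link and `needs_reindex` before pushing a document to the index, so unchanged pages are skipped on subsequent runs.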