Altinn / digdir-assistants

Generative AI assistants
MIT License

Replace typesense-docsearch-scraper with crawlee #69

Closed bdb-dd closed 3 months ago

bdb-dd commented 3 months ago

Description

DocSearch Scraper, originally developed by Algolia and forked by Typesense after Algolia took their code private, has served us well. However, it has many limitations, and addressing them in such an aging code base looks non-trivial.

In particular, we would like support for incremental index builds, custom markdown selectors, multiple start URLs, and a few other minor improvements.

### Tasks
- [x] Initial demonstration of a Playwright-powered crawler
- [x] Support custom include and exclude URL lists
- [x] Simplify the Typesense schema for docs
- [x] Support incremental index builds
- [x] Support custom locator definitions per URL
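
For readers unfamiliar with these features, here is a minimal, dependency-free sketch of how include/exclude URL lists, per-URL locators, and incremental builds could fit together. It is not the code that was deployed: the `CrawlConfig` shape, all function names, and the `docs.example.com` URLs are hypothetical, and the FNV-1a hash is only a stand-in for whatever change detection the real crawler uses.

```typescript
// Hypothetical config shape; field names are illustrative, not taken from the project.
interface CrawlConfig {
  startUrls: string[];
  include: RegExp[];
  exclude: RegExp[];
  // Maps a URL pattern to the CSS selector that holds the page's main content.
  locators: { pattern: RegExp; selector: string }[];
  defaultSelector: string;
}

// A URL is crawled when it matches at least one include pattern and no exclude pattern.
function shouldCrawl(url: string, cfg: CrawlConfig): boolean {
  return cfg.include.some((re) => re.test(url)) && !cfg.exclude.some((re) => re.test(url));
}

// The first matching locator wins; otherwise fall back to the default selector.
function selectorFor(url: string, cfg: CrawlConfig): string {
  const hit = cfg.locators.find((l) => l.pattern.test(url));
  return hit ? hit.selector : cfg.defaultSelector;
}

// FNV-1a 32-bit hash: a dependency-free stand-in for a real content hash.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Incremental build: re-index only when the page is new or its content hash changed.
function needsReindex(url: string, text: string, indexed: Map<string, string>): boolean {
  return indexed.get(url) !== contentHash(text);
}

// Example configuration with made-up URLs and selectors.
const cfg: CrawlConfig = {
  startUrls: ["https://docs.example.com/", "https://docs.example.com/api/"],
  include: [/^https:\/\/docs\.example\.com\//],
  exclude: [/\/changelog\//],
  locators: [{ pattern: /\/api\//, selector: "main .api-reference" }],
  defaultSelector: "article",
};
```

In a real crawler, `shouldCrawl` would gate which discovered links are enqueued, `selectorFor` would pick the element to extract on each page, and the `indexed` map would be loaded from the hashes stored alongside previously indexed documents.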

bdb-dd commented 3 months ago

Typesense DocSearch crawler replaced and deployed to production.