GSA-TTS / jemison

An exploration of the space of search

:computer: update text index design for real world use #14

Open jadudm opened 4 days ago

jadudm commented 4 days ago

At a glance

In order to see what I'm searching for, as a user, I want the actual text to be presented in search results.

Acceptance Criteria

We use DRY behavior-driven development wherever possible.

### then... 
- [ ] [a thing happens]

Shepherd

Background

The prototype for jemison throws all content into a site_index table using SQLite's FTS5. This... does the job, but before throwing the content in, we do stopword removal. As a result, the content being indexed is not actually what is on the websites we're crawling.
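To make the mismatch concrete, here is a minimal sketch of stopword removal. The stopword list and the `remove_stopwords` helper are hypothetical; the prototype's actual list and implementation are not shown in this issue.

```python
# Hypothetical stopword list; the prototype's real list is not shown in this issue.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def remove_stopwords(text: str) -> str:
    """Drop stopwords, keeping the remaining words in their original order."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


original = "The status of the application is available on the portal"
indexed = remove_stopwords(original)
print(indexed)  # status application available portal
```

The indexed string is searchable, but it is no longer something we would want to show a user as a result snippet.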

So, in order to present actual/meaningful results, we're going to have to store the original content alongside the indexed content. The FTS5 searches will occur over the index, but we'll have to link back to the original, and present that as part of search results.
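One way this could look, as a rough sketch rather than the actual design: a plain table holds the original text, an FTS5 table holds the cleaned text, and the two are linked by sharing the row id. The `site_content` table, its columns, the `cleaned` column, and the helper functions are all assumptions made for illustration; only `site_index` and FTS5 come from the prototype. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

# Hypothetical stopword list; not the one used by the prototype.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def remove_stopwords(text: str) -> str:
    """Drop stopwords, keeping the remaining words in their original order."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


def build_index(conn: sqlite3.Connection, pages: list[tuple[str, str]]) -> None:
    """Store each page's original text, and index the cleaned text under the same id."""
    # Assumed schema: site_content keeps what was actually on the page.
    conn.execute(
        "CREATE TABLE site_content ("
        "id INTEGER PRIMARY KEY, path TEXT NOT NULL, original TEXT NOT NULL)"
    )
    # The FTS5 table holds only the stopword-cleaned text.
    conn.execute("CREATE VIRTUAL TABLE site_index USING fts5(cleaned)")
    for path, text in pages:
        cur = conn.execute(
            "INSERT INTO site_content (path, original) VALUES (?, ?)", (path, text)
        )
        # Reuse the content row's id as the FTS rowid so the two rows stay linked.
        conn.execute(
            "INSERT INTO site_index (rowid, cleaned) VALUES (?, ?)",
            (cur.lastrowid, remove_stopwords(text)),
        )
    conn.commit()


def search(conn: sqlite3.Connection, query: str) -> list[tuple[str, str]]:
    """Match against the cleaned index, but return the original text."""
    # Clean the query the same way as the content, so stopwords in the
    # query do not cause a miss. Queries with FTS5 special characters
    # would need escaping, which this sketch does not handle.
    cleaned = remove_stopwords(query)
    if not cleaned:
        return []
    rows = conn.execute(
        "SELECT c.path, c.original "
        "FROM site_index JOIN site_content AS c ON c.id = site_index.rowid "
        "WHERE site_index MATCH ? ORDER BY site_index.rank",
        (cleaned,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_index(
    conn,
    [
        ("/a", "The status of the application is available on the portal"),
        ("/b", "Contact the office to schedule an appointment"),
    ],
)
results = search(conn, "status of the portal")
for path, original in results:
    print(path, original)
```

Sharing the row id keeps the link implicit and avoids a separate mapping table; an explicit foreign-key column would work just as well if the ids ever need to diverge.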

References

Some references that might be inspirational are in the wiki.

Security Considerations

Required per CM-4.

None, although at some point we have to do some filtering on fowl language. :duck:


Process checklist

- [ ] Has a clear story statement
- [ ] Can reasonably be done in a few days (otherwise, split this up!)
- [ ] Shepherds have been identified
- [ ] UX youexes all the things
- [ ] Design designs all the things
- [ ] Engineering engineers all the things
- [ ] Meets acceptance criteria
- [ ] Meets [QASP conditions](https://derisking-guide.18f.gov/qasp/)
- [ ] Presented in a review
- [ ] Includes screenshots or references to artifacts
- [ ] Tagged with the sprint where it was finished
- [ ] Archived

### If there's UI...
- [ ] Screen reader - Listen to the experience with a screen reader extension, ensure the information presented in order
- [ ] Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
- [ ] Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.

jadudm commented 11 hours ago

I have references in the wiki. The goal here is to keep both the stopword-cleaned text (possibly useful for searching?), and link it via id to the original text. It will change extract and the underlying SQLite table design, but otherwise it is of minimal impact.