deadbits / trs

🔭 Threat report analysis via LLM and Vector DB
https://trs.deadbits.ai
Apache License 2.0

Handle duplicate URLs #2

Closed · deadbits closed this issue 1 year ago

deadbits commented 1 year ago

Every time a URL is provided to a !command, it is retrieved, parsed, and stored in the vector database, even if that same URL has already been processed by another command in the same session or a previous one. The result is multiple identical entries in the vector database, and therefore multiple identical results when querying the vector db for context.

This kind of sucks. The vector db will only ever return n results per the config, so I'm not too worried about this filling up the LLM context length, but it makes the database needlessly large.

A simple fix could be to hash the URL after it's been processed and add it to a JSON file, check incoming URLs against that list, and skip storage if the URL has already been seen. See the sketch below.
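
A minimal sketch of that idea, not tied to the trs codebase. The file name `seen_urls.json` and the helper names are hypothetical:

```python
import hashlib
import json
from pathlib import Path

SEEN_FILE = Path("seen_urls.json")  # hypothetical cache location


def _url_hash(url: str) -> str:
    # Normalize lightly before hashing so trivially different forms match
    return hashlib.sha256(url.strip().lower().encode()).hexdigest()


def load_seen() -> set[str]:
    # Load the set of previously processed URL hashes, if any
    if SEEN_FILE.exists():
        return set(json.loads(SEEN_FILE.read_text()))
    return set()


def already_processed(url: str, seen: set[str]) -> bool:
    return _url_hash(url) in seen


def mark_seen(url: str, seen: set[str]) -> None:
    # Record the hash and persist the full set back to disk
    seen.add(_url_hash(url))
    SEEN_FILE.write_text(json.dumps(sorted(seen)))
```

Usage would be: call `already_processed` before fetching/storing a URL, and `mark_seen` after it's been written to the vector db.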

That fix also raises another question: should we be retrieving the URL at all if we've already seen it? A URL's text content is stored in full in the vector database, so it should be possible to use the source URL as a reference to recreate the original text from the stored data (if I also store metadata for the order of the text)... document management sucks.
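
For what the order metadata might look like: a rough sketch, assuming a generic `db.add` interface and made-up metadata keys (`source_url`, `chunk_index`), neither of which is from trs:

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    text: str
    source_url: str
    chunk_index: int  # position of the chunk in the original document


def store_chunks(db, url: str, chunks: list[str]) -> None:
    # Persist each chunk with enough metadata to reassemble the source later
    for i, chunk in enumerate(chunks):
        db.add(text=chunk, metadata={"source_url": url, "chunk_index": i})


def reconstruct_text(records: list[ChunkRecord], url: str) -> str:
    # Pull every chunk for the URL and stitch them back in original order
    ordered = sorted(
        (r for r in records if r.source_url == url),
        key=lambda r: r.chunk_index,
    )
    return "".join(r.text for r in ordered)
```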

deadbits commented 1 year ago

Done. Processed URLs are now appended to a text file. This isn't great, but it avoids duplicate db storage for the time being.
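
Roughly, the text-file variant looks like this; the path `processed_urls.txt` and the function names are assumptions, not the actual trs code:

```python
from pathlib import Path

PROCESSED_FILE = Path("processed_urls.txt")  # hypothetical path


def is_processed(url: str) -> bool:
    # One URL per line; linear scan is fine at this scale
    if not PROCESSED_FILE.exists():
        return False
    return url in PROCESSED_FILE.read_text().splitlines()


def record_url(url: str) -> None:
    # Append after a successful store so retries aren't skipped
    with PROCESSED_FILE.open("a") as fh:
        fh.write(url + "\n")
```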