Closed JasonLo closed 1 month ago
As discussed, the update should proceed as follows:
Steps:
- […] `docid`.
- […] `docid` from Geoarchive, CriticalMASS.
- […] `docid`, call preprocessorv2 (v1 + paragraph ordering).
- […] `hashed_text` for each paragraph. If unchanged, retrieve embedding data from existing Weaviate.
- […] `docid`
and reprocess everything in it.

Additional step 0: Clear space by dropping the Passage class (plus cleaning up other junk on that machine) and back up the existing collection. Backup started as of 2024-07-29 10:45.
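A rough sketch of the hash check in the steps above, under one reading of them: a document keeps its stored embeddings only if every paragraph hash is already known, otherwise the whole `docid` is reprocessed. The SHA-256 hash, the `plan_document` name, and the plain dict standing in for a Weaviate lookup are all assumptions for illustration, not the project's actual code.

```python
import hashlib


def hash_text(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 bytes; the real hashed_text scheme may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_document(
    paragraphs: list[str],
    existing: dict[str, list[float]],
) -> tuple[str, list[dict]]:
    """Decide whether a document can reuse its stored embeddings.

    `existing` is a hypothetical hash -> embedding mapping standing in for
    whatever is fetched from the current Weaviate collection.
    Returns ("reuse", records) when every paragraph hash is already known,
    otherwise ("reprocess", []) so the whole docid is rebuilt.
    """
    records = []
    for order, text in enumerate(paragraphs):
        h = hash_text(text)
        if h not in existing:
            # At least one paragraph changed: reprocess everything in this docid.
            return "reprocess", []
        records.append(
            {"paragraph_order": order, "hashed_text": h, "vector": existing[h]}
        )
    return "reuse", records
```

Treating any single mismatch as a reason to reprocess the whole document keeps the logic simple, at the cost of re-embedding unchanged paragraphs in documents that did change.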
To support a downstream feature in the USGS project
Is there an issue outlining this that we could link to here?
(edited format)
@ilmcconnell, @iross
Started the patch for paragraph order, but it's slower than expected due to the lack of a batch update function. Estimated completion time: 4-5 days. Will check progress on 8/7.
The patch has been completed ahead of schedule.
To support a downstream feature in the USGS project, we need to add a field in Weaviate to indicate paragraph order within a document. The required changes are:
- Add the `paragraph_order` field.

Assuming the raw text remains unchanged, we can likely skip re-embedding by using the old embeddings with `paragraph_hash`. This should be safe and efficient.
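One way the patch could look, as a sketch only: recompute each paragraph's hash in document order and attach `paragraph_order` to the stored object with the same `paragraph_hash`, leaving its vector untouched. The hash function (SHA-256), the `assign_order` helper, and the list of dicts standing in for stored Weaviate objects are assumptions; no real Weaviate client calls are shown.

```python
import hashlib
from collections import defaultdict, deque


def paragraph_hash(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 bytes; the stored hash may be computed differently.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def assign_order(paragraphs: list[str], stored: list[dict]) -> list[dict]:
    """Attach paragraph_order to stored objects by matching on paragraph_hash.

    `stored` is a hypothetical list of existing objects, each with a
    "paragraph_hash" and a "vector". Raises KeyError if a paragraph has no
    stored match, meaning the raw text changed and must be re-embedded.
    """
    # Queue objects per hash so repeated paragraphs each get their own object.
    by_hash = defaultdict(deque)
    for obj in stored:
        by_hash[obj["paragraph_hash"]].append(obj)

    patched = []
    for order, text in enumerate(paragraphs):
        h = paragraph_hash(text)
        if not by_hash[h]:
            raise KeyError(f"no stored embedding for paragraph {order}")
        obj = by_hash[h].popleft()
        # Copy the object, keeping the old vector, and add the new field.
        patched.append({**obj, "paragraph_order": order})
    return patched
```

The per-hash queue matters when a document repeats a paragraph verbatim: identical text yields identical hashes, so each occurrence needs to consume a distinct stored object.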