UW-Madison-DSI / ask-xDD

Retrieval-Augmented Generation (RAG) on 17M full text journal articles.
https://xdd.wisc.edu/
MIT License

Add paragraph sequence field #118

Closed JasonLo closed 1 month ago

JasonLo commented 1 month ago

To support a downstream feature in the USGS project, we need to add a field in Weaviate to indicate paragraph order within a document. The required changes are:

  1. Redefine the Weaviate schema.
  2. Update the preprocessor to include paragraph_order.
  3. Update the ingest pipeline.
  4. Rebuild Weaviate (if necessary).

Assuming the raw text remains unchanged, we can likely skip re-embedding by matching on paragraph_hash and reusing the old embeddings. This should be safe and efficient.
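
For illustration, here is a minimal sketch of what a preprocessor that emits paragraph_order alongside paragraph_hash could look like. This is not the actual ask-xDD preprocessor: the `Paragraph` class, `hash_text`, `preprocess`, the SHA-256 hash, and the blank-line paragraph splitting are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    docid: str
    paragraph_order: int  # 0-based position of the paragraph within its document
    paragraph_hash: str   # hash of the raw paragraph text
    text: str


def hash_text(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 bytes; the real pipeline's hash may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def preprocess(docid: str, raw_text: str) -> list[Paragraph]:
    # Assumption: paragraphs are separated by blank lines.
    chunks = [chunk.strip() for chunk in raw_text.split("\n\n")]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty chunks
    return [
        Paragraph(docid, order, hash_text(chunk), chunk)
        for order, chunk in enumerate(chunks)
    ]
```

Because the hash depends only on the paragraph text, adding the order field does not change it, which is what would make reusing the old embeddings possible.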

JasonLo commented 1 month ago

We discussed this, and the update should proceed as follows:

Steps:

  1. Gather a master list of docid.
  2. Subset docid from Geoarchive, CriticalMASS.
  3. For each docid, call preprocessorv2 (v1 + paragraph ordering).
  4. Compare hashed_text for each paragraph. If unchanged, retrieve embedding data from existing Weaviate.
  5. If changed, drop paragraphs with the same docid and reprocess everything in it.
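
Steps 4 and 5 can be sketched roughly as below, for a single docid. This is only an illustration under stated assumptions: `patch_document`, the `existing` mapping (hashed_text to stored embedding, as it might be fetched from Weaviate), and the `embed` callable are hypothetical names, not part of the actual pipeline or the Weaviate client.

```python
from typing import Callable

Embedding = list[float]


def patch_document(
    new_paragraphs: list[tuple[str, str]],  # (hashed_text, text) in document order
    existing: dict[str, Embedding],         # hypothetical: hashed_text -> stored embedding
    embed: Callable[[str], Embedding],      # hypothetical embedding function
) -> tuple[str, list[dict]]:
    new_hashes = {h for h, _ in new_paragraphs}
    # Step 4: the document is unchanged only if every hash matches the stored set.
    unchanged = new_hashes == set(existing)

    records = []
    for order, (h, text) in enumerate(new_paragraphs):
        # Step 5: on any mismatch, every paragraph in the document is re-embedded.
        vector = existing[h] if unchanged else embed(text)
        records.append(
            {"paragraph_order": order, "hashed_text": h, "vector": vector}
        )
    return ("reuse" if unchanged else "reprocess"), records
```

When the result is "reprocess", the caller would first drop the existing paragraphs for that docid and then insert the returned records.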

iross commented 1 month ago

Additional step 0: Clear space by dropping the Passage class (plus cleaning up other junk on that machine) and back up the existing collection. Backup started as of 2024-07-29 10:45.

"To support a downstream feature in the USGS project"

Is there an issue outlining this that we could link to here?

(edited format)

JasonLo commented 1 month ago

https://github.com/UW-xDD/text2graph_llm/issues/20

JasonLo commented 1 month ago

@ilmcconnell , @iross

Started the patch for paragraph order, but it's slower than expected due to the lack of a batch update function. Estimated completion time: 4-5 days. Will check progress on 8/7.

JasonLo commented 1 month ago

#119

The patch has been completed ahead of schedule.