PublicDataWorks / verdad-frontend


VER-231: [Backend] Generate Structured Text Documents for Snippet Vector Embeddings #207

Open nhphong opened 12 hours ago

nhphong commented 12 hours ago

We need to generate a structured text document for each snippet, from which we will create the vector embeddings for the "related snippets" feature. Each document should include the snippet's title, summary, content, topics, and keywords so that semantic search is accurate and efficient.

Acceptance Criteria:

  1. Document Structure:
    • Each snippet document should include the following fields:
      • Title: {snippet['title']['english']}
      • Summary: {snippet['summary']['english']}
      • Content: {snippet['transcription']}
      • Topics: {', '.join(cat['english'] for cat in snippet['disinformation_categories'])}
      • Keywords: {', '.join(snippet['keywords_detected'])}
  2. Rationale Compliance:
    • The document structure should follow the rationale:
      • Title provides immediate context.
      • Summary offers a quick overview.
      • Main content contains core information.
      • Topics and keywords help with categorization and search.
  3. Benefits Implementation:
    • Ensure the structure maintains hierarchical importance and semantic relationships.
    • Keep documents within token limits for efficient processing (well under 8,191 tokens).
  4. Flexibility in Detail:
    • Determine the appropriate level of detail to balance the breadth and focus of search results based on user needs.
  5. Testing and Validation:
    • Validate that generated documents are accurate and meet the specified structure.
    • Test the impact of different levels of detail on search result quality.
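A minimal sketch of the document structure from criterion 1, assuming each snippet arrives as a Python dict shaped the way the field templates above imply. The function name `build_snippet_document` is hypothetical, and treating a missing topic or keyword list as empty is an assumption, not a decided behavior.

```python
from typing import Any, Mapping


def build_snippet_document(snippet: Mapping[str, Any]) -> str:
    """Render one snippet as a five-field text document for embedding.

    Hypothetical helper: the field order follows the acceptance criteria
    (title, summary, content, topics, keywords).
    """
    # Assumption: a snippet without categories or keywords yields an empty field
    # rather than an error.
    topics = ", ".join(
        cat["english"] for cat in snippet.get("disinformation_categories", [])
    )
    keywords = ", ".join(snippet.get("keywords_detected", []))

    lines = [
        f"Title: {snippet['title']['english']}",
        f"Summary: {snippet['summary']['english']}",
        f"Content: {snippet['transcription']}",
        f"Topics: {topics}",
        f"Keywords: {keywords}",
    ]
    return "\n".join(lines)
```

Keeping the title and summary first means they survive if the content later has to be trimmed to fit the token limit.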

Tasks:

  1. Analyze existing snippet data and determine how to extract and format the required information.
  2. Develop a script or module to generate structured text documents for each snippet.
  3. Implement logic to balance detail levels in the documents based on expected search result needs.
  4. Conduct tests to ensure documents are correctly formatted and contain expected information.
  5. Evaluate the performance of vector embeddings created from these documents in the "related snippets" feature.
  6. Document the process and considerations for generating snippet documents.
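For task 4, a check along the following lines could confirm that a generated document has the five labeled fields in order and stays under the token limit quoted in the acceptance criteria. This is a sketch under stated assumptions: `validate_document` and `estimate_tokens` are hypothetical names, and the four-characters-per-token estimate is only a rough proxy that would need to be replaced by the tokenizer of whichever embedding model is chosen.

```python
FIELD_LABELS = ("Title", "Summary", "Content", "Topics", "Keywords")
MAX_TOKENS = 8191  # limit quoted in the acceptance criteria


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return (len(text) + 3) // 4


def validate_document(doc: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    position = 0
    for index, label in enumerate(FIELD_LABELS):
        # The first label must open the document; later labels must start a line.
        marker = f"{label}: " if index == 0 else f"\n{label}: "
        found = doc.find(marker, position)
        if found == -1 or (index == 0 and found != 0):
            problems.append(f"missing or misplaced field: {label}")
            continue
        position = found + len(marker)
    if estimate_tokens(doc) > max_tokens:
        problems.append("document exceeds token budget")
    return problems
```

The search is position-based rather than line-based so that a transcription containing line breaks does not cause false failures.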

