BoredLabsHQ / Concord

Concord is an open-source AI plugin designed to connect community members to relevant conversations
GNU General Public License v3.0

[AI] Create a Custom Dataset to Evaluate and Fine-Tune BERT #15

Open sajz opened 2 weeks ago

sajz commented 2 weeks ago

Objective

Develop a custom dataset to test and evaluate our BERT topic model. The current dataset (default newsletter) makes it hard to assess the model's quality. A controlled dataset with predefined topics and structures lets us evaluate the model's performance more reliably and identify areas for improvement.

Description

Steps to Follow

  1. Define Predefined Topics:

    • Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
    • For each topic, list associated keywords, phrases, and lexical fields.
  2. Data Generation:

    • Option 1: Export from Element Messaging Client

      • Collect messages that correspond to the predefined topics.
      • Use the test room created on the Bored Labs server in Element.
      • Format messages consistently (e.g., JSON or CSV).
    • Option 2: Generate Synthetic Messages Using AI

      • Use AI tools to create messages for each topic.
      • Craft prompts that guide the AI to produce messages with desired content and style.
      • Ensure messages are diverse in vocabulary and structure.
      • Ensure the format matches what an export from Element would produce.
  3. Incorporate Topic Overlaps:

    • Design messages that intentionally include keywords from multiple topics.
    • Create scenarios where topics naturally intersect.
  4. Ensure Message Variety:

    • Vary message lengths (short, medium, long).
    • Include different writing styles and tones.
    • Use synonyms and related terms to enrich lexical diversity.
    • Test what happens when abbreviations or new words (for example, the name of a new project) are introduced.
  5. Organize and Format the Dataset:

    • Label each message with its corresponding topic(s) for validation purposes.
    • Store messages in a format compatible with our BERT model (e.g., plain text files, CSV).
  6. Quality Assurance:

    • Review the dataset to verify topic representation and message quality (human review).
    • Check for balance in the number of messages per topic.
    • Ensure that overlaps are correctly implemented.
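
Steps 1, 5, and 6 above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual tooling. The topic names come from the examples in step 1; the keyword lists, the sample messages, the `LabeledMessage` type, the CSV column names, and the helper functions are all hypothetical placeholders.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

# Step 1 (hypothetical): predefined topics with a few placeholder keywords each.
TOPICS = {
    "technology": ["software", "gpu", "api"],
    "health": ["sleep", "doctor", "vaccine"],
    "finance": ["budget", "stocks", "loan"],
}


@dataclass
class LabeledMessage:
    message_id: str
    text: str
    topics: list  # one or more keys of TOPICS; more than one means an overlap


# Hypothetical sample rows; m3 is an intentional topic overlap (step 3).
MESSAGES = [
    LabeledMessage("m1", "The new GPU driver broke our API tests.", ["technology"]),
    LabeledMessage("m2", "My doctor told me to fix my sleep schedule.", ["health"]),
    LabeledMessage(
        "m3",
        "Is a loan for a GPU cluster a sane budget decision?",
        ["technology", "finance"],
    ),
]


def write_csv(messages, handle):
    """Step 5: write one row per message, joining multiple topics with '|'."""
    writer = csv.writer(handle)
    writer.writerow(["message_id", "text", "topics"])
    for m in messages:
        unknown = set(m.topics) - set(TOPICS)
        if unknown:
            raise ValueError(f"{m.message_id}: unknown topics {sorted(unknown)}")
        writer.writerow([m.message_id, m.text, "|".join(m.topics)])


def topic_counts(messages):
    """Step 6: count messages per topic, keeping topics that have zero messages."""
    counts = Counter({topic: 0 for topic in TOPICS})
    for m in messages:
        counts.update(m.topics)
    return counts


def overlap_ids(messages):
    """Step 6: ids of messages labeled with more than one topic."""
    return [m.message_id for m in messages if len(m.topics) > 1]


def missing_keyword_topics(message):
    """Labeled topics for which no keyword occurs in the text.

    Uses crude case-insensitive substring matching, so it can produce false
    hits (e.g. 'api' inside 'rapid'); it is a sanity check, not a classifier.
    """
    text = message.text.lower()
    return [
        topic
        for topic in message.topics
        if not any(keyword in text for keyword in TOPICS[topic])
    ]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(MESSAGES, buffer)
    print(buffer.getvalue())
    print("per-topic counts:", dict(topic_counts(MESSAGES)))
    print("overlapping messages:", overlap_ids(MESSAGES))
```

The per-topic counts give a quick view of balance, and `missing_keyword_topics` flags messages whose labels are not backed by any listed keyword. Neither replaces the human review called for in step 6.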
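
For Option 2 in step 2, one way to keep synthetic messages close to real ones is to wrap each generated text in a Matrix-style `m.room.message` event, since Element is a Matrix client. The sketch below makes several assumptions that should be checked against a real export from the test room: that the export is JSON with a top-level `messages` list of events, and that only plain-text (`m.text`) messages are of interest. The function names, the `@synthetic:example.org` sender, and the random event ids are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def to_matrix_event(body, sender, sent_at):
    """Wrap a synthetic text in a dict shaped like a Matrix m.room.message event."""
    return {
        "type": "m.room.message",
        "sender": sender,
        # Matrix timestamps are milliseconds since the Unix epoch.
        "origin_server_ts": int(sent_at.timestamp() * 1000),
        # Placeholder id; real event ids are assigned by the homeserver.
        "event_id": f"${uuid.uuid4().hex}",
        "content": {"msgtype": "m.text", "body": body},
    }


def build_export(bodies, sender="@synthetic:example.org"):
    """Assumed export layout: a JSON object with a 'messages' list of events."""
    now = datetime.now(timezone.utc)
    return {"messages": [to_matrix_event(body, sender, now) for body in bodies]}


def extract_bodies(export):
    """Return the text of plain-text messages, skipping any other event types."""
    return [
        event["content"]["body"]
        for event in export.get("messages", [])
        if event.get("type") == "m.room.message"
        and event.get("content", {}).get("msgtype") == "m.text"
    ]


if __name__ == "__main__":
    export = build_export(["Anyone tried the new release?", "Budget review is on Friday."])
    print(json.dumps(export, indent=2))
    print(extract_bodies(export))
```

Running real exports and synthetic messages through the same `extract_bodies` step means the model sees both in an identical form, which keeps the comparison between them fair.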

Validation Criteria for this task

Expected Deliverables