Objective
Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (the default newsletter) makes it challenging to assess the model's quality. By creating a controlled dataset with predefined topics and structures, we can better evaluate the model's performance and identify areas for improvement.
Description
Data Sources:
Option 1: Export messages from the Element messaging client.
Option 2: Generate synthetic messages using AI tools, following the format of the Element messaging client.
Dataset Requirements:
Include messages across 5–10 predefined topics.
Introduce overlaps between topics to mimic real-world data complexity.
Ensure a variety of messages per topic, utilizing different keywords and lexical fields.
The dataset should be large enough (e.g., 500–1000 messages) for meaningful evaluation.
Steps to Follow
Define Predefined Topics:
Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
For each topic, list associated keywords, phrases, and lexical fields.
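To make the later generation and validation steps reproducible, the topic definitions can live in a small shared config that the scripts import. A minimal sketch in Python; the topic names and keyword lists below are illustrative placeholders, not the final choices:

```python
# Hypothetical topic/keyword config for the evaluation dataset.
# Replace these placeholder lists with the keywords and lexical
# fields agreed on for each predefined topic.
TOPICS = {
    "technology": ["software", "API", "deployment", "machine learning", "server"],
    "health": ["symptoms", "doctor", "treatment", "diagnosis", "clinic"],
    "finance": ["budget", "investment", "interest rate", "portfolio", "invoice"],
    "wellness": ["meditation", "sleep", "stress", "exercise", "mindfulness"],
    "entertainment": ["movie", "concert", "streaming", "album", "series"],
    "travel": ["flight", "itinerary", "hotel", "visa", "booking"],
}
```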
Data Generation:
Option 1: Export from Element Messaging Client
Collect messages that correspond to the predefined topics.
Use the test room created on the Bored Labs server in Element.
Format messages consistently (e.g., JSON or CSV).
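If we go with the export route, a small script can flatten the export into rows ready for labeling. A sketch, assuming the export is a JSON file with a "messages" array of Matrix events; the exact top-level structure of Element's export may differ by version, so verify against a real export file first:

```python
import csv
import json

def element_export_to_csv(export_path: str, out_path: str) -> None:
    """Flatten an Element JSON room export into a simple CSV.

    Assumes the export is a JSON object with a "messages" list of
    Matrix events; adjust if the actual export file is structured differently.
    """
    with open(export_path, encoding="utf-8") as f:
        export = json.load(f)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["event_id", "sender", "timestamp_ms", "body"])
        for event in export.get("messages", []):
            # Keep only plain text messages; skip images, reactions, edits, etc.
            if event.get("type") != "m.room.message":
                continue
            content = event.get("content", {})
            if content.get("msgtype") != "m.text":
                continue
            writer.writerow([
                event.get("event_id", ""),
                event.get("sender", ""),
                event.get("origin_server_ts", ""),
                content.get("body", ""),
            ])
```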
Option 2: Generate Synthetic Messages Using AI
Use AI tools to create messages for each topic.
Craft prompts that guide the AI to produce messages with desired content and style.
Ensure messages are diverse in vocabulary and structure.
Ensure the format matches what an Element export would produce.
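For the synthetic route, keeping the prompt as a template driven by the topic config makes the generation repeatable and easy to document. A sketch; the prompt wording and the build_prompt helper are assumptions, and the actual call to the AI tool is omitted since it depends on the tool chosen:

```python
# Hypothetical prompt template; tune the wording to the AI tool in use.
PROMPT_TEMPLATE = (
    "Write {n} short chat messages as they might appear in the Element "
    "messaging client. The messages should be about {topic} and naturally "
    "use some of these terms: {keywords}. Vary the length, tone, and "
    "vocabulary across messages. Return one message per line."
)

def build_prompt(topic: str, keywords: list[str], n: int = 20) -> str:
    """Fill the template for one topic; send the result to the chosen AI tool."""
    return PROMPT_TEMPLATE.format(n=n, topic=topic, keywords=", ".join(keywords))
```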
Incorporate Topic Overlaps:
Design messages that intentionally include keywords from multiple topics.
Create scenarios where topics naturally intersect.
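Overlap messages can reuse the same template with keywords mixed from two topics, so each message plausibly belongs to both. A sketch building on the hypothetical TOPICS config and build_prompt helper above:

```python
import random

def build_overlap_prompt(topic_a: str, topic_b: str, n: int = 10) -> str:
    """Mix keywords from two topics to prompt for intentionally overlapping messages."""
    # Uses TOPICS and build_prompt from the sketches above.
    mixed = random.sample(TOPICS[topic_a], 3) + random.sample(TOPICS[topic_b], 3)
    random.shuffle(mixed)
    return build_prompt(f"{topic_a} and {topic_b}", mixed, n=n)
```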
Ensure Message Variety:
Vary message lengths (short, medium, long).
Include different writing styles and tones.
Use synonyms and related terms to enrich lexical diversity.
Test what happens when abbreviations or new words (e.g., the name of a new project) are introduced.
Organize and Format the Dataset:
Label each message with its corresponding topic(s) for validation purposes.
Store messages in a format compatible with our BERT model (e.g., plain text files, CSV).
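One workable convention is a CSV with a multi-valued topics column; the pipe separator and column names below are assumptions to be settled before generation starts:

```python
import csv

def write_dataset(records: list[dict], out_path: str) -> None:
    """Write labeled messages to CSV; "topics" holds one or more labels per message."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["message", "topics"])
        writer.writeheader()
        for rec in records:
            writer.writerow({
                "message": rec["message"],
                # Overlap messages carry several labels, joined with "|".
                "topics": "|".join(rec["topics"]),
            })

# Example usage with a hypothetical overlap message:
# write_dataset(
#     [{"message": "Booked the flight, now comparing travel insurance rates.",
#       "topics": ["travel", "finance"]}],
#     "eval_dataset.csv",
# )
```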
Quality Assurance:
Review the dataset to verify topic representation and message quality (human review).
Check for balance in the number of messages per topic.
Ensure that overlaps are correctly implemented.
Validation Criteria for this task
Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
Correct Labeling: All messages are accurately labeled with their topic(s).
Data Quality: Messages are coherent, relevant, and free of errors.
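Most of these criteria are mechanical enough to check with a short script during quality assurance. A sketch against the CSV convention assumed above (same hypothetical column names and separator):

```python
import csv
from collections import Counter

def check_dataset(path: str, min_per_topic: int = 50) -> None:
    """Report per-topic counts and the share of multi-topic (overlap) messages."""
    topic_counts: Counter[str] = Counter()
    total = overlaps = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            topics = [t for t in row["topics"].split("|") if t]
            topic_counts.update(topics)
            total += 1
            overlaps += len(topics) > 1  # message labeled with 2+ topics

    for topic, count in sorted(topic_counts.items()):
        flag = "" if count >= min_per_topic else "  <-- below minimum"
        print(f"{topic}: {count}{flag}")
    share = overlaps / total if total else 0.0
    print(f"overlap share: {share:.1%} (target: 10-20%)")
```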
Expected Deliverables
A structured dataset containing all messages, ready for model input.
Documentation outlining the dataset creation process, including:
Topics selected and associated keywords.
Methodology for data collection/generation.
Any scripts, prompts, or tools used in the process, so we don't have to start from scratch if we need comparable datasets later.