Objective
Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (the default newsletter) makes it challenging to assess the model's quality. By creating a controlled dataset with predefined topics and structures, we can better evaluate the model's performance and identify areas for improvement.
Description
Data Sources:
Option 1: Export messages from the Element messaging client.
Option 2: Generate synthetic messages using AI tools, following the format of the Element messaging client.
Dataset Requirements:
Include messages across 5–10 predefined topics.
Introduce overlaps between topics to mimic real-world data complexity.
Ensure a variety of messages per topic, utilizing different keywords and lexical fields.
The dataset should be large enough (e.g., 500–1000 messages) for meaningful evaluation.
Steps to Follow
Define Predefined Topics:
Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
For each topic, list associated keywords, phrases, and lexical fields.
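To make the later generation and validation steps reproducible, the topic definitions can live in a small shared config that the scripts import. A minimal sketch in Python; the topic names and keyword lists below are illustrative placeholders, not the final choices:

```python
# Hypothetical topic/keyword config for the evaluation dataset.
# Replace these placeholder lists with the keywords and lexical
# fields agreed on for each predefined topic.
TOPICS = {
    "technology": ["software", "API", "deployment", "machine learning", "server"],
    "health": ["symptoms", "doctor", "treatment", "diagnosis", "clinic"],
    "finance": ["budget", "investment", "interest rate", "portfolio", "invoice"],
    "wellness": ["meditation", "sleep", "stress", "exercise", "mindfulness"],
    "entertainment": ["movie", "concert", "streaming", "album", "series"],
    "travel": ["flight", "itinerary", "hotel", "visa", "booking"],
}
```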
Data Generation:
Option 1: Export from Element Messaging Client
Collect messages that correspond to the predefined topics.
Use the test room created on the Bored Labs server in Element.
Format messages consistently (e.g., JSON or CSV).
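If we go with the export route, a small script can flatten the export into rows ready for labeling. A sketch, assuming the export is a JSON file with a "messages" array of Matrix events; the exact top-level structure of Element's export may differ by version, so verify against a real export file first:

```python
import csv
import json

def element_export_to_csv(export_path: str, out_path: str) -> None:
    """Flatten an Element JSON room export into a simple CSV.

    Assumes the export is a JSON object with a "messages" list of
    Matrix events; adjust if the actual export file is structured differently.
    """
    with open(export_path, encoding="utf-8") as f:
        export = json.load(f)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["event_id", "sender", "timestamp_ms", "body"])
        for event in export.get("messages", []):
            # Keep only plain text messages; skip images, reactions, edits, etc.
            if event.get("type") != "m.room.message":
                continue
            content = event.get("content", {})
            if content.get("msgtype") != "m.text":
                continue
            writer.writerow([
                event.get("event_id", ""),
                event.get("sender", ""),
                event.get("origin_server_ts", ""),
                content.get("body", ""),
            ])
```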
Option 2: Generate Synthetic Messages Using AI
Use AI tools to create messages for each topic.
Craft prompts that guide the AI to produce messages with desired content and style.
Ensure messages are diverse in vocabulary and structure.
Ensure the format matches what an Element export would produce.
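For the synthetic route, keeping the prompt as a template driven by the topic config makes the generation repeatable and easy to document. A sketch; the prompt wording and the build_prompt helper are assumptions, and the actual call to the AI tool is omitted since it depends on the tool chosen:

```python
# Hypothetical prompt template; tune the wording to the AI tool in use.
PROMPT_TEMPLATE = (
    "Write {n} short chat messages as they might appear in the Element "
    "messaging client. The messages should be about {topic} and naturally "
    "use some of these terms: {keywords}. Vary the length, tone, and "
    "vocabulary across messages. Return one message per line."
)

def build_prompt(topic: str, keywords: list[str], n: int = 20) -> str:
    """Fill the template for one topic; send the result to the chosen AI tool."""
    return PROMPT_TEMPLATE.format(n=n, topic=topic, keywords=", ".join(keywords))
```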
Incorporate Topic Overlaps:
Design messages that intentionally include keywords from multiple topics.
Create scenarios where topics naturally intersect.
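Overlap messages can reuse the same template with keywords mixed from two topics, so each message plausibly belongs to both. A sketch building on the hypothetical TOPICS config and build_prompt helper above:

```python
import random

def build_overlap_prompt(topic_a: str, topic_b: str, n: int = 10) -> str:
    """Mix keywords from two topics to prompt for intentionally overlapping messages."""
    # Uses TOPICS and build_prompt from the sketches above.
    mixed = random.sample(TOPICS[topic_a], 3) + random.sample(TOPICS[topic_b], 3)
    random.shuffle(mixed)
    return build_prompt(f"{topic_a} and {topic_b}", mixed, n=n)
```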
Ensure Message Variety:
Vary message lengths (short, medium, long).
Include different writing styles and tones.
Use synonyms and related terms to enrich lexical diversity.
Test what happens when abbreviations or new words (e.g., the name of a new project) are introduced.
Organize and Format the Dataset:
Label each message with its corresponding topic(s) for validation purposes.
Store messages in a format compatible with our BERT model (e.g., plain text files, CSV).
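One workable convention is a CSV with a multi-valued topics column; the pipe separator and column names below are assumptions to be settled before generation starts:

```python
import csv

def write_dataset(records: list[dict], out_path: str) -> None:
    """Write labeled messages to CSV; "topics" holds one or more labels per message."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["message", "topics"])
        writer.writeheader()
        for rec in records:
            writer.writerow({
                "message": rec["message"],
                # Overlap messages carry several labels, joined with "|".
                "topics": "|".join(rec["topics"]),
            })

# Example usage with a hypothetical overlap message:
# write_dataset(
#     [{"message": "Booked the flight, now comparing travel insurance rates.",
#       "topics": ["travel", "finance"]}],
#     "eval_dataset.csv",
# )
```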
Quality Assurance:
Review the dataset to verify topic representation and message quality (human review).
Check for balance in the number of messages per topic.
Ensure that overlaps are correctly implemented.
Validation Criteria for this task
Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
Correct Labeling: All messages are accurately labeled with their topic(s).
Data Quality: Messages are coherent, relevant, and free of errors.
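Most of these criteria are mechanical enough to check with a short script during quality assurance. A sketch against the CSV convention assumed above (same hypothetical column names and separator):

```python
import csv
from collections import Counter

def check_dataset(path: str, min_per_topic: int = 50) -> None:
    """Report per-topic counts and the share of multi-topic (overlap) messages."""
    topic_counts: Counter[str] = Counter()
    total = overlaps = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            topics = [t for t in row["topics"].split("|") if t]
            topic_counts.update(topics)
            total += 1
            overlaps += len(topics) > 1  # message labeled with 2+ topics

    for topic, count in sorted(topic_counts.items()):
        flag = "" if count >= min_per_topic else "  <-- below minimum"
        print(f"{topic}: {count}{flag}")
    share = overlaps / total if total else 0.0
    print(f"overlap share: {share:.1%} (target: 10-20%)")
```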
Expected Deliverables
A structured dataset containing all messages, ready for model input.
Documentation outlining the dataset creation process, including:
Topics selected and associated keywords.
Methodology for data collection/generation.
Any scripts, prompts, or tools used in the process, so we don't have to start from scratch if we need comparable datasets later.