BoredLabsHQ / Concord

Concord is an open-source AI plugin designed to connect community members to relevant conversations
GNU General Public License v3.0

[AI] Identify similar BertTopics #29

Open sajz opened 2 weeks ago

sajz commented 2 weeks ago

To avoid creating duplicates when a message yields a topic that is similar to one already in the channel, we can use a topic similarity threshold to decide whether the new topic should be merged into an existing topic or created as a new one. Here’s a proposed approach:

Proposed Steps

  1. Compute Topic Similarity:

    • When BERT identifies a new topic in a message, compare the new topic’s semantic_vector with the semantic_vector of each existing topic linked to the channel through an ASSOCIATED_WITH relationship.
    • Use a similarity metric, such as cosine similarity, between the new topic’s vector and each existing topic’s vector.
  2. Set a Similarity Threshold:

    • Define a similarity threshold, e.g., 0.8, above which the new topic is considered “similar enough” to an existing topic. This threshold can be adjusted based on testing.
  3. Merge or Create Logic:

    • If Similarity is Above Threshold:
      • Merge the new topic with the existing topic that has the highest similarity score.
      • Update the existing topic’s overall_score using the amplify_score function based on the relevance of the new topic in the message.
    • If Similarity is Below Threshold for All Existing Topics:
      • Treat the new topic as distinct, create a new Topic node, and establish the ASSOCIATED_WITH relationship for tracking in this channel.
  4. Optional: Store Relatedness Data:

    • For transparency and future adjustments, record similarity data in the RELATED_TO relationship between topics. This way, if similar topics keep emerging, you can track these relationships for potential reorganization or clustering later.
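
Steps 1 to 3 could be sketched roughly as follows. This is only an illustration under assumptions, not Concord's actual code: the function names `cosine_similarity` and `find_merge_target` are hypothetical, topics are represented as a plain mapping from name to vector rather than graph nodes, and the 0.8 default is the example threshold from step 2.

```python
import math
from typing import Mapping, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def find_merge_target(
    new_vector: Sequence[float],
    existing_vectors: Mapping[str, Sequence[float]],
    threshold: float = 0.8,
) -> Tuple[Optional[str], float]:
    """Return (topic_name, score) for the most similar existing topic if its
    score is above the threshold, otherwise (None, best_score_seen).

    existing_vectors is a hypothetical stand-in for the semantic_vector of
    each topic the channel is ASSOCIATED_WITH.
    """
    best_name: Optional[str] = None
    best_score = 0.0
    for name, vector in existing_vectors.items():
        score = cosine_similarity(new_vector, vector)
        if best_name is None or score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score > threshold:
        return best_name, best_score  # merge into this topic
    return None, best_score  # no match: create a new Topic node
```

A `None` name means the caller should create a new Topic node; the returned score is still available if we want to store it on a RELATED_TO relationship as in step 4.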

Example Flow:

  1. Analyze New Message:

    • A new topic appears in the message with a semantic_vector.
  2. Similarity Comparison:

    • Compute cosine similarity between the new topic’s semantic_vector and the semantic_vector of each existing topic in the channel.
  3. Apply Threshold Decision:

    • Above Threshold (e.g., 0.8): Update the most similar existing topic’s score using amplify_score.
    • Below Threshold: Create a new topic entry and start tracking it as a distinct topic.
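
To make the flow concrete, here is a small in-memory sketch of it. Everything in it is an assumption for illustration: the `Topic` dataclass and `process_topic` are hypothetical, the channel is a plain list instead of graph nodes with ASSOCIATED_WITH relationships, the `amplify_score` body is a placeholder and not the project's real function, and using the message relevance as the initial score of a new topic is just one possible choice.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Topic:
    name: str
    semantic_vector: List[float]
    overall_score: float


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def amplify_score(current: float, relevance: float) -> float:
    # Placeholder only: moves the score toward 1.0 in proportion to relevance.
    # The real amplify_score in the project may behave differently.
    return current + (1.0 - current) * relevance


def process_topic(
    channel_topics: List[Topic],
    name: str,
    semantic_vector: List[float],
    relevance: float,
    threshold: float = 0.8,
) -> str:
    """Merge the new topic into its closest match or add it to the channel.

    Returns "merged" or "created" so the caller can see which branch ran.
    """
    best = None
    best_score = 0.0
    for topic in channel_topics:
        score = cosine_similarity(semantic_vector, topic.semantic_vector)
        if best is None or score > best_score:
            best, best_score = topic, score

    if best is not None and best_score > threshold:
        # Above threshold: boost the most similar existing topic.
        best.overall_score = amplify_score(best.overall_score, relevance)
        return "merged"

    # Below threshold for all existing topics: track it as a distinct topic.
    channel_topics.append(Topic(name, list(semantic_vector), relevance))
    return "created"
```

For example, with a channel that only contains a topic whose vector is `[1.0, 0.0]`, a new topic with vector `[0.9, 0.1]` would be merged, while one with vector `[0.0, 1.0]` would be created as a new entry.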