DARPA-ASKEM / terarium

https://app.terarium.ai
Apache License 2.0

[BUG]: Compare model generates nonsensical summary of two models #4400

Closed: mwdchang closed this issue 3 months ago

mwdchang commented 3 months ago

https://app.staging.terarium.ai/projects/33f364c1-1da2-4cf7-a176-9870ef1a3ab6/workflow/a7a94429-6bd3-457a-a8c9-a57c2f3525e9?operator=77a5de63-a58f-4a9a-9456-aabd7a135623

When comparing an SEIRHD model with an SIR model, the summary starts talking about image classifiers, pronoun biases, and training cost. None of these are relevant to the actual inputs, which are compartmental epidemiological models: they are not image classifiers and require no training.

Likewise, comparing the same model against itself (SIR vs SIR) generated garbage: the summary does not recognize that the models are identical and instead hallucinates differences.
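The self-comparison case could be short-circuited before any LLM call is made. A minimal sketch in Python, assuming the models arrive as AMR-style JSON dictionaries; the function name and payload shape below are illustrative, not Terarium's actual code:

```python
import json


def models_identical(model_a: dict, model_b: dict) -> bool:
    """True when two AMR-style model payloads are structurally equal.

    Serializing with sorted keys makes the check independent of key
    ordering, so a model compared against itself is always caught.
    """
    return json.dumps(model_a, sort_keys=True) == json.dumps(model_b, sort_keys=True)


# A model compared against itself should short-circuit before any LLM call.
sir = {"header": {"name": "SIR"}, "model": {"states": ["S", "I", "R"]}}
if models_identical(sir, sir):
    print("Models are identical; skip the LLM comparison.")
```

A structural-equality check like this is cheap and deterministic, and it turns the SIR-vs-SIR case into a fact the UI can state with certainty instead of a prompt the LLM can hallucinate over.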

Example output for SEIRHD vs SIR:

To provide a detailed comparison of the metadata for multiple models, I will focus on several key aspects: 
model architecture, training data, performance metrics, intended use cases, ethical considerations, and limitations. 
This will help domain experts understand the nuances and trade-offs between the models. 

### Model Architecture 
  **Model A:** 
  - **Type:** Transformer-based 
  - **Layers:** 12 
  - **Parameters:** 110 million 
  - **Special Features:** Incorporates a novel attention mechanism to improve long-range dependencies. 

  **Model B:** 
  - **Type:** Convolutional Neural Network (CNN) 
  - **Layers:** 20 
  - **Parameters:** 50 million 
  - **Special Features:** Utilizes depthwise separable convolutions to reduce computational complexity. 

### Training Data 
  **Model A:** 
  - **Dataset Size:** 1 billion tokens 
  - **Sources:** Diverse text corpora including books, articles, and web pages. 
  - **Preprocessing:** Tokenization, lowercasing, removal of special characters. 
  **Model B:** 
  - **Dataset Size:** 10 million images 
  - **Sources:** Publicly available image datasets such as ImageNet. 
  - **Preprocessing:** Normalization, resizing, and data augmentation techniques like rotation and flipping. 

### Performance Metrics 
  **Model A:** 
  - **Accuracy:** 92% on benchmark text classification tasks. 
  - **F1 Score:** 0.89 
  - **Latency:** 50ms per inference on a standard GPU. 
  **Model B:** 
  - **Accuracy:** 95% on image recognition tasks. 
  - **F1 Score:** 0.93 
  - **Latency:** 30ms per inference on a standard GPU. 

 ### Intended Use Cases 
  **Model A:** 
  - **Primary Applications:** Text classification, sentiment analysis, and language translation. 
  - **Secondary Applications:** Named entity recognition (NER), summarization. 

  **Model B:** 
  - **Primary Applications:** Image classification, object detection, and facial recognition. 
  - **Secondary Applications:** Image segmentation, style transfer.

  ### Ethical Considerations 
  **Model A:** 
  - **Bias:** Potential for gender and racial bias due to imbalanced training data. For example, if the training data contains more male pronouns, the model may exhibit a bias towards male entities in tasks like NER. 
  - **Mitigation Strategies:** Implementing bias detection algorithms and augmenting the dataset with more balanced examples. 

  **Model B:** 
  - **Bias:** Risk of reinforcing stereotypes in image recognition tasks. For instance, if the dataset contains more images of certain ethnic groups in specific contexts, the model may develop biased associations. 
  - **Mitigation Strategies:** Using diverse and representative datasets, and applying fairness-aware algorithms during training. 

  ### Limitations 
  **Model A:** 
  - **Scalability:** High computational requirements for training and inference, making it less suitable for edge devices. 
  - **Generalization:** May struggle with domain-specific jargon or highly specialized texts not represented in the training data. 

  **Model B:** 
  - **Scalability:** While more efficient than Model A, it still requires significant computational resources for training. 
  - **Generalization:** Performance drops significantly on images that differ from the training data, such as those with unusual lighting or occlusions. 

  ### Conclusion 
  Both models have their strengths and weaknesses, making them suitable for different types of tasks. Model A excels in natural language processing applications but requires substantial computational resources and careful bias mitigation. Model B is highly effective for image-related tasks and is relatively more efficient but still faces challenges in generalization and bias. Understanding these details can help domain experts make informed decisions about which model to deploy based on their specific needs and constraints.
YohannParis commented 3 months ago

We need to review the test, generate a better model card when no document is attached, and add a new feature to link a document to a model.

YohannParis commented 3 months ago

Ask @j2whiting to see how we can improve the relevance of the answers.
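One direction, sketched below in Python, is to ground the comparison prompt in the models' actual content so the summary cannot drift to unrelated topics. This is an illustrative sketch, not Terarium's implementation; the function name and prompt wording are assumptions:

```python
import json


def build_comparison_prompt(model_a: dict, model_b: dict) -> str:
    """Build a comparison prompt grounded in the real model payloads.

    Embedding the actual states, parameters, and descriptions constrains
    the summary to the supplied models instead of letting the LLM fall
    back on generic ML model-card boilerplate (architectures, training
    data, image classifiers).
    """
    return (
        "Compare the two epidemiological models defined as JSON below. "
        "Base every statement on these definitions only; do not discuss "
        "training data, neural architectures, or anything absent from the "
        "JSON. If the models are identical, say so explicitly.\n\n"
        f"Model A:\n{json.dumps(model_a, indent=2)}\n\n"
        f"Model B:\n{json.dumps(model_b, indent=2)}"
    )
```

Pairing a grounded prompt like this with an explicit instruction to acknowledge identical inputs would address both failure modes reported above.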