arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Rewrite README to be more novice-friendly #462

Open clover1980 opened 4 days ago

clover1980 commented 4 days ago

The current README is quite messy and very hard for novices to understand. I'm offering to rewrite it in a more consumable form. The description below was prepared by a Marco-o1 + Qwen merge model (which is very good and small) based on all the data I uploaded.

Available Merge Methods:

MergeKit offers a variety of merge methods that cater to different needs and computational capacities; example configuration sketches follow the list:

  1. Linear (Model Soups):

    • Description: A simple weighted average of model parameters.
    • Use Cases: Best suited for combining models with similar architectures and initializations.
    • Pros: Fast, resource-efficient, easy to implement.
    • Cons: May not capture complex interactions between different models.
  2. SLERP (Spherical Linear Interpolation):

    • Description: Spherically interpolates between model parameters of two models.
    • Use Cases: Ideal for smoothly blending exactly two models that were fine-tuned differently but share a common base model.
    • Pros: Preserves certain geometric properties, can handle more diversity in models.
    • Cons: Requires careful parameter tuning; might be computationally intensive.
  3. Task Arithmetic:

    • Description: Computes task vectors by subtracting a base model's parameters from each source model and performs arithmetic operations on these vectors.
    • Use Cases: Excellent for merging models that share a common ancestor, especially when fine-tuned for specific tasks.
    • Pros: Encourages semantic meaningfulness in merged weights; effective for combining multiple specialized models.
    • Cons: May require extensive computational resources.
  4. TIES (Trim, Elect Sign & Merge):

    • Description: Sparsifies task vectors and applies a sign consensus algorithm to reduce interference between models.
    • Use Cases: Suitable when merging a large number of models while retaining their strengths.
    • Pros: Can handle more complex scenarios with multiple models; preserves model diversity.
    • Cons: More computationally demanding; requires careful parameter selection.
  5. DARE (Drop And REscale):

    • Description: Applies random pruning to task vectors followed by rescaling to retain important changes while reducing interference.
    • Use Cases: Best for scenarios where maintaining key model features is crucial without overloading the system.
    • Pros: Balances performance and resource usage effectively; retains critical aspects of merged models.
    • Cons: May not be suitable for all types of models or tasks.
  6. Model Breadcrumbs:

    • Description: Extends task arithmetic by discarding both small and extremely large differences from the base model, enhancing sparsity.
    • Use Cases: Ideal for merging multiple models with diverse characteristics while ensuring a balanced inclusion of their features.
    • Pros: Efficiently integrates multiple models without significant loss in performance; handles varied architectures well.
    • Cons: Requires detailed parameter tuning to achieve optimal results.
  5. DELLA:

    • Description: Builds upon DARE by using adaptive pruning based on parameter magnitudes, followed by rescaling for final merging.
    • Use Cases: Suitable when you need a fine-grained control over which aspects of the models to merge.
    • Pros: Offers more nuanced control; can tailor the merged model's behavior closely to desired outcomes.
    • Cons: More complex implementation; may require deeper computational resources.
  8. Passthrough:

    • Description: A no-op method that passes input tensors through unmodified, typically used for layer stacking or when only one input model is involved.
    • Use Cases: Useful in scenarios where you want to stack multiple models sequentially without altering their parameters directly.
    • Pros: Minimal computational overhead; straightforward implementation.
    • Cons: Limited functionality; not suitable for merging two or more models comprehensively.
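
To make these methods more concrete, below are hedged configuration sketches in mergekit's YAML format. First, a minimal linear ("model soup") merge; the model names are placeholders, and the weights are arbitrary starting points rather than recommendations.

```yaml
# Linear merge: a simple weighted average of two same-architecture fine-tunes.
# "org/model-a" and "org/model-b" are placeholder model names.
models:
  - model: org/model-a
    parameters:
      weight: 0.5
  - model: org/model-b
    parameters:
      weight: 0.5
merge_method: linear
dtype: float16
```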
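
A SLERP sketch follows the same pattern but takes exactly two models plus a base_model and an interpolation factor t. The layer_range of [0, 32] assumes a 32-layer model and is purely illustrative.

```yaml
# SLERP: spherical interpolation between two fine-tunes of the same base.
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 32]   # assumes a 32-layer model; adjust to yours
      - model: org/model-b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/model-a
parameters:
  t: 0.5                       # 0 = all model-a, 1 = all model-b
dtype: float16
```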
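
Task Arithmetic, TIES, DARE, Model Breadcrumbs, and DELLA all share the same task-vector pattern: a declared base_model plus per-model weights, with density-style knobs for the sparsifying variants. The sketch below uses ties; the parameter names shown are the commonly documented ones, but the exact options for each variant should be checked against the mergekit docs.

```yaml
# TIES: sparsify each model's delta from the base, resolve sign conflicts,
# then combine. Swapping merge_method (e.g. to dare_ties, breadcrumbs, or
# della) changes how the deltas are pruned and combined.
models:
  - model: org/finetune-math
    parameters:
      density: 0.5    # fraction of each delta's parameters to keep
      weight: 0.5
  - model: org/finetune-code
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: org/base-model
parameters:
  normalize: true
dtype: float16
```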
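
Passthrough is mostly used for layer stacking ("frankenmerges"), where slices of layers from different models are concatenated rather than averaged. A minimal sketch; the layer indices assume 32-layer models and are illustrative only.

```yaml
# Passthrough / layer stacking: build a deeper model by concatenating
# layer ranges; tensors are copied through unmodified.
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 24]
  - sources:
      - model: org/model-b
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
```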

Recommendations Based on MergeKit's Capabilities:

  1. For Beginners and Resource-Constrained Environments:

    • Start with the Linear (Model Soups) method due to its simplicity and lower resource requirements.
    • Example Model Pairing:
      • Models: Two fine-tunes of the same base model (e.g., two Llama fine-tunes of identical size)
      • Method: Linear
  2. For Advanced Users Seeking Enhanced Performance:

    • Consider using the Task Arithmetic or TIES methods for more nuanced merging.
    • Example Model Pairing:
      • Models: Fine-tunes of a shared base (e.g., chat and code fine-tunes of Mistral-7B)
      • Method: Task Arithmetic
  3. When Dealing with a Large Ensemble of Models:

    • Utilize the DARE or Model Breadcrumbs methods to manage and integrate multiple models efficiently.
    • Example Model Pairing:
      • Models: Many fine-tunes of the same open base model (e.g., several Llama-2 fine-tunes)
      • Method: DARE
  4. For Fine-Grained, Adaptive Merges:

    • Explore the DELLA method for adaptive, magnitude-based control over which parts of each model's changes are kept.
    • Example Model Pairing:
      • Models: Several same-architecture fine-tunes where interference between their updates is a concern
      • Method: DELLA
  5. Special Cases:

    • If you're aiming to create a mixture of experts, look into MergeKit's Mixture of Experts (MoE) merging via the mergekit-moe tool; a config sketch follows this list.
    • Example Model Pairing:
      • Models: Several dense fine-tunes used as experts, each routed by example prompts
      • Method: MoE (mergekit-moe)
  6. Utilizing GPU or CPU Execution:

    • Leverage GPU acceleration if available, especially with the more computationally intensive methods such as SLERP, Task Arithmetic, TIES, DARE, and DELLA (see the example invocation after this list).
    • For CPU-based merging, the Linear method is more suitable due to its lower resource demands.
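
For the mixture-of-experts case in item 5, mergekit ships a separate mergekit-moe entry point with its own YAML schema. A minimal sketch, assuming placeholder model names and prompts; check the mergekit-moe documentation for the current fields and supported gate modes.

```yaml
# mergekit-moe: combine dense fine-tunes into a Mixtral-style MoE.
# Each expert is routed based on similarity to its positive prompts.
base_model: org/base-instruct-model
gate_mode: hidden          # other documented modes include cheap_embed and random
dtype: bfloat16
experts:
  - source_model: org/finetune-code
    positive_prompts:
      - "Write a Python function that"
  - source_model: org/finetune-math
    positive_prompts:
      - "Solve the following equation"
```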
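
For item 6, the merge itself is launched with the mergekit-yaml command once a configuration file exists; --cuda enables GPU execution and omitting it falls back to CPU. Paths are placeholders, and `mergekit-yaml --help` lists the current flags.

```sh
# GPU-accelerated merge; drop --cuda to run on CPU only.
mergekit-yaml ./merge-config.yml ./merged-model --cuda
```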

Additional Considerations:

Example Scenario:

Suppose you want to merge two Llama-based fine-tunes that share the same base model, for example an instruction-tuned variant and a code-tuned variant, to broaden the merged model's capabilities while keeping the shared architecture intact. You decide to use the Task Arithmetic method because it merges models fine-tuned for different tasks, preserving their strengths without significant performance degradation.

  1. Model Selection:

    • Base Model: the shared pretrained Llama base checkpoint
    • Source Models: the instruction-tuned and code-tuned fine-tunes
  2. Method Chosen: Task Arithmetic (task_arithmetic)

  3. Execution Environment: Utilize a GPU-accelerated environment to leverage the computational efficiency of this method.

  4. Configuration Parameters:

    • Define task vectors by subtracting the base model's parameters from each fine-tuned source model.
    • Add a weighted sum of these task vectors back onto the base model's parameters to obtain merged weights that combine the strengths of both fine-tunes (see the config sketch after these steps).
  5. Post-Merge Optimization:

    • After merging, perform additional optimizations like pruning and fine-tuning if necessary to enhance performance further.
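
Putting the scenario together, a hedged task_arithmetic configuration could look like the following. The model names stand in for the shared base and its two fine-tunes, and the weights are starting points to experiment with, not tuned values; run it with mergekit-yaml as shown earlier.

```yaml
# Task arithmetic: task vector = fine-tune minus base; the merged model is
# the base plus a weighted sum of those vectors.
models:
  - model: org/llama-base-instruct-ft   # placeholder: instruction fine-tune
    parameters:
      weight: 0.6
  - model: org/llama-base-code-ft       # placeholder: code fine-tune
    parameters:
      weight: 0.4
merge_method: task_arithmetic
base_model: org/llama-base              # placeholder: shared base checkpoint
dtype: float16
```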

By following this structured approach, you can effectively utilize MergeKit to create a more robust and capable language model tailored to your specific needs.

Final Tips: