BharatSahAIyak / autotune

A comprehensive toolkit for seamless data generation and fine-tuning of NLP models, all conveniently packed into a single block.
MIT License

bge_finetuning addition #146

Closed: kartikbhtt7 closed this pull request 1 month ago

kartikbhtt7 commented 2 months ago

Summary by CodeRabbit

coderabbitai[bot] commented 2 months ago
## Walkthrough

This pull request introduces support for a new task type, `"embedding_finetuning"`, alongside the existing functionality for `"whisper_finetuning"`. Changes include modifications to conditional logic in various methods, the addition of a new field in a serializer, and the implementation of a new class for embedding fine-tuning. The updates enhance task handling, model configuration, and dataset management across multiple modules, ensuring that the system can process both fine-tuning tasks effectively.

## Changes

| Files | Change Summary |
|-------|----------------|
| `workflow/mixins.py` | Modified the `dispatch` method to include `"embedding_finetuning"` in the conditional logic for setting `workflow_id`. |
| `workflow/serializers.py` | Enhanced `ModelDataSerializer` by adding a new field `guide_model` and including `"embedding_finetuning"` in the `training_task` choices. |
| `workflow/training/__init__.py` | Added import for the `EmbeddingFineTuning` class from the `embedding_finetuning` module. |
| `workflow/training/embedding_finetuning.py` | Implemented the `EmbeddingFineTuning` class for fine-tuning sentence embedding models, including methods for dataset loading, model preparation, and evaluation metrics. |
| `workflow/training/onnx_conversion.py` | Renamed the task key from `"embedding"` to `"embedding_finetuning"` in the `convert_to_onnx` function. |
| `workflow/training/train.py` | Expanded the `upload_cache` function to return early for both `"whisper_finetuning"` and `"embedding_finetuning"` tasks. |
| `workflow/training/utils.py` | Added support for `"embedding_finetuning"` in the `get_task_class` and `get_model_class` functions. |
| `workflow/utils.py` | Introduced a task configuration for `"embedding_finetuning"` in `get_task_config` and updated `get_task_mapping` to include the new task. |

## Possibly related PRs

- #134: Changes in `workflow/serializers.py` related to adding the `"embedding_finetuning"` choice to the `training_task` field are directly connected to the main PR, which introduces the functionality for handling the `"embedding_finetuning"` task.
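To make the end-to-end flow concrete, here is a hedged example of the kind of request payload that would exercise the new task type. Only `training_task` and `guide_model` are confirmed by the change summary above; the `dataset` field and all values are illustrative assumptions.

```python
# Illustrative payload only; the exact fields accepted by ModelDataSerializer may differ.
request_payload = {
    "training_task": "embedding_finetuning",  # new choice added in this PR
    "guide_model": "BAAI/bge-base-en-v1.5",   # new serializer field; value is an assumed example
    "dataset": "my-org/qa-pairs",             # assumed field name and dataset identifier
}
```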

Recent review details

**Configuration used:** CodeRabbit UI
**Review profile:** CHILL

Commits

Files that changed from the base of the PR and between 4e6ecd7577b802f98630b15a3fe941b174f465c5 and 369619ac569e9f2a1104ce3f122251837530d4e7.

Files selected for processing (8)

* workflow/mixins.py (1 hunks)
* workflow/serializers.py (1 hunks)
* workflow/training/__init__.py (1 hunks)
* workflow/training/embedding_finetuning.py (1 hunks)
* workflow/training/onnx_conversion.py (2 hunks)
* workflow/training/train.py (1 hunks)
* workflow/training/utils.py (3 hunks)
* workflow/utils.py (2 hunks)
Additional context used
Ruff
workflow/training/__init__.py
- 4-4: `.whisper.WhisperFineTuning` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
- 5-5: `.embedding_finetuning.EmbeddingFineTuning` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
Additional comments not posted (18)
workflow/training/__init__.py (1)
`5-5`: **LGTM! The new import is approved.**

The new import of `EmbeddingFineTuning` is added to `__init__.py`, which is typically used to define the public API of a package, so the import is most likely intended to be used by other modules that import from this package. The Ruff warning about the unused import is a false positive in this context: the import is not expected to be used within this file itself, so it can be safely ignored or silenced by listing the class in `__all__` (see the sketch after the Ruff note below).
Tools
Ruff
5-5: `.embedding_finetuning.EmbeddingFineTuning` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)
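A minimal sketch of how both F401 warnings could be silenced while keeping the re-exports. The two imports match the Ruff output above; the `__all__` declaration is a suggested addition, not code from the PR.

```python
# workflow/training/__init__.py (sketch)
from .whisper import WhisperFineTuning
from .embedding_finetuning import EmbeddingFineTuning

# Declaring the package's public API makes the re-exports explicit and satisfies Ruff's F401 rule.
__all__ = ["WhisperFineTuning", "EmbeddingFineTuning"]
```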
workflow/training/utils.py (3)
`52-52`: **LGTM!**

The code change is approved.

---

`66-66`: **LGTM!**

The code change is approved.

---

`4-4`: **LGTM, but verify the `EmbeddingFineTuning` class implementation.**

The import statement is approved. However, ensure that the `EmbeddingFineTuning` class is correctly implemented in its defining module and re-exported via `workflow/training/__init__.py`. Run the following script to verify the class implementation:
Verification successful

---

**The `EmbeddingFineTuning` class is implemented correctly.**

The class is defined in `workflow/training/embedding_finetuning.py` with an `__init__` method and a `load_dataset` method, confirming the validity of the import statement.

- `workflow/training/embedding_finetuning.py`: contains the `EmbeddingFineTuning` class implementation.
Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Verify the implementation of the `EmbeddingFineTuning` class.

# Test: Search for the class definition. Expect: The class is defined.
rg --type python -A 10 $'class EmbeddingFineTuning'
```

Length of output: 83

---

Script:

```shell
#!/bin/bash
# Description: Verify the implementation of the `EmbeddingFineTuning` class.

# Test: Search for the class definition. Expect: The class is defined.
rg --type py -A 10 $'class EmbeddingFineTuning'
```

Length of output: 994
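For context, a hedged sketch of the task-to-class dispatch that `get_task_class` presumably implements. Only the `"whisper_finetuning"` and `"embedding_finetuning"` keys are confirmed by the review; the function shape and any other entries are assumptions.

```python
# Hypothetical shape of the dispatch in workflow/training/utils.py; the real
# implementation and the full set of supported tasks may differ.
from workflow.training import EmbeddingFineTuning, WhisperFineTuning

def get_task_class(task_name: str):
    task_classes = {
        "whisper_finetuning": WhisperFineTuning,
        "embedding_finetuning": EmbeddingFineTuning,  # added in this PR
        # ... other task types omitted ...
    }
    return task_classes.get(task_name)
```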
workflow/training/onnx_conversion.py (2)
`57-57`: **LGTM!**

The code change is approved.

---

`71-71`: **LGTM!**

The code change is approved.
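A toy sketch of why the key rename matters: if the exporter settings are keyed by task name, the key must match the string the rest of the pipeline now uses. Everything below except the two task strings is an illustrative assumption, not the repository's actual `convert_to_onnx` code.

```python
# Illustrative only; the real convert_to_onnx body and export settings are not shown in the review.
TASK_EXPORT_SETTINGS = {
    "embedding_finetuning": {"feature": "feature-extraction"},          # key was "embedding" before this PR
    "whisper_finetuning": {"feature": "automatic-speech-recognition"},
}

def select_export_settings(task: str) -> dict:
    # A stale "embedding" key here would raise a KeyError once callers switched to "embedding_finetuning".
    return TASK_EXPORT_SETTINGS[task]
```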
workflow/training/embedding_finetuning.py (6)
`12-15`: **LGTM!**

The `__init__` method is implemented correctly. It calls the parent class constructor and initializes the instance variables as expected.

---

`17-33`: **LGTM!**

The `load_dataset` method is implemented correctly. It loads and validates the dataset, performs the train-test split, and calls other methods for dataset preparation and model loading as expected.

---

`35-38`: **LGTM!**

The `_prepare_dataset` method is implemented correctly. It prepares the validation dataset by creating dictionaries for queries, corpus, and relevant documents as expected.

---

`40-59`: **LGTM!**

The `_load_model` method is implemented correctly. It loads the models, initializes the train loss and evaluator, and creates a partial function for the trainer as expected.

---

`67-83`: **LGTM!**

The `get_training_args` method is implemented correctly. It returns an instance of `SentenceTransformerTrainingArguments` with various training arguments initialized from the `req_data` and `args` instance variables as expected.

---

`85-88`: **LGTM!**

The `push_to_hub` method is implemented correctly. It pushes the trained model to the Hugging Face Hub using the `push_to_hub` method of the trainer's model as expected.
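Since the review references `SentenceTransformerTrainingArguments`, an evaluator built from queries/corpus/relevant-docs dictionaries, and pushing the trained model to the Hub, here is a minimal sketch of how such a fine-tuning setup typically looks with the `sentence-transformers` library (v3+). The dataset columns, hyperparameter values, base model name, and output paths are assumptions for illustration, not the PR's actual configuration.

```python
# Minimal sketch; anything marked "assumed" is not taken from the PR.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Assumed (anchor, positive) training pairs; the PR's real dataset schema may differ.
train_dataset = Dataset.from_dict({
    "anchor": ["what is the capital of France?"],
    "positive": ["Paris is the capital of France."],
})

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed base model
loss = MultipleNegativesRankingLoss(model)

# Validation data reshaped into the queries / corpus / relevant_docs dictionaries
# that the review says _prepare_dataset builds.
queries = {"q1": "what is the capital of France?"}
corpus = {"d1": "Paris is the capital of France."}
relevant_docs = {"q1": {"d1"}}
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="val")

args = SentenceTransformerTrainingArguments(
    output_dir="./bge-finetuned",        # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()

# The review notes push_to_hub delegates to the trainer's model; the repo id below is a placeholder.
# trainer.model.push_to_hub("your-org/bge-finetuned-example")
```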
workflow/training/train.py (1)
`208-208`: **LGTM, but verify the impact of the change.**

The change is approved: it correctly handles the new `"embedding_finetuning"` training task by returning early, mirroring the existing handling of the `"whisper_finetuning"` task. Because this alters the control flow of the function, verify that the early return does not adversely impact other components or processes that rely on it, and ensure that caching is not expected for these training tasks in other parts of the system.

To verify the impact of this change, consider:

1. Reviewing the codebase to identify all invocations of the `upload_cache` function and confirming that the early return for these training tasks is expected behavior.
2. Testing the system end-to-end with these training tasks to confirm that overall functionality remains intact without the caching process.
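A hedged sketch of the guard described above. Only the two task names and the early return are confirmed by the review; the function signature and the surrounding caching logic are assumptions.

```python
# Hypothetical shape of the guard in workflow/training/train.py.
def upload_cache(task: str, cache_dir: str) -> None:
    if task in ("whisper_finetuning", "embedding_finetuning"):
        # These tasks skip the shared cache upload entirely (assumed rationale).
        return
    ...  # existing cache-upload logic for the remaining task types
```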
workflow/mixins.py (1)
`172-172`: **LGTM!**

The code change is approved. The updated condition allows the `dispatch` method to handle both the `"whisper_finetuning"` and `"embedding_finetuning"` task types, which is consistent with the PR objective of introducing support for a new task type.
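For illustration only, the kind of membership check the updated condition likely amounts to. The pair of task names and the `workflow_id` side effect come from the review; the mixin name, request fields, and everything else are assumptions.

```python
# Hypothetical excerpt; meant to be combined with a DRF view, not used standalone.
FINETUNING_TASKS = {"whisper_finetuning", "embedding_finetuning"}

class WorkflowDispatchMixin:  # name is an assumption, not the repository's actual mixin
    def dispatch(self, request, *args, **kwargs):
        if request.data.get("task_type") in FINETUNING_TASKS:
            # Assumed source of the workflow id for fine-tuning tasks.
            self.workflow_id = request.data.get("workflow_id")
        return super().dispatch(request, *args, **kwargs)
```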
workflow/utils.py (2)
`317-330`: **LGTM!**

The `"embedding_finetuning"` configuration follows the structure of the other task configurations and includes the necessary parameters for embedding fine-tuning tasks.

---

`354-354`: **LGTM!**

The `"embedding_finetuning"` mapping is consistent with the schema example provided in the configuration and ensures that the system can correctly interpret and process data related to embedding fine-tuning tasks.
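To make the "configuration plus mapping" pattern concrete, a sketch of what entries in `get_task_config` and `get_task_mapping` might look like. Every key and value below is an assumption; only the task name itself comes from the PR.

```python
# Illustrative only; the repository's actual config keys are not shown in the review.
def get_task_config() -> dict:
    return {
        "embedding_finetuning": {
            "task": "embedding_finetuning",
            "schema_example": {"query": "sample question", "passage": "sample answer"},  # assumed schema
            "model": "BAAI/bge-base-en-v1.5",  # assumed default model
        },
        # ... other task types ...
    }

def get_task_mapping() -> dict:
    return {
        "embedding_finetuning": ["query", "passage"],  # assumed field mapping
        # ... other task types ...
    }
```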
workflow/serializers.py (2)
`242-242`: **LGTM!**

The code change is approved.

---

`246-250`: **LGTM!**

The code change is approved.
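Since the walkthrough says `ModelDataSerializer` gained a `guide_model` field and an extra `training_task` choice, here is a minimal Django REST Framework sketch of that shape. The field options, the other fields, and the other choices are assumptions.

```python
# Hypothetical excerpt of workflow/serializers.py; only guide_model and the
# "embedding_finetuning" choice are confirmed, the rest is illustrative.
from rest_framework import serializers

class ModelDataSerializer(serializers.Serializer):
    dataset = serializers.CharField()                    # assumed existing field
    guide_model = serializers.CharField(required=False)  # new field added in this PR
    training_task = serializers.ChoiceField(
        choices=[
            "text_classification",   # assumed existing choice
            "whisper_finetuning",
            "embedding_finetuning",  # new choice added in this PR
        ]
    )
```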