AI-Northstar-Tech / vector-io

Comprehensive Vector Data Tooling. The universal interface for all vector databases, datasets, and RAG platforms. Easily export, import, back up, re-embed (using any model), or access your vector data from any vector database or repository.
https://vector-io.com
Apache License 2.0
222 stars 27 forks

feat : adding mongodb provider #108

Closed vipul-maheshwari closed 1 month ago

vipul-maheshwari commented 1 month ago

Add MongoDB Export Functionality

✨ Generated with love by Kaizen ❤️

Original Description

# Add MongoDB Export Functionality

- **Purpose:** Introduce functionality to export data from MongoDB to a specified format.
- **Key Changes:**
  - Added `ExportMongoDB` class for exporting data from MongoDB collections.
  - Implemented command-line argument parsing for MongoDB connection details.
  - Included methods for data flattening and exporting to Parquet format.
  - Updated `.gitignore` to exclude testing and environment files.
  - Added MongoDB entry to `DBNames` for consistency in naming.
- **Impact:** This enhancement allows users to seamlessly export data from MongoDB, improving data integration capabilities.

> ✨ Generated with love by [Kaizen](https://cloudcode.ai) ❤️
Original Description

- [ ] Export script
- [ ] Import script
----

> [!IMPORTANT]
> Adds MongoDB export functionality with BSON handling and vector dimension detection, and updates configuration for MongoDB support.
>
> - **Export Functionality**:
>   - Adds `ExportMongoDB` class in `mongodb_export.py` for exporting data from MongoDB.
>   - Handles BSON types like `ObjectId`, `Binary`, `Regex`, `Timestamp`, `Decimal128`, and `Code` in `flatten_dict()`.
>   - Detects vector dimensions if not provided, and exports data in batches to Parquet files.
> - **Configuration**:
>   - Adds `MONGODB` to `DBNames` in `names.py`.
>   - Updates `db_metric_to_standard_metric` in `util.py` to include MongoDB with `cosine` and `euclidean` distances.
> - **Import Functionality**:
>   - Placeholder for MongoDB import in `mongodb_import.py`.
>
> This description was created by [Ellipsis](https://www.ellipsis.dev?ref=AI-Northstar-Tech%2Fvector-io&utm_source=github&utm_medium=referral) for f343642aad87faf412befd105451df8ad90dc997. It will automatically update as commits are pushed.
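For readers who want a feel for the flattening and dimension-detection steps described above, here is a minimal sketch. It is not the code from this PR. It uses only the standard library, so `bytes`, `datetime`, and `Decimal` stand in for BSON types such as `Binary`, `Timestamp`, and `Decimal128`. The helper `detect_vector_dim`, the `sep` parameter, and the field name `embedding` are hypothetical names chosen for illustration.

```python
import base64
from datetime import datetime
from decimal import Decimal
from typing import Any, Dict, Optional


def flatten_dict(doc: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into a single level.

    Values that a columnar file cannot store directly are converted to
    strings. The stdlib types below are stand-ins (an assumption) for the
    BSON types the PR description lists.
    """
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {"a": {"b": 1}} becomes {"a_b": 1}.
            flat.update(flatten_dict(value, new_key, sep))
        elif isinstance(value, bytes):
            # Binary payloads become base64 text.
            flat[new_key] = base64.b64encode(value).decode("ascii")
        elif isinstance(value, datetime):
            flat[new_key] = value.isoformat()
        elif isinstance(value, Decimal):
            flat[new_key] = str(value)
        else:
            flat[new_key] = value
    return flat


def detect_vector_dim(doc: Dict[str, Any], vector_field: str) -> Optional[int]:
    """Return the length of a numeric list field, or None if it is absent or not a vector."""
    value = doc.get(vector_field)
    if not isinstance(value, list) or not value:
        return None
    for item in value:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            return None
    return len(value)
```

In this sketch the dimension would be read from the first document that actually carries the vector field, and only when the user has not supplied one.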
kaizen-bot[bot] commented 1 month ago

🔍 Code Review Summary

Attention Required: This push has potential issues. 🚨

Overview

performance (1 issue)
_1. Inefficient handling of large datasets in `get_data` method._

------

📁 **File:** [src/vdf_io/export_vdf/mongodb_export.py](src/vdf_io/export_vdf/mongodb_export.py#L218)

🔍 **Reasoning:** The current implementation loads all documents into memory at once using `list(cursor)`, which can lead to high memory usage for large collections.

💡 **Solution:** Process documents in a streaming manner to reduce memory footprint.

**Current Code:**
```python
batch_data = list(cursor)
```

**Suggested Code:**
```python
for document in cursor:
    flat_doc = self.flatten_dict(document)
    flattened_data.append(flat_doc)
```
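One caveat on the suggestion above: appending every flattened document to a single list still grows with the collection, so memory is only bounded if each batch is written out before the next one is read. A minimal sketch of that pattern follows, assuming the cursor is any iterable of dicts (a pymongo cursor is iterable). The names `iter_batches`, `export_in_batches`, `transform`, and `write_batch` are hypothetical and not part of this PR.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Document = Dict[str, Any]


def iter_batches(cursor: Iterable[Document], batch_size: int) -> Iterator[List[Document]]:
    """Yield lists of at most batch_size documents without materializing the whole cursor."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(cursor)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def export_in_batches(
    cursor: Iterable[Document],
    batch_size: int,
    transform: Callable[[Document], Document],
    write_batch: Callable[[List[Document]], None],
) -> int:
    """Transform and write documents batch by batch; return the number of documents written."""
    total = 0
    for batch in iter_batches(cursor, batch_size):
        rows = [transform(doc) for doc in batch]
        # Hand the rows to the sink (for example a Parquet writer) and let
        # them go out of scope before the next batch is read.
        write_batch(rows)
        total += len(rows)
    return total
```

Here `transform` would be the flattening step and `write_batch` whatever writes one Parquet file per batch, so peak memory is roughly one batch rather than the whole collection.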
