Oxen is a lightning-fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.
The interface mirrors git, but shines in many areas where git and git-lfs fall short. Oxen is built from the ground up for data and is optimized to handle both large datasets and large files.
oxen init
oxen add images/
oxen add annotations/*.parquet
oxen commit -m "Adding 200k images and their corresponding annotations"
oxen push origin main
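The same workflow can be scripted. The following is a minimal sketch, not part of Oxen itself: `build_steps` and `run_steps` are hypothetical helper names, the `-m` flag on `oxen commit` is assumed to work as it does in git, and the script only prints the commands unless `dry_run=False` is passed.

```python
import glob
import shlex
import subprocess


def build_steps(message, patterns):
    """Return the oxen commands as argument lists.

    Globs are expanded here because subprocess does not go through a shell.
    A pattern with no matches is passed through unchanged.
    """
    steps = [["oxen", "init"]]
    for pattern in patterns:
        matches = sorted(glob.glob(pattern)) or [pattern]
        steps.append(["oxen", "add", *matches])
    # Assumption: `oxen commit` takes its message via `-m`, mirroring git.
    steps.append(["oxen", "commit", "-m", message])
    steps.append(["oxen", "push", "origin", "main"])
    return steps


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for step in steps:
        print(shlex.join(step))
        if not dry_run:
            subprocess.run(step, check=True)


if __name__ == "__main__":
    run_steps(
        build_steps(
            "Adding 200k images and their corresponding annotations",
            ["images/", "annotations/*.parquet"],
        )
    )
```

Keeping the commands as data makes it easy to review exactly what will run before anything touches the repository.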
Oxen consists of a command line interface, bindings for Rust and Python, and HTTP interfaces, making it easy to integrate into your workflow.
Oxen is designed to efficiently manage large datasets, including those with large individual files, such as CSV files with millions of rows. It also handles datasets comprising millions of individual files and directories, such as the complete collection of ImageNet images.
One of the main reasons datasets are hard to maintain is the sheer cost of indexing the data and transferring it over the network. We wanted to be able to index hundreds of thousands of images, videos, audio files, and text files in seconds.
Watch below as we version hundreds of thousands of images in seconds.
But speed is only the beginning.
Oxen is built around ergonomics and ease of use, and it is easy to learn. If you know how to use git, you know how to use Oxen.
To learn everything Oxen can do, see the full documentation at https://docs.oxen.ai.
You can install Oxen through Homebrew, through pip, or from our releases page.
brew tap Oxen-AI/oxen
brew install oxen
pip install oxenai
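After installing, it can help to confirm what is actually visible from your environment. This is a minimal sketch under two assumptions: that the Homebrew install puts an `oxen` binary on your `PATH`, and that the `oxenai` pip package is imported under the module name `oxen`. The helper name `check_install` is hypothetical.

```python
import importlib.util
import shutil


def check_install():
    """Report which Oxen entry points are visible from this environment."""
    return {
        # Assumption: the CLI is installed as an executable named `oxen`.
        "cli": shutil.which("oxen") is not None,
        # Assumption: the `oxenai` pip package is imported as `oxen`.
        "python": importlib.util.find_spec("oxen") is not None,
    }


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'not found'}")
```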
Clone your first Oxen repository from the OxenHub.
## Support

If you have any questions, comments, suggestions, or just want to get in contact with the team, feel free to email us at `hello@oxen.ai`.

## Contributing

This repository contains the Python library that wraps the core Rust codebase. We would love help extending the Python interfaces, the documentation, or the core Rust library.

Code bases to contribute to:

* [Core Rust Library](https://github.com/Oxen-AI/Oxen)
* [Python Interface](https://github.com/Oxen-AI/oxen-release/tree/main/oxen)
* [Documentation](https://github.com/Oxen-AI/docs)

If you are building anything with Oxen.ai or have any questions, we would love to hear from you in our [discord](https://discord.gg/s3tBEn7Ptg).

## Build

Set up a virtual environment:

```Bash
# Set up your python virtual environment
$ python -m venv ~/.venv_oxen # could be python3
$ source ~/.venv_oxen/bin/activate
$ pip install maturin
```

```Bash
# Install rust
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Run maturin
$ maturin develop
```

## Test

```Bash
$ pytest -s tests/
```

## Why build Oxen?

Oxen was built by a team of machine learning engineers who have spent countless hours in their careers managing datasets. We have used many different tools, but none of them were as easy to use and as ergonomic as we would like.

If you have ever tried [git lfs](https://git-lfs.com/) to version large datasets and became frustrated, we feel your pain. Solutions like git-lfs are too slow when it comes to the scale of data we need for machine learning.

If you have ever uploaded a large dataset of images, audio, video, or text to a cloud storage bucket with a name like `s3://data/images_july_2022_final_2_no_really_final.tar.gz`, you know the problem. We built Oxen to be the tool we wish we had.

## Why the name Oxen?

"Oxen" comes from the fact that the tooling will plow, maintain, and version your data like a good farmer tends to their fields. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level ML problems that matter to your product.