LukasHedegaard / datasetops

Fluent dataset operations, compatible with your favorite libraries
https://datasetops.readthedocs.io
MIT License

Dataset Ops: Fluent dataset operations, compatible with your favorite libraries


Dataset Ops provides a fluent interface for loading, filtering, transforming, splitting, and combining datasets. Designed specifically with data science and machine learning applications in mind, it integrates seamlessly with TensorFlow and PyTorch.

Appetizer

import datasetops as do

# prepare your data
train, val, test = (
    do.from_folder_class_data('path/to/data/folder')
    .named("data", "label")
    .image_resize((240, 240))
    .one_hot("label")
    .shuffle(seed=42)
    .split([0.6, 0.2, 0.2])
)

# use with your favorite framework
train_tf = train.to_tensorflow() 
train_pt = train.to_pytorch() 

# or do your own thing
for img, label in train:
    ...
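
To make the `.shuffle(seed=42)` and `.split([0.6, 0.2, 0.2])` steps above more concrete, here is a minimal plain-Python sketch of what a seeded shuffle followed by a fractional split does to the item indices. It only illustrates the idea and is not the Dataset Ops implementation; the function name `shuffle_and_split` and its rounding behavior are assumptions made for this example.

```python
import random


def shuffle_and_split(n_items, fractions, seed=42):
    """Partition range(n_items) into shuffled index lists sized by `fractions`.

    Hypothetical helper for illustration only; not part of Dataset Ops.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    indices = list(range(n_items))
    # A dedicated Random instance makes the shuffle reproducible for a given seed
    random.Random(seed).shuffle(indices)

    splits, start = [], 0
    for i, frac in enumerate(fractions):
        # The last split takes whatever remains, so rounding never drops an item
        is_last = i == len(fractions) - 1
        end = n_items if is_last else start + round(frac * n_items)
        splits.append(indices[start:end])
        start = end
    return splits


train_idx, val_idx, test_idx = shuffle_and_split(10, [0.6, 0.2, 0.2])
print(len(train_idx), len(val_idx), len(test_idx))  # 6 2 2
```

The three index lists are disjoint and together cover every item exactly once, which is the property you want from a train/validation/test split.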

Installation

Binary installers are available from the Python Package Index (PyPI):

pip install datasetops

Why?

Collecting and preprocessing datasets is tiresome and often takes upwards of 50% of the effort spent in the data science and machine learning lifecycle. While TensorFlow and PyTorch have some useful dataset utilities available, they are designed specifically with the respective frameworks in mind. Unsurprisingly, this makes it hard to switch between them, and training-ready dataset definitions are bound to one or the other. Moreover, they offer little help with standard processing steps such as sampling, splitting, transforming, and combining datasets.

Dataset Ops aims to make these processing steps easier, faster, and more intuitive to perform, while retaining full compatibility with the leading libraries in both directions. This also means you can take a dataset from torchvision and use it directly with TensorFlow:

import datasetops as do
import torchvision

torch_usps = torchvision.datasets.USPS('../dataset/path', download=True)
tensorflow_usps = do.from_pytorch(torch_usps).to_tensorflow()
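
Conversions like the one above are plausible because PyTorch map-style datasets expose just `__len__` and `__getitem__`. The sketch below shows that idea in plain Python, with no framework imported. `ToyDataset` and `iterate_items` are hypothetical names invented for this example, and nothing here is taken from the Dataset Ops source.

```python
class ToyDataset:
    """Stand-in for a map-style dataset: it only offers __len__ and __getitem__."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


def iterate_items(dataset):
    """Yield items one at a time from anything that has a length and is indexable."""
    for idx in range(len(dataset)):
        yield dataset[idx]


toy = ToyDataset(images=[[0, 1], [1, 0], [1, 1]], labels=[0, 1, 1])
items = list(iterate_items(toy))
print(items[0])  # ([0, 1], 0)
```

A generator function of this kind is the sort of callable that `tf.data.Dataset.from_generator` accepts, given a matching `output_signature`, which is one way such a bridge between the two frameworks can be built.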

Development Status

The library is still under heavy development and the API may be subject to change.

Implemented and planned features fall into the following areas:

Loaders

Converters

Dataset information

Sampling and splitting

Item manipulation

Dataset combinations

Citation

If you use this software, please cite it as below:

@software{Hedegaard_DatasetOps_2022,
  author = {Hedegaard, Lukas and Oleksiienko, Illia and Legaard, Christian Møldrup},
  doi = {10.5281/zenodo.7223644},
  month = {10},
  title = {{DatasetOps}},
  version = {0.0.7},
  year = {2022}
}