Icelk / std-dev


std-dev

Your Swiss Army knife for swiftly processing any amount of data. Implemented for industrial and educational purposes alike.

This codebase is well-documented and commented, in an effort to expose the wonderful algorithms of data analysis to the masses.

We're ever expanding, but for now the following are implemented.

Usage

std-dev can be used in three ways: as a library (with optional cargo features), as an interactive CLI program, or by piping data to it through standard input.

It accepts any comma/space separated values. Scientific notation is supported. This is minimalistic by design, as other programs may be used to produce/modify the data before it's processed by us.
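To illustrate the input format described above, here is a minimal sketch of a parser for comma/space separated values. It is not std-dev's actual parser; the function name `parse_values` is hypothetical, and the sketch relies only on Rust's standard `f64` parsing, which accepts scientific notation.

```rust
/// Hypothetical helper (not part of std-dev): parse comma- and/or
/// whitespace-separated numbers from one chunk of input.
/// Returns an error describing the first token that is not a valid number.
fn parse_values(input: &str) -> Result<Vec<f64>, String> {
    input
        // Treat commas and any whitespace (including newlines) as separators.
        .split(|c: char| c == ',' || c.is_whitespace())
        // Consecutive separators (e.g. ", ") produce empty tokens; skip them.
        .filter(|token| !token.is_empty())
        // `str::parse::<f64>` accepts scientific notation such as `3e2` or `4E-1`.
        .map(|token| {
            token
                .parse::<f64>()
                .map_err(|err| format!("invalid number {token:?}: {err}"))
        })
        .collect()
}
```

Calling `parse_values("1, 2.5 3e2")` yields the three numbers 1, 2.5 and 300. Note that the standard parser also accepts tokens such as `inf` and `NaN`; a real tool may want to reject those.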

Shell completion

Using the subcommand completion, std-dev automatically generates shell completions for your shell and tries to put them in the appropriate location.

When using Bash or Zsh, you should run std-dev as root, as we need root privileges to write to their completion directories. Alternatively, use the --print option and write the completion file yourself.

Cargo features

When using this as a library, I recommend disabling all features except base (std-dev = { version = "0.1", default-features = false, features = ["base"] }) and then enabling those you need.

Documentation

Documentation of the main branch can be found at doc.icelk.dev.

To generate documentation that shows which cargo features enable each piece of code, set the environment variable RUSTDOCFLAGS to --cfg docsrs (e.g. in Fish set -x RUSTDOCFLAGS "--cfg docsrs") and then run cargo +nightly doc.

Performance

This library aims to be as fast as possible while maintaining easily readable code.

Clusters

Now that all algorithms run in linear time, this is not as useful as it once was, but it is nevertheless an interesting feature. If you already have clustered data, this feature is great.

When using the clusters feature (turning your list into a ClusterList), calculations are done per unique value. Say you have a dataset of infant heights, in centimeters. That's probably only going to be some 40 different values, but potentially millions of entries. Using clusters, all that data is only processed as O(40), not O(millions). (I know that notation isn't right, but you get my point.)
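The idea of calculating per unique value can be sketched as follows. This is an illustration only, not the library's ClusterList API: the function name `cluster_mean_std_dev` and the `(value, count)` slice representation are assumptions, and the sketch computes the population (not sample) standard deviation.

```rust
/// Hypothetical helper (not part of std-dev): mean and population standard
/// deviation of data stored as (value, count) pairs.
/// Returns `None` when the clusters contain no entries.
fn cluster_mean_std_dev(clusters: &[(f64, usize)]) -> Option<(f64, f64)> {
    let total: usize = clusters.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }
    let n = total as f64;
    // Each unique value contributes `value * count` to the sum.
    let mean = clusters.iter().map(|&(v, c)| v * c as f64).sum::<f64>() / n;
    // Likewise, each squared deviation is weighted by its count.
    let variance = clusters
        .iter()
        .map(|&(v, c)| (v - mean).powi(2) * c as f64)
        .sum::<f64>()
        / n;
    Some((mean, variance.sqrt()))
}
```

Both loops visit each unique value once, so the work depends on the number of unique values rather than on the number of entries they represent.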

Creating this cluster involves adding all the values to a map. This takes O(n) time, but is very slow compared to all other algorithms. After creation, most operations in this library are executed in O(m) time, where m is the count of unique values.
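As a rough sketch of the creation step, assuming a plain `HashMap` (the library's actual map type and the function name `to_clusters` are not taken from the source), values can be counted in one O(n) pass:

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of std-dev): group equal values together,
/// returning (value, count) pairs sorted by value.
fn to_clusters(values: &[f64]) -> Vec<(f64, usize)> {
    // `f64` implements neither `Eq` nor `Hash`, so key the map on the bit pattern.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &value in values {
        *counts.entry(value.to_bits()).or_insert(0) += 1;
    }
    let mut clusters: Vec<(f64, usize)> = counts
        .into_iter()
        .map(|(bits, count)| (f64::from_bits(bits), count))
        .collect();
    // HashMap iteration order is arbitrary; sort for a deterministic result.
    clusters.sort_by(|a, b| a.0.total_cmp(&b.0));
    clusters
}
```

Keying on the bit pattern means 0.0 and -0.0 count as different values in this sketch. The final sort adds O(m log m) work for m unique values and is only there to make the output order predictable. The hashing and map insertion per entry is the kind of overhead that makes this step slow compared with a simple pass over the data.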