Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

https://hexdocs.pm/explorer

MIT License

Explorer


Explorer brings series (one-dimensional) and dataframes (two-dimensional) for fast data exploration to Elixir.

Features and design

The API is heavily influenced by Tidy Data and borrows much of its design from dplyr. The philosophy is well captured by this passage from dplyr's documentation:

  • By constraining your options, it helps you think about your data manipulation challenges.

  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

  • It uses efficient backends, so you spend less time waiting for the computer.

The aim here isn't to have the fastest dataframe library around (though it certainly helps that we're building on Polars, one of the fastest). Instead, we're aiming to bridge the best of many worlds.

That means you can expect the guiding principles to be 'Elixir-ish'. For example, you won't see the underlying data mutated, even if that's the most efficient implementation. Explorer functions will always return a new dataframe or series.

Getting started

Inside an Elixir script or Livebook:

Mix.install([
  {:explorer, "~> 0.8.0"}
])

Or in the mix.exs file of your application:

def deps do
  [
    {:explorer, "~> 0.8.0"}
  ]
end

Explorer will download a precompiled version of its native code upon installation. You can force a local build by setting the environment variable EXPLORER_BUILD=1 and including :rustler as a dependency:

  {:explorer, "~> 0.8.0", system_env: %{"EXPLORER_BUILD" => "1"}},
  {:rustler, ">= 0.0.0"}

If necessary, clean up before rebuilding with mix deps.clean explorer.

A glimpse of the API

We have two ways to represent data with Explorer: series, which are one-dimensional, and dataframes, which are two-dimensional collections of named series.

A series can be created from a list:

fruits = Explorer.Series.from_list(["apple", "mango", "banana", "orange"])

Your newly created series is going to look like:

#Explorer.Series<
  Polars[4]
  string ["apple", "mango", "banana", "orange"]
>
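Series are typed, and the dtype is inferred from the list you pass in. A quick sketch (the `{:s, 64}` notation for 64-bit signed integers follows the Explorer 0.8 docs; older versions used different dtype names):

```elixir
# The dtype is inferred from the values: integers become 64-bit
# signed integers, shown as s64 in the inspected output.
numbers = Explorer.Series.from_list([10, 20, 30])
Explorer.Series.dtype(numbers)

# Mixing integers and floats promotes the whole series to floats.
mixed = Explorer.Series.from_list([1, 2.5])
Explorer.Series.dtype(mixed)
```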

And you can, for example, sort that series:

Explorer.Series.sort(fruits)

Resulting in the following:

#Explorer.Series<
  Polars[4]
  string ["apple", "banana", "mango", "orange"]
>
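Sorting is just one of many series operations. As a sketch (function names taken from the Explorer.Series docs), numeric series support aggregations and element-wise arithmetic, and every operation returns a new series rather than mutating the original:

```elixir
elevations = Explorer.Series.from_list([8848, 8611, 6962])

# Aggregations reduce the series to a single value.
Explorer.Series.max(elevations)
Explorer.Series.mean(elevations)

# Element-wise arithmetic returns a new series; the input is untouched.
Explorer.Series.add(elevations, 100)
```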

Dataframes

Dataframes can be created in two ways: from scratch with Explorer.DataFrame.new/2, or by reading from a file (for example, with Explorer.DataFrame.from_csv/2).

When creating one from scratch, you can pass either series or lists:

mountains = Explorer.DataFrame.new(name: ["Everest", "K2", "Aconcagua"], elevation: [8848, 8611, 6962])

Your dataframe is going to look like this:

#Explorer.DataFrame<
  Polars[3 x 2]
  name string ["Everest", "K2", "Aconcagua"]
  elevation s64 [8848, 8611, 6962]
>
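You can pull a single column back out of a dataframe as a series. A short sketch, using Explorer.DataFrame.pull/2 and the Access syntax described in the docs:

```elixir
mountains =
  Explorer.DataFrame.new(
    name: ["Everest", "K2", "Aconcagua"],
    elevation: [8848, 8611, 6962]
  )

# Both expressions return the "elevation" column as an Explorer.Series.
Explorer.DataFrame.pull(mountains, "elevation")
mountains["elevation"]
```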

It's also possible to view a dataframe as a table, using the Explorer.DataFrame.print/2 function:

Explorer.DataFrame.print(mountains)

Prints:

+-------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 2] |
+---------------------+---------------------+
|        name         |      elevation      |
|      <string>       |        <s64>        |
+=====================+=====================+
| Everest             | 8848                |
+---------------------+---------------------+
| K2                  | 8611                |
+---------------------+---------------------+
| Aconcagua           | 6962                |
+---------------------+---------------------+

Now let's filter our dataframe. But first, let's require the Explorer.DataFrame module and give it a short alias:

require Explorer.DataFrame, as: DF

The "require" is needed to load the macro features of that module. We give it a shorter name to simplify our examples.

Now for the filter itself. We want to keep only the mountains that are above the mean elevation in our dataframe:

DF.filter(mountains, elevation > mean(elevation))

You can see that we can refer to the columns using their names, and use functions without defining them. This is possible due to the powerful Explorer.Query features, and it's the main reason we need to "require" the Explorer.DataFrame module.

The result is going to look like this:

#Explorer.DataFrame<
  Polars[2 x 2]
  name string ["Everest", "K2"]
  elevation s64 [8848, 8611]
>
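Filtering is only one of the available verbs. As a sketch of the same query style (DF.mutate/2 and DF.sort_by/2 are taken from the Explorer.DataFrame docs; behavior may vary slightly between versions), you can add derived columns and sort in one pipeline:

```elixir
require Explorer.DataFrame, as: DF

mountains =
  DF.new(
    name: ["Everest", "K2", "Aconcagua"],
    elevation: [8848, 8611, 6962]
  )

# Add a derived column and sort by elevation, tallest first.
# Each step returns a new dataframe; nothing is mutated.
mountains
|> DF.mutate(elevation_km: elevation / 1000)
|> DF.sort_by(desc: elevation)
```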

There is an extensive guide that you can play with in Livebook: Ten Minutes to Explorer.

You can also check the Explorer.DataFrame and Explorer.Series docs for further details.

Contributing

Explorer uses Rust for its default backend implementation. While Rust is not necessary to use Explorer as a package, you need the Rust tooling installed on your machine to compile it from source, which is the case when contributing to Explorer.

We require Rust Nightly, which can be installed with Rustup. If you already have Rustup and a recent version of Cargo installed, the correct version of Rust will be installed during the first compilation of the project. Otherwise, you can install the correct version manually:

rustup toolchain install nightly-2024-02-23

You can also use asdf:

asdf install rust nightly-2024-02-23

You may also need to install CMake to build the project, if it is not already installed.

Once you have made your changes, run mix ci to lint and format both the Elixir and Rust code.

Our integration tests require the AWS CLI to be installed, as well as a container engine such as Podman or Docker.

Once these dependencies are installed, run mix localstack.setup, and then run the cloud integration tests with mix test --only cloud_integration.

Just to recap, here is the combo of commands you need to run:

mix ci
mix localstack.setup
mix test --only cloud_integration

Precompilation

Explorer ships with the NIF code precompiled for the most popular architectures. This means that Explorer will work without the need to compile it from source.

This currently only works for Hex releases. For more information on how it works, please check the RustlerPrecompiled project.

Legacy CPUs

We ship some of the precompiled artifacts with modern CPU features enabled by default. In case your computer is not compatible with them, you can set an application environment value that is read at compile time, enabling the legacy variants of the artifacts:

config :explorer, use_legacy_artifacts: true

Features disabled

Some features cannot be compiled for certain targets, because one of the dependencies does not work on them.

This is the case for NDJSON reads and writes, which don't work on the RISC-V target. We also disable AWS S3 reads and writes for the RISC-V target, because one of the dependencies of ObjectStore does not compile on it.

Sponsors

Amplified