facebookresearch / shumai

Fast Differentiable Tensor Library in JavaScript and TypeScript with Bun + Flashlight
https://facebookresearch.github.io/shumai
MIT License
Shumai

A fast, network-connected, differentiable tensor library for TypeScript (and JavaScript). Built with bun + flashlight for software engineers and researchers alike.


⚠️ This is experimental software! ⚠️



Quickstart

Install Bun and ArrayFire

For macOS users: You can use [Homebrew](https://brew.sh) to install ArrayFire:

```bash
curl https://bun.sh/install | bash
brew install arrayfire
```

For Linux users: If you're running Ubuntu with **x86-64**, you can use the official distribution:

```bash
curl https://bun.sh/install | bash
sudo apt install -y gnupg2 ca-certificates
sudo apt-key adv --fetch-key https://repo.arrayfire.com/GPG-PUB-KEY-ARRAYFIRE-2020.PUB
echo "deb https://repo.arrayfire.com/debian all main" | sudo tee /etc/apt/sources.list.d/arrayfire.list
sudo apt update
sudo apt install -y arrayfire-cpu3-dev arrayfire-cpu3-openblas
```

If you're running Ubuntu with **ARMv8**, you'll need to build from source:

```bash
curl https://bun.sh/install | bash
sudo apt remove libarrayfire-dev libarrayfire-cpu3 libarrayfire-cpu-dev
sudo apt install -y libblas-dev liblapack-dev liblapacke-dev libfftw3-dev libboost-all-dev cmake make g++
cd /tmp
sudo rm -rf arrayfire
git clone https://github.com/arrayfire/arrayfire.git
cd arrayfire
cmake -Bbuild -DAF_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release -DAF_BUILD_UNIFIED=OFF -DAF_TEST_WITH_MTX_FILES=OFF -DBUILD_TESTING=OFF
make -j4 -Cbuild
sudo make install -Cbuild
```

Otherwise, see the official [ArrayFire installation guide](https://github.com/arrayfire/arrayfire/wiki/Getting-ArrayFire).

Then run:

bun install @shumai/shumai

Only macOS and Linux are supported. Linux installs default to GPU computation with CUDA, and macOS to CPU. Detailed install instructions below.

Install is work in progress: please file an issue if you run into problems.

Usage

shumai will always attempt to use an attached GPU or accelerator. CPU computation uses the ArrayFire CPU backend, which is not well optimized.

We hope to support the ArrayFire OpenCL backend and other non-ArrayFire tensor backends soon.

If shumai seems unusually slow, please file an issue!

Standard array utilities:

import * as sm from "@shumai/shumai"

// create a 1024 by 1024 tensor filled with samples from a normal distribution
let X = sm.randn([1024, 1024])
let W = sm.identity(1024)
let Y = X.matmul(W)
console.log(Y.shape)

Conversion to and from JavaScript native arrays:

const data : Float32Array = new Float32Array(128)
for (let i = 0; i < 128; ++i) {
  data[i] = Math.random()
}

const X : sm.Tensor = sm.tensor(data) // Tensor is reached through the sm namespace import
const pi = sm.scalar(3.14)
const Y = X.mul(pi)

// tensors can be converted back to native JavaScript
const Y_data = Y.toFloat32Array()

// scalar tensors can be converted to JavaScript numbers
const total : number = X.sum().toFloat32()
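
Because these conversions go through `Float32Array`, values read back from a tensor carry single-precision rounding. The sketch below shows that effect using plain JavaScript typed arrays only (no Shumai calls). It assumes the tensor stores 32-bit floats, as the `Float32Array` round trip above suggests.

```typescript
// Float32Array stores IEEE 754 single-precision values, so a JavaScript
// number (double precision) is rounded when written into it.
const pi64 = 3.14
const buffer = new Float32Array([pi64])
const pi32 = buffer[0]

// Math.fround applies the same rounding without a typed array.
const sameRounding = pi32 === Math.fround(pi64)
const roundingError = Math.abs(pi32 - pi64)

console.log(pi32, sameRounding, roundingError)
```

When comparing results against values computed in ordinary JavaScript numbers, a small tolerance is therefore more appropriate than strict equality.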

Gradients:

const W = sm.randn([128, 128])
W.requires_grad = true

const X = sm.randn([128, 128])
const diff = X.sub(W)
const mse = diff.mul(diff).sum()
mse.backward()

W.grad // this gradient is now populated

// copy W without allowing gradient updates
const Y = W.detach()
Y.sum().backward() // nothing changes
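
As a sanity check on what `W.grad` should contain: the loss above is the sum of (X - W)^2 over all elements, so its analytic gradient with respect to W is -2(X - W). The sketch below verifies this with central finite differences on a tiny flattened example. It is plain TypeScript and deliberately does not call Shumai; the helper names (`loss`, `analyticGrad`, `numericGrad`) are illustrative only.

```typescript
// Plain-TypeScript check of d/dW sum((X - W)^2) = -2 (X - W).
// No Shumai calls are used; all helper names here are hypothetical.

function loss(x: number[], w: number[]): number {
  let total = 0
  for (let i = 0; i < x.length; ++i) {
    const d = x[i] - w[i]
    total += d * d
  }
  return total
}

function analyticGrad(x: number[], w: number[]): number[] {
  return x.map((xi, i) => -2 * (xi - w[i]))
}

// Central difference: (f(w + eps) - f(w - eps)) / (2 * eps), one entry at a time.
function numericGrad(x: number[], w: number[], eps = 1e-4): number[] {
  return w.map((_, i) => {
    const wPlus = w.slice()
    const wMinus = w.slice()
    wPlus[i] += eps
    wMinus[i] -= eps
    return (loss(x, wPlus) - loss(x, wMinus)) / (2 * eps)
  })
}

const x = [0.5, -1.0, 2.0, 0.25]
const w = [1.0, 0.0, -0.5, 0.25]
const exact = analyticGrad(x, w)
const approx = numericGrad(x, w)
const maxError = Math.max(...exact.map((g, i) => Math.abs(g - approx[i])))
console.log(exact, maxError)
```

The same comparison can be made against a real `W.grad` by converting it to a native array first, with a tolerance suited to single precision.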

Some more examples can be found here.

Supported operators can be found here.

Install

The install procedure is a work in progress! If you have any problems building or installing, we would greatly appreciate filed issues. Please tell us about your platform/OS when you do.

Prerequisites: bun and ArrayFire (see the Quickstart above).

Once bun and ArrayFire are installed, install the package and backing libs with bun:

bun install @shumai/shumai

Windows Support

While not officially supported, Windows users have had success running shumai through Docker + WSL2 + Linux, including with CUDA support.

Building Native Libraries from Source

Note: not required when developing TypeScript/JavaScript library components locally.

Build-from-source instructions are provided below for macOS and Linux.

This process will build the dependent ffi libraries (libflashlight and libflashlight_binding) and pack them using npm pack to generate a @shumai/shumai_*.tgz package. You can then use npm install $PATH_TO_SOURCE/@shumai/shumai-*.tgz to install the package where you'd like.

Building on macOS from Source

First, install ArrayFire CPU with brew install arrayfire.

Build and install Flashlight:

mkdir -p $HOME/usr/ # installing flashlight here
git clone --recursive --depth 1 https://github.com/flashlight/flashlight.git
cd flashlight
mkdir -p build
cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON  \
  -DCMAKE_INSTALL_PREFIX=$HOME/usr \
  -DFL_USE_ARRAYFIRE=ON \
  -DFL_ARRAYFIRE_USE_CPU=ON \
  -DFL_USE_ONEDNN=OFF \
  -DFL_BUILD_DISTRIBUTED=OFF \
  -DFL_BUILD_TESTS=OFF \
  -DFL_BUILD_EXAMPLES=OFF
make -j$(sysctl -n hw.ncpu) # nproc is not available on macOS by default
make install

Build Flashlight bindings for Shumai:

cd shumai
mkdir -p build
cd build
cmake .. -Dflashlight_DIR=$HOME/usr/share/flashlight/cmake/
make -j$(sysctl -n hw.ncpu) # nproc is not available on macOS by default

Profiling

On macOS, you can record a performance profile with xcrun xctrace record --template "Time Profiler" --launch $(which bun) train.js.

Building on Linux from Source

First install ArrayFire. The Linux build for shumai uses the CUDA backend, but from source, you can build the CPU backend as well (OpenCL support coming soon).

Build and install Flashlight:

mkdir -p $HOME/usr/ # installing flashlight here
git clone --recursive --depth 1 https://github.com/flashlight/flashlight.git
cd flashlight
mkdir -p build
cd build
# Use another CMAKE_BUILD_TYPE if preferred.
# To build for CPU, set FL_ARRAYFIRE_USE_CPU=ON and FL_ARRAYFIRE_USE_CUDA=OFF.
cmake .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DFL_ARRAYFIRE_USE_CPU=OFF \
  -DFL_ARRAYFIRE_USE_CUDA=ON \
  -DFL_BUILD_DISTRIBUTED=OFF \
  -DFL_USE_ONEDNN=OFF \
  -DFL_BUILD_TESTS=OFF \
  -DFL_BUILD_EXAMPLES=OFF \
  -DFL_BUILD_SCRIPTS=OFF \
  -DCMAKE_INSTALL_PREFIX=$HOME/usr/
make -j$(nproc)
make install

Build bindings for shumai:

mkdir -p build && cd build
# Use another CMAKE_BUILD_TYPE if preferred.
# -DArrayFire_DIR is only needed if ArrayFire was built from source.
cmake .. \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -Dflashlight_DIR=${FLASHLIGHT_INSTALL_PREFIX}/share/flashlight/cmake \
    -DArrayFire_DIR=${ARRAYFIRE_INSTALL_PREFIX}/share/ArrayFire/cmake
make -j$(nproc)

Why build this?

With Shumai, we hope to make

Benchmarks

Benchmark data is collected from https://github.com/shumai-org/benchmarks

On an Apple M1 Pro:

| Benchmark | Shumai (bun) | TF.js (node) | Difference |
| --- | --- | --- | --- |
| 32-wide addition | 624.78K iter/s | 195.627K iter/s | 3.19x |
| 1024-wide addition | 460.008K iter/s | 94.945K iter/s | 4.84x |
| 32768-wide addition | 57.929K iter/s | 40.484K iter/s | 1.43x |
| 64-wide square matmul | 43 GFlop/s | 28.533 GFlop/s | 1.51x |
| 128-wide square matmul | 518.704 GFlop/s | 58.764 GFlop/s | 8.83x |
| 1024-wide square matmul | 2,147.771 GFlop/s | 318.826 GFlop/s | 6.74x |
| B=64, 64-wide hidden layer + 5x pointwise | 41.344K iter/s | 16.679K iter/s | 2.48x |
| B=64, 128-wide hidden layer + 5x pointwise | 24.554K iter/s | 8.563K iter/s | 2.87x |
| B=64, 1024-wide hidden layer + 5x pointwise | 2.716K iter/s | 0.969K iter/s | 2.80x |

On an Nvidia GP100:

| Benchmark | Shumai (bun) | TF.js (node) | Difference |
| --- | --- | --- | --- |
| 32-wide addition | 243.217K iter/s | 34.539K iter/s | 7.04x |
| 1024-wide addition | 144.771K iter/s | 18.006K iter/s | 8.04x |
| 32768-wide addition | 71.793K iter/s | 17.071K iter/s | 4.21x |
| 64-wide square matmul | 63.239 GFlop/s | 12.749 GFlop/s | 4.96x |
| 128-wide square matmul | 435.565 GFlop/s | 104.885 GFlop/s | 4.15x |
| 1024-wide square matmul | 7,165.062 GFlop/s | 6,470.793 GFlop/s | 1.11x |
| B=64, 64-wide hidden layer + 5x pointwise | 25.507K iter/s | 5.192K iter/s | 4.91x |
| B=64, 128-wide hidden layer + 5x pointwise | 22.529K iter/s | 4.861K iter/s | 4.63x |
| B=64, 1024-wide hidden layer + 5x pointwise | 11.568K iter/s | 2.854K iter/s | 4.05x |
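
The Difference column appears to be the Shumai throughput divided by the TF.js throughput. This is inferred from the numbers, not stated by the benchmark repository. The short TypeScript sketch below recomputes it for a few of the benchmark rows above; the `Row` type and `speedup` helper are illustrative names.

```typescript
// Recompute the "Difference" column as shumai / tfjs for sample rows.
type Row = { name: string; shumai: number; tfjs: number; reported: number }

const rows: Row[] = [
  { name: "M1 Pro, 32-wide addition (K iter/s)", shumai: 624.78, tfjs: 195.627, reported: 3.19 },
  { name: "M1 Pro, 128-wide square matmul (GFlop/s)", shumai: 518.704, tfjs: 58.764, reported: 8.83 },
  { name: "GP100, 1024-wide square matmul (GFlop/s)", shumai: 7165.062, tfjs: 6470.793, reported: 1.11 },
]

function speedup(row: Row): number {
  return row.shumai / row.tfjs
}

// A row "mismatches" if the recomputed ratio differs from the reported one
// by more than the rounding to two decimals allows.
const mismatches = rows.filter((r) => Math.abs(speedup(r) - r.reported) > 0.005)

for (const r of rows) {
  console.log(`${r.name}: ${speedup(r).toFixed(2)}x`)
}
console.log(`rows inconsistent with the reported ratio: ${mismatches.length}`)
```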

Memory Usage

While the out-of-the-box memory management may suffice in many cases, tuning memory usage can greatly improve performance by reducing unnecessary garbage collector overhead.

import { util } from '@shumai/shumai'

util.memoryOptions({
  lowerBoundThreshold: 100e6, // 100MB
  upperBoundThreshold: 5e9, // 5GB
  delayBetweenGCs: 1000 // 1s
})

Pay special attention to upperBoundThreshold: if it is exceeded, GC is forced for every allocated tensor, ignoring delayBetweenGCs. Supplying a value that fully utilizes your hardware can greatly improve performance.
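
One way to picture how these three options could interact is sketched below in plain TypeScript. Only the `upperBoundThreshold` behavior is taken from the description above. Treating `lowerBoundThreshold` as the level below which no collection is triggered, and `delayBetweenGCs` as a minimum spacing in between, are assumptions. `shouldCollect` is a hypothetical helper, not part of Shumai.

```typescript
// Hypothetical model of the GC decision; not Shumai's implementation.
interface MemoryOptions {
  lowerBoundThreshold: number // bytes
  upperBoundThreshold: number // bytes
  delayBetweenGCs: number // milliseconds
}

function shouldCollect(
  bytesUsed: number,
  msSinceLastGC: number,
  opts: MemoryOptions
): boolean {
  // Stated in the text: above the upper bound, GC runs for every
  // allocation regardless of delayBetweenGCs.
  if (bytesUsed > opts.upperBoundThreshold) return true
  // Assumption: below the lower bound, GC is never triggered.
  if (bytesUsed < opts.lowerBoundThreshold) return false
  // Assumption: in between, GC runs at most once per delayBetweenGCs.
  return msSinceLastGC >= opts.delayBetweenGCs
}

const opts: MemoryOptions = {
  lowerBoundThreshold: 100e6,
  upperBoundThreshold: 5e9,
  delayBetweenGCs: 1000,
}

const overUpper = shouldCollect(6e9, 10, opts) // over the upper bound
const midTooSoon = shouldCollect(1e9, 10, opts) // mid-range, delay not elapsed
const midElapsed = shouldCollect(1e9, 1500, opts) // mid-range, delay elapsed
const underLower = shouldCollect(50e6, 1500, opts) // under the lower bound
console.log(overUpper, midTooSoon, midElapsed, underLower)
```

Under this model, an `upperBoundThreshold` set too low for the workload keeps the first branch permanently active, which is the per-allocation GC cost the text warns about.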

Statistics

graph TD
  OpA(Op A) --> statsA{{"stats A"}};
  OpB(Op B) --> statsA;
  statsA --> LoggerA{{"LoggerConsole A"}};
  LoggerA --> Stdout(("Stdout"));
  OpC(Op C) --> statsA;
  OpD(Op D) --> statsA;
  statsA --> LoggerB("LoggerCustom B");
  LoggerB --> Disk(("Disk"));

Basic usage of gathering statistics is as simple as adding a collector using the default StatsLoggerConsole.

import { stats, StatsLoggerConsole, rand, matmul } from '@shumai/shumai'

stats.enabled = true // all ops following will capture stats

// perform ops...

stats.enabled = false // all ops following will no longer capture stats

While the above examples may suffice for simple use cases, if you're looking to capture stats across multiple threads, processes, and/or hosts, StatsLoggerHttp has you covered.

graph TD
  subgraph Host C
    Processor("LoggerHttp Processor")
    style Processor stroke:#222,stroke-width:4px,stroke-dasharray:5 5
  end
  subgraph Host A
    OpA(Op A) --> statsA{{"stats A"}};
    OpB(Op B) --> statsA;
    statsA --> LoggerA{{"LoggerHttp A"}};
    LoggerA --> Processor;
  end
  subgraph Host B
    OpC(Op C) --> statsB{{"stats B"}};
    OpD(Op D) --> statsB;
    statsB --> LoggerB{{"LoggerHttp B"}};
    LoggerB --> Processor;
  end

import { stats, StatsLoggerHttp } from '@shumai/shumai'

stats.logger = new StatsLoggerHttp({ url: 'http://localhost:4242' })

For more custom needs you can supply your own logger:

import { stats, StatsLogger, StatsLoggerData } from '@shumai/shumai'

class CustomLogger implements StatsLogger {
  async process(data: StatsLoggerData): Promise<void> {
    const summary = data.collector.getSummary()
    console.log('Collector stats:', summary)
  }
}

stats.logger = new CustomLogger()

By default, stack tracing is disabled because it adds 50%+ overhead; it can be enabled via stats.collectStacks = true.

Scoped Statistics

If you wish to isolate stats profiling you can do this as well:

import { collectStats } from '@shumai/shumai'

const scopedStats = collectStats(() => {
  // perform ops...
}/*, StatsCollectorOptions | StatsLogger */)
console.log(scopedStats.getSummary())
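
The idea behind scoped collection can be illustrated with a small stand-alone model: a collector is active only while the callback runs, and the previous state is restored afterwards. The sketch below is plain TypeScript and makes no claim about how Shumai implements `collectStats`. `MiniCollector`, `trackedOp`, and `collectScoped` are hypothetical names used only for illustration.

```typescript
// Hypothetical stand-in for a scoped collector; not Shumai's implementation.
class MiniCollector {
  private counts: Record<string, number> = {}

  record(op: string): void {
    this.counts[op] = (this.counts[op] || 0) + 1
  }

  getSummary(): Record<string, number> {
    return { ...this.counts }
  }
}

let active: MiniCollector | null = null

// Stand-in for a tensor op: it reports to whichever collector is active.
function trackedOp(name: string): void {
  if (active !== null) active.record(name)
}

function collectScoped(fn: () => void): MiniCollector {
  const previous = active
  const collector = new MiniCollector()
  active = collector
  try {
    fn()
  } finally {
    active = previous // restore the outer state even if fn throws
  }
  return collector
}

trackedOp("add") // outside any scope: not recorded
const scoped = collectScoped(() => {
  trackedOp("matmul")
  trackedOp("matmul")
  trackedOp("add")
})
const summary = scoped.getSummary()
console.log(summary)
```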

Contributing

If you'd like to make changes to the core bindings or FFI, first build from source.

All files ending in *.inl or *_gen.ts are generated. These can be modified by editing scripts/gen_binding.py and running ./scripts/gen_all_binding.sh.

See the CONTRIBUTING file for style guidance and more info on how to help out. 😁

License

shumai is MIT licensed, as found in the LICENSE file.