elixir-nx / axon

Nx-powered Neural Networks

The loss increases until it becomes NaN on XOR example #592

Closed jn-jairo closed 3 months ago

jn-jairo commented 3 months ago

The "Modeling XOR with a neural network" example don't work.

The loss increases until it becomes NaN.

modeling_xor_with_a_neural_network.livemd.zip


Modeling XOR with a neural network

Mix.install([
  {:axon, ">= 0.5.0"},
  {:exla, ">= 0.4.0"},
  {:kino_vega_lite, ">= 0.1.6"}
])

Nx.Defn.default_options(compiler: EXLA)

alias VegaLite, as: Vl
Resolving Hex dependencies...
Resolution completed in 0.254s
New:
  axon 0.6.1
  complex 0.5.0
  elixir_make 0.8.4
  exla 0.7.3
  fss 0.1.1
  kino 0.13.2
  kino_vega_lite 0.1.13
  nimble_pool 1.1.0
  nx 0.7.3
  polaris 0.1.0
  table 0.1.2
  telemetry 1.2.1
  vega_lite 0.1.9
  xla 0.6.0
* Getting axon (Hex package)
* Getting exla (Hex package)
* Getting kino_vega_lite (Hex package)
* Getting kino (Hex package)
* Getting table (Hex package)
* Getting vega_lite (Hex package)
* Getting fss (Hex package)
* Getting elixir_make (Hex package)
* Getting nimble_pool (Hex package)
* Getting nx (Hex package)
* Getting telemetry (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
* Getting polaris (Hex package)
==> table
Compiling 5 files (.ex)
Generated table app
==> vega_lite
Compiling 6 files (.ex)
Generated vega_lite app
===> Analyzing applications...
===> Compiling telemetry
==> fss
Compiling 4 files (.ex)
Generated fss app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 35 files (.ex)
Compiling lib/nx/binary_backend.ex (it's taking more than 10s)
Generated nx app
==> kino
Compiling 49 files (.ex)
Generated kino app
==> kino_vega_lite
Compiling 4 files (.ex)
Generated kino_vega_lite app
==> polaris
Compiling 5 files (.ex)
Generated polaris app
==> axon
Compiling 23 files (.ex)
    warning: function __to_backend__/1 required by behaviour Nx.Defn.Compiler is not implemented (in module Axon.Defn)
    │
  1 │ defmodule Axon.Defn do
    │ ~~~~~~~~~~~~~~~~~~~~~~
    │
    └─ lib/axon/defn.ex:1: Axon.Defn (module)

Generated axon app
==> nimble_pool
Compiling 2 files (.ex)
Generated nimble_pool app
==> elixir_make
Compiling 8 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
==> exla
Unpacking /home/jairo/.cache/xla/0.6.0/cache/download/xla_extension-x86_64-linux-gnu-cuda120.tar.gz into /home/jairo/.cache/mix/installs/elixir-1.17.2-erts-15.0.1/c26a92a7c6f01ac33e423c3bb57be660/deps/exla/cache
Using libexla.so from /home/jairo/.cache/xla/exla/elixir-1.17.2-erts-15.0.1-xla-0.6.0-exla-0.7.3-nuevz6sqxpsz6aqiw4vawazlmm/libexla.so
Compiling 26 files (.ex)
Compiling lib/exla/defn.ex (it's taking more than 10s)
Generated exla app

Introduction

In this notebook we try to create a model and teach it the logical XOR operation.

Even though XOR seems like a trivial operation, it cannot be modeled using a single dense layer (single-layer perceptron). The underlying reason is that the classes in XOR are not linearly separable. We cannot draw a straight line to separate the points $(0,0)$, $(1,1)$ from the points $(0,1)$, $(1,0)$. To model this properly, we need to turn to deep learning methods. Deep learning is capable of learning non-linear relationships like XOR.
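
For reference, here is the truth table we are trying to learn, written out with Nx (a minimal sketch; it uses Nx.logical_xor/2, the same function the data generation below relies on). No straight line in the $(x_1, x_2)$ plane separates the rows labeled $0$ from the rows labeled $1$.

# The four corner points of the XOR truth table
x1 = Nx.tensor([[0], [0], [1], [1]], type: :u8)
x2 = Nx.tensor([[0], [1], [0], [1]], type: :u8)

# Element-wise XOR of the two columns: [[0], [1], [1], [0]]
Nx.logical_xor(x1, x2)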

The model

Let's start with the model. We need two inputs, since XOR has two operands. We then concatenate them into a single input vector with Axon.concatenate/3. Then we have one hidden layer and one output layer, both of them dense.

Note: the model is a sequential neural network. In Axon, we can conveniently create such a model by using the pipe operator (|>) to add layers one by one.

x1_input = Axon.input("x1", shape: {nil, 1})
x2_input = Axon.input("x2", shape: {nil, 1})

model =
  x1_input
  |> Axon.concatenate(x2_input)
  |> Axon.dense(8, activation: :tanh)
  |> Axon.dense(1, activation: :sigmoid)
#Axon<
  inputs: %{"x1" => {nil, 1}, "x2" => {nil, 1}}
  outputs: "sigmoid_0"
  nodes: 8
>

Training data

The next step is to prepare training data. Since we are modeling a well-defined operation, we can just generate random operands and compute the expected XOR result for them.

The training works with batches of examples, so we repeatedly generate a whole batch of inputs and the expected result.

batch_size = 32

data =
  Stream.repeatedly(fn ->
    key =
      :random.uniform(9999)
      |> Nx.Random.key()
    {x1, key} =
      key
      |> Nx.Random.randint(0, 2, shape: {batch_size, 1}, type: :u8)
    {x2, _next_key} =
      key
      |> Nx.Random.randint(0, 2, shape: {batch_size, 1}, type: :u8)

    y = Nx.logical_xor(x1, x2)

    {%{"x1" => x1, "x2" => x2}, y}
  end)
#Function<53.38948127/2 in Stream.repeatedly/1>

Here's how a sample batch looks:

Enum.at(data, 0)

21:45:45.686 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

21:45:45.687 [info] XLA service 0x7fbfbc2cc9d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

21:45:45.688 [info]   StreamExecutor device (0): NVIDIA GeForce MX150, Compute Capability 6.1

21:45:45.695 [info] Using BFC allocator.

21:45:45.695 [info] XLA backend allocating 1882128384 bytes on device 0 for BFCAllocator.

21:45:46.130 [info] Loaded cuDNN version 8907

21:45:46.386 [info] Using nvlink for parallel linking
{%{
   "x1" => #Nx.Tensor<
     u8[32][1]
     EXLA.Backend<cuda:0, 0.390520155.403308570.226530>
     [
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1]
     ]
   >,
   "x2" => #Nx.Tensor<
     u8[32][1]
     EXLA.Backend<cuda:0, 0.390520155.403308570.226532>
     [
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1]
     ]
   >
 },
 #Nx.Tensor<
   u8[32][1]
   EXLA.Backend<cuda:0, 0.390520155.403308570.226534>
   [
     [0],
     [0],
     [1],
     [1],
     [1],
     [1],
     [0],
     [1],
     [1],
     [0],
     [0],
     [1],
     [1],
     [0],
     [1],
     [0],
     [1],
     [1],
     [1],
     [0],
     [1],
     [1],
     [0],
     [0],
     [1],
     [0],
     [1],
     [0],
     [0],
     [1],
     [1],
     [0]
   ]
 >}

Training

It's time to train our model. In this case we use binary cross entropy for the loss and stochastic gradient descent as the optimizer. We use binary cross entropy because computing XOR can be framed as a binary classification problem: we want the output to be a binary label, 0 or 1, and binary cross entropy is the usual loss in that case. Having defined our training loop, we run it with Axon.Loop.run/4.

epochs = 10

params =
  model
  |> Axon.Loop.trainer(:binary_cross_entropy, :sgd)
  |> Axon.Loop.run(data, %{}, epochs: epochs, iterations: 1000)
Epoch: 0, Batch: 950, loss: 0.7771471
Epoch: 1, Batch: 950, loss: NaN
Epoch: 2, Batch: 950, loss: NaN
Epoch: 3, Batch: 950, loss: NaN
Epoch: 4, Batch: 950, loss: NaN
Epoch: 5, Batch: 950, loss: NaN
Epoch: 6, Batch: 950, loss: NaN
Epoch: 7, Batch: 950, loss: NaN
Epoch: 8, Batch: 950, loss: NaN
Epoch: 9, Batch: 950, loss: NaN
%{
  "dense_0" => %{
    "bias" => #Nx.Tensor<
      f32[8]
      EXLA.Backend<cuda:0, 0.390520155.403570714.174403>
      [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]
    >,
    "kernel" => #Nx.Tensor<
      f32[2][8]
      EXLA.Backend<cuda:0, 0.390520155.403570714.174404>
      [
        [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN],
        [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]
      ]
    >
  },
  "dense_1" => %{
    "bias" => #Nx.Tensor<
      f32[1]
      EXLA.Backend<cuda:0, 0.390520155.403570714.174405>
      [NaN]
    >,
    "kernel" => #Nx.Tensor<
      f32[8][1]
      EXLA.Backend<cuda:0, 0.390520155.403570714.174406>
      [
        [NaN],
        [NaN],
        [NaN],
        [NaN],
        [NaN],
        [NaN],
        [NaN],
        [NaN]
      ]
    >
  }
}
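
For completeness, the :sgd atom can also be replaced with an explicit optimizer. Below is a minimal sketch, assuming Polaris 0.1 (pulled in by Axon, see the dependency list above) and its Polaris.Optimizers.sgd/1; it only spells out the API being called, it is not a fix for the NaN:

# Same training loop, with the optimizer and learning rate written out explicitly
# (the learning rate here is illustrative, not necessarily the value used in the run above)
optimizer = Polaris.Optimizers.sgd(learning_rate: 1.0e-2)

params =
  model
  |> Axon.Loop.trainer(:binary_cross_entropy, optimizer)
  |> Axon.Loop.run(data, %{}, epochs: epochs, iterations: 1000)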

Trying the model

Finally, we can test our model on sample data.

Axon.predict(model, params, %{
  "x1" => Nx.tensor([[0]]),
  "x2" => Nx.tensor([[1]])
})
#Nx.Tensor<
  f32[1][1]
  EXLA.Backend<cuda:0, 0.390520155.403570714.174411>
  [
    [NaN]
  ]
>

Try other combinations of $x_1$ and $x_2$ and see what the output is. To improve the model performance, you can increase the number of training epochs.
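
For example, all four binary combinations can be checked in a single batched call, using the same predict API as above:

Axon.predict(model, params, %{
  "x1" => Nx.tensor([[0], [0], [1], [1]]),
  "x2" => Nx.tensor([[0], [1], [0], [1]])
})

With a properly trained model, the four outputs should round to $0$, $1$, $1$, $0$.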

Visualizing the model predictions

The original XOR operation we modeled only works with the binary values $0$ and $1$; our model, however, operates in continuous space. This means that we can give it $x_1 = 0.5$, $x_2 = 0.5$ as input and we expect some output. We can use this to visualize the non-linear relationship between inputs $x_1$, $x_2$ and outputs that our model has learned.

# The number of points per axis determines the resolution
n = 50

# We generate coordinates of inputs in the (n x n) grid
x1 = Nx.iota({n, n}, axis: 0) |> Nx.divide(n) |> Nx.reshape({:auto, 1})
x2 = Nx.iota({n, n}, axis: 1) |> Nx.divide(n) |> Nx.reshape({:auto, 1})

# The output is also a real number, but we round it into one of the two classes
y = Axon.predict(model, params, %{"x1" => x1, "x2" => x2}) |> Nx.round()

Vl.new(width: 300, height: 300)
|> Vl.data_from_values(
  x1: Nx.to_flat_list(x1),
  x2: Nx.to_flat_list(x2),
  y: Nx.to_flat_list(y)
)
|> Vl.mark(:circle)
|> Vl.encode_field(:x, "x1", type: :quantitative)
|> Vl.encode_field(:y, "x2", type: :quantitative)
|> Vl.encode_field(:color, "y", type: :nominal)

From the plot we can clearly see that during training our model learnt two clean boundaries to separate $(0,0)$, $(1,1)$ from $(0,1)$, $(1,0)$.

jn-jairo commented 3 months ago

Alright, installing from GitHub fixed it.

Mix.install([
  {:axon, github: "elixir-nx/axon", override: true},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino_vega_lite, ">= 0.1.6"}
])

So it is just the version on Hex that has the problem.