facebookresearch / FBTT-Embedding

This is a Tensor Train (TT) based compression library for compressing the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed that this library can reduce the total model size of Facebook’s open-sourced DLRM model by up to 100x while achieving the same model quality. Our implementation is faster than state-of-the-art implementations. Existing state-of-the-art libraries also decompress the whole embedding table on the fly, so they provide no memory reduction during training. Our library decompresses only the requested rows and can therefore reduce the memory footprint per embedding table by up to 10,000x. The library also includes a software cache that stores a portion of the table entries in decompressed form for faster lookup and processing.
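
As a rough illustration of where the compression comes from, the sketch below counts the parameters of a TT-factored embedding table and compares them with the dense table. The vocabulary size, factor shapes, and ranks are made-up illustrative numbers, not values taken from this library.

# Illustrative only: parameter count of a TT-factored embedding table.
# Shapes, ranks, and vocabulary size below are hypothetical, not FBTT-Embedding defaults.

def tt_num_params(p_shapes, q_shapes, ranks):
    # TT core i has shape (r_{i-1}, p_i, q_i, r_i); the boundary ranks are 1.
    full_ranks = [1] + list(ranks) + [1]
    return sum(
        full_ranks[i] * p * q * full_ranks[i + 1]
        for i, (p, q) in enumerate(zip(p_shapes, q_shapes))
    )

vocab, dim = 1_000_000, 64                        # dense table: vocab * dim parameters
p_shapes, q_shapes = [100, 100, 100], [4, 4, 4]   # prod(p) >= vocab, prod(q) == dim
ranks = [8, 8]

dense_params = vocab * dim
tt_params = tt_num_params(p_shapes, q_shapes, ranks)
print(dense_params, tt_params, dense_params / tt_params)  # 64000000 32000 2000.0

Because a lookup only needs the slices of the TT cores selected by the requested index, rows can be decompressed individually rather than materializing the whole table.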
MIT License

cudaErrorIllegalAddress occurs when using TTEmbeddingBag and nn.EmbeddingBag at the same time #16

Closed fumihwh closed 3 years ago

fumihwh commented 3 years ago

As the title says, a cudaErrorIllegalAddress occurs when using TTEmbeddingBag and nn.EmbeddingBag at the same time. I've added self.cache_populate() to the TableBatchedTTEmbeddingBag forward method, just after self.update_cache(indices).

    def forward(
        self, indices: torch.Tensor, offsets: torch.Tensor, warmup: bool = True
    ) -> torch.Tensor:
        (indices, offsets) = indices.long(), offsets.long()

        # update hash table and lfu state
        self.update_cache(indices)
        self.cache_populate()  # line added for this experiment
        # ... rest of the original forward unchanged ...
Case | nn.Emb class | Error
self.cache_populate() in TableBatchedTTEmbeddingBag forward | with nn.EmbeddingBag | cudaErrorIllegalAddress
Call cache_populate after backward is done | with nn.EmbeddingBag | RuntimeError: CUDA error: invalid device ordinal
self.cache_populate() in TableBatchedTTEmbeddingBag forward | without nn.EmbeddingBag | (no error)
Call cache_populate after backward is done | without nn.EmbeddingBag | (no error)
self.cache_populate() in TableBatchedTTEmbeddingBag forward | with nn.Embedding | (no error)
Call cache_populate after backward is done | with nn.Embedding | (no error)

Snippets

import torch
from torch import nn
from tt_embeddings_ops import TTEmbeddingBag, OptimType

vocabulary_size = 1000
embedding_dim = 4
TT_RANK = 8
NUM_TT_CORES = 3
tt_ranks = [TT_RANK] * (NUM_TT_CORES - 1)
batch_size = 100
device = 0
use_cache = True
cache_size = vocabulary_size

use_nn_emb = True

class MyModel(nn.Module):

  def __init__(self):
    super(MyModel, self).__init__()
    self.emb1 = TTEmbeddingBag(
        vocabulary_size,
        embedding_dim,
        tt_ranks,
        None,  # tt_p_shapes,
        None,  # tt_q_shapes,
        OptimType.EXACT_ADAGRAD,
        sparse=True,
        use_cache=use_cache,
        cache_size=cache_size,
        learning_rate=0.01,
    ).to(device)
    self.emb2 = TTEmbeddingBag(
        vocabulary_size,
        embedding_dim,
        tt_ranks,
        None,  # tt_p_shapes,
        None,  # tt_q_shapes,
        OptimType.EXACT_ADAGRAD,
        sparse=True,
        use_cache=use_cache,
        cache_size=cache_size,
        learning_rate=0.01,
    ).to(device)
    # plain dense embedding bag used together with the two TTEmbeddingBags
    self.emb3 = nn.EmbeddingBag(vocabulary_size, embedding_dim,
                                mode="sum").to(device)

    self.l = nn.Linear(embedding_dim * (3 if use_nn_emb else 2), 5).to(device)

  def forward(self, x, offsets):
    # one offset per bag for nn.EmbeddingBag (each bag contains a single index)
    offsets_ori = torch.arange(x.shape[0], dtype=torch.int64, device=device)
    rs = [
        self.emb1.forward(x[:, 0], offsets),
        self.emb2.forward(x[:, 1], offsets)
    ]
    if use_nn_emb:
      rs.append(self.emb3.forward(x[:, 2], offsets_ori))
    return self.l(torch.cat(rs, dim=1))

model = MyModel()
model.train()
for e in range(10):
  grad_output = torch.rand(batch_size, 5, device=device) * 0.1
  x = torch.randint(0,
                    vocabulary_size - 10, (batch_size, 3 if use_nn_emb else 2),
                    device=device)
  # B + 1 offsets for TTEmbeddingBag; each bag contains a single index
  offsets = torch.arange(x.shape[0] + 1, dtype=torch.int64, device=device)
  y = model(x, offsets)
  y.backward(grad_output)

Env

I use the Docker image pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel.

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.19.95-17-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 460.27.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0
[pip3] torchelastic==0.2.0
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.2.0           h06a4308_296
[conda] mkl-service               2.3.0            py37h27cfd23_1
[conda] mkl_fft                   1.3.0            py37h42c9631_2
[conda] mkl_random                1.2.1            py37ha9443f7_2
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] pytorch                   1.9.0           py3.7_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchelastic              0.2.0                    pypi_0    pypi
[conda] torchtext                 0.10.0                     py37    pytorch
[conda] torchvision               0.10.0               py37_cu111    pytorch
fumihwh commented 3 years ago

~UPDATE: Should call cache_populate after backward.~

fumihwh commented 3 years ago

If I call cache_populate after backward, a RuntimeError: CUDA error: invalid device ordinal occurs.

fumihwh commented 3 years ago

Should use the same sparse param in TTEmbeddingBag and nn.EmbeddingBag.
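
For reference, a minimal sketch of that change, assuming the fix is to give nn.EmbeddingBag the same sparse setting as the TTEmbeddingBag instances in the snippet above (sparse=True):

    # Sketch of the workaround: match the sparse-gradient setting used by TTEmbeddingBag.
    self.emb3 = nn.EmbeddingBag(vocabulary_size, embedding_dim,
                                mode="sum", sparse=True).to(device)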