facebookresearch / FBTT-Embedding

This is a Tensor Train based compression library for compressing the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing models. We showed that this library can reduce the total model size of Facebook's open-sourced DLRM model by up to 100x while achieving the same model quality. Our implementation is faster than the state-of-the-art implementations. Existing state-of-the-art libraries also decompress the whole embedding table on the fly, so they do not reduce memory during training. Our library decompresses only the requested rows and can therefore reduce the runtime memory footprint per embedding table by up to 10,000x. The library also includes a software cache that stores a portion of the table entries in decompressed format for faster lookup and processing.
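
A minimal usage sketch is shown below. This is not the official example: the exact constructor signature is defined in tt_embeddings_ops.py, and the argument names and the offsets convention here simply mirror the calls that appear in the issue discussion that follows.

    # Minimal usage sketch (assumptions: constructor keywords and the B+1 offsets
    # convention are taken from the DLRM integration discussed in the issue below).
    import torch
    from tt_embeddings_ops import TTEmbeddingBag

    num_embeddings, embedding_dim = 1_000_000, 64  # example table size

    emb = TTEmbeddingBag(
        num_embeddings,
        embedding_dim,
        tt_ranks=[8, 8],       # [R1, R2]
        tt_p_shapes=None,      # None: let the library factorize num_embeddings
        tt_q_shapes=None,      # None: let the library factorize embedding_dim
        sparse=False,
        weight_dist="approx-normal",
        use_cache=False,
    ).to("cuda")

    # EmbeddingBag-style lookup: flat indices plus bag offsets. As in the DLRM
    # patch below, the offsets tensor carries one extra trailing entry equal to
    # the total number of indices.
    indices = torch.randint(0, num_embeddings, (4096,), device="cuda")
    offsets = torch.arange(0, 4096 + 1, 32, device="cuda")  # 128 bags of 32 indices
    pooled = emb(indices, offsets)                          # expected shape (128, 64)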
MIT License

Correct Integration of TT-Embedding to DLRM #10

Open TimJZ opened 3 years ago

TimJZ commented 3 years ago

Hi, I'm currently trying to integrate TT-Embedding into the original DLRM code base, and I've successfully reproduced the results shown in the readme. However, I'm not quite sure what the essential changes are.

Right now I'm replacing the original EmbeddingBag (within create_emb in the dlrm_s_pytorch.py file) in DLRM with TTEmbeddingBag, but I'm having trouble figuring out the correct parameters for it. The parameters I'm using right now are:

    EE = TTEmbeddingBag(
        n,
        m,
        tt_ranks=[12,14],
        sparse=False,
        use_cache=False,
        weight_dist="uniform"
    )

I left tt_p_shapes and tt_q_shapes blank, since each layer's embedding dimension and number of embeddings are different. The paper mentions that the TT-ranks used were [8, 16, 32, 64], but I wasn't able to use that setting, since it fails the assertion len(self.tt_p_shapes) <= 4. I therefore used the same parameters as in the example ([12,14]).

That results in a CUDA illegal memory access error at line 174 of tt_embedding_ops. The full error message is attached below:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1013, in <module>
    Z = dlrm_wrap(X, lS_o, lS_i, use_gpu, device)
  File "dlrm_s_pytorch.py", line 866, in dlrm_wrap
    return dlrm(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 385, in forward
    return self.parallel_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 470, in parallel_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l)
  File "dlrm_s_pytorch.py", line 328, in apply_emb
    V = E(sparse_index_group_batch, sparse_offset_group_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/dlrm/tt_embeddings_ops.py", line 801, in forward
    output = TTLookupFunction.apply(
  File "/mnt/dlrm/tt_embeddings_ops.py", line 174, in forward
    output = tt_embeddings.tt_forward(
RuntimeError: CUDA error: an illegal memory access was encountered

1) I'm thinking this is caused by incorrect parameters and I'm wondering if anyone could help me out here. 2) I'm also wondering whether any additional changes need to be made to DLRM other than replacing the EmbeddingBag.

Thanks!

bilgeacun commented 3 years ago

Hi @TimJZ,

1) In DLRM, you can decompose an embedding table of size 9994222 x 64 as follows: 9994222 < 200 × 200 × 250 (tt_p_shapes) and 64 = 4 × 4 × 4 (tt_q_shapes).

Hence the shapes of the three tensor cores would be (1, 200, 4, R1), (R1, 200, 4, R2), and (R2, 250, 4, 1).

Specifying tt_p_shapes and tt_q_shapes is optional, as the library will choose values for them automatically. You do need to specify the R1 and R2 values as tt_ranks. In the paper, we set the ranks to 8, 16, 32, and 64 one at a time (with R1 = R2), i.e. [8,8], [16,16], etc. Can you try these arguments? (A construction sketch follows at the end of this comment.)

2) No additional changes are needed.
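
For illustration, here is a construction sketch for the 9994222 x 64 table above, written with keyword arguments for clarity. R1 = R2 = 8 is just one of the rank settings, and the sparse / weight_dist / use_cache values are example choices, not the only valid ones.

    from tt_embeddings_ops import TTEmbeddingBag

    # Sketch for the 9994222 x 64 table discussed above. tt_p_shapes and
    # tt_q_shapes can also be left as None and the library will pick the
    # factorization automatically.
    EE = TTEmbeddingBag(
        9994222,                       # number of embeddings (rows)
        64,                            # embedding dimension
        tt_ranks=[8, 8],               # [R1, R2]; try 8, 16, 32, or 64 with R1 == R2
        tt_p_shapes=[200, 200, 250],   # 200 x 200 x 250 >= 9994222
        tt_q_shapes=[4, 4, 4],         # 4 x 4 x 4 == 64
        sparse=False,
        weight_dist="approx-normal",
        use_cache=False,
    )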

TimJZ commented 3 years ago

Thank you very much for your response! I've tried the parameters you mentioned, but I'm still getting the same error. I'm thinking this is a version-specific error. Could you please tell me which version of DLRM you were using when testing TT-Embedding? Thanks!

bilgeacun commented 3 years ago

For the latest version of @facebookresearch/DLRM (1302c71624fa9dbe7f0c75fea719d5e58d33e059), this patch made it work for me:

+from tt_embeddings_ops import TTEmbeddingBag
+
 # from torchviz import make_dot
 # import torch.nn.functional as Functional
 # from torch.nn.parameter import Parameter
@@ -243,7 +247,14 @@ class DLRM_Net(nn.Module):
             n = ln[i]

             # construct embedding operator
-            if self.qr_flag and n > self.qr_threshold:
+            if True:
+                EE = TTEmbeddingBag(n, m, [8,8],
+                        None, None,
+                        sparse=False,
+                        weight_dist="approx-normal",
+                        use_cache=False)
+            # construct embedding operator
+            elif self.qr_flag and n > self.qr_threshold:
                 EE = QREmbeddingBag(
                     n,
                     m,
@@ -407,14 +418,24 @@ class DLRM_Net(nn.Module):
             # We are using EmbeddingBag, which implicitly uses sum operator.
             # The embeddings are represented as tall matrices, with sum
             # happening vertically across 0 axis, resulting in a row vector
-            # E = emb_l[k]
+            E = emb_l[k]

             if v_W_l[k] is not None:
                 per_sample_weights = v_W_l[k].gather(0, sparse_index_group_batch)
             else:
                 per_sample_weights = None

-            if self.quantize_emb:
+            if (isinstance(E, TTEmbeddingBag)):
+                l = sparse_index_group_batch.shape[0]
+                ll = torch.empty(1, dtype=torch.long)
+                ll[0]=l
+                if (sparse_offset_group_batch.is_cuda):
+                    ll = ll.to(torch.device("cuda"))
+                sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
+
+                V = E(sparse_index_group_batch,sparse_offset)
+                ly.append(V)
+            elif self.quantize_emb:
                 s1 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 s2 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 print("quantized emb sizes:", s1, s2)

And I ran it with a command like this: python dlrm_s_pytorch.py --use-gpu --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.05 --mini-batch-size=128 --print-freq=1024 --print-time --test-freq=102400 --test-num-workers=16

Note that this makes all embeddings TTEmbedding; you can make only some of them TT by changing the if True statement above. Could you try this and see if it works for you?
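
For instance, one way to limit TT compression to the larger tables is to replace the if True check with a row-count threshold. The helper below is only a sketch: the helper name and the 1,000,000-row cutoff are illustrative values I made up, not values from DLRM or the library.

    from tt_embeddings_ops import TTEmbeddingBag

    # Hypothetical helper: only compress tables with more than tt_row_threshold rows.
    # The cutoff is an arbitrary example; smaller tables keep the stock DLRM path.
    def maybe_tt_embedding(n, m, tt_row_threshold=1_000_000):
        if n > tt_row_threshold:
            return TTEmbeddingBag(n, m, [8, 8],
                                  None, None,
                                  sparse=False,
                                  weight_dist="approx-normal",
                                  use_cache=False)
        return None  # caller falls back to QREmbeddingBag / nn.EmbeddingBag

In create_emb you would then call this helper for each table and only take the TT branch when it returns a module, keeping the original QREmbeddingBag / EmbeddingBag code for the rest.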

latifisalar commented 3 years ago

Hi, I have been facing the "RuntimeError: CUDA error: an illegal memory access was encountered" error as well while trying to train the DLRM model on the Terabyte dataset. I made the changes you mentioned in the previous post, but I am hitting the same error in the sequential_forward function.

I have tried PyTorch 1.8.0 with CUDA 11.0 and with CUDA 10.2; both result in the same error:

Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch_ttemb.py", line 142, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/home/salar/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch_ttemb.py", line 529, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch_ttemb.py", line 601, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch_ttemb.py", line 430, in apply_emb
    ll = ll.to(d)
RuntimeError: CUDA error: an illegal memory access was encountered
done

Here is the command I used: python3 dlrm_s_pytorch_ttemb.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --processed-data-file=/data4/salar/terabyte/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --print-freq=1024 --print-time --test-mini-batch-size=4096 --test-num-workers=16 --use-gpu --test-freq=10240 --memory-map --data-sub-sample-rate=0.875 --raw-data-file=/data4/salar/terabyte/day --mini-batch-size=2048 --mlperf-logging

I printed the contents of sparse_index_group_batch, sparse_offset_group_batch, and the embedding output to see what the possible issue could be, and I observed that the error occurs when the current batch has many zero values in the sparse_index_group_batch tensor. Not sure if it's related, but I wanted to mention it in case it helps.

I would really appreciate it if you could help me figure out what the issue might be.

Thanks

latifisalar commented 3 years ago

Update: the illegal memory access was being caused by the smaller embedding tables, which have a very low number of entries. By applying TTEmbedding only to the bigger embeddings, I was able to get through the apply_emb function. However, I am now facing a new issue when calling torch.cuda.synchronize():

Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1900, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1549, in run
    current_time = time_wrap(use_gpu)
  File "dlrm_s_pytorch_ttemb.py", line 122, in time_wrap
    torch.cuda.synchronize()
  File "/home/salar/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 402, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered

Would it be possible to point me to how to reproduce the exact results reported in the paper for the Terabyte dataset?

Thanks

TimJZ commented 3 years ago

For the latest version of @facebookresearch/DLRM (1302c71624fa9dbe7f0c75fea719d5e58d33e059), this patch made it work for me: [...] Could you try this and see if it works for you?

Thank you very much for your reply! I've tried it with the patch applied and got the following error:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1887, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1566, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 534, in forward
    return self.parallel_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 688, in parallel_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 432, in apply_emb
    sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 

Could you please give me some insight into what might be going wrong?

I'm using PyTorch 1.6.0a0+9907a3e and CUDA 11.0.167.

Since PyTorch 1.6 has no appropriate API for the float-returning gpuAtomicAdd(&cache_optimizer_state[idx], g_avg_square) (line 1711 in tt_embeddings_cuda.cu), I used the void version of the function and assigned the calculated value to old_sum_square_grads after the function call. I don't think this is the source of the error, though.

Thanks!

bilgeacun commented 3 years ago

Received cuda:1 and cuda:0

@TimJZ it looks like you are using two devices. We have only tested DLRM on a single GPU so far; it should fit on a single device with 16 GB of memory when training DLRM on the Terabyte and Kaggle datasets. Can you try running on a single device (i.e. by setting export CUDA_VISIBLE_DEVICES=0)?

TimJZ commented 3 years ago

Received cuda:1 and cuda:0

@TimJZ it looks like you are using two devices. We have only tested DLRM on a single GPU so far; it should fit on a single device with 16 GB of memory when training DLRM on the Terabyte and Kaggle datasets. Can you try running on a single device (i.e. by setting export CUDA_VISIBLE_DEVICES=0)?

I've tried it on a single GPU, but I'm consistently getting an illegal memory access error after the for loop runs 6 times:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1888, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1567, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 532, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 604, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 431, in apply_emb
    ll = ll.to(torch.device("cuda"))
RuntimeError: CUDA error: an illegal memory access was encountered

The GPU I'm using is a Tesla V100-SXM2 with 32 GB of memory.

latifisalar commented 3 years ago

If you have --mlperf-logging in your arguments, remove it. I was facing the same issue, and it seems to be caused by enabling MLPerf logging.

TimJZ commented 3 years ago

If you have --mlperf-logging in your arguments, remove it. I was facing the same issue, and it seems to be caused by enabling MLPerf logging.

I actually did not use --mlperf-logging, but thanks for the feedback! I'm wondering if it's because I was using the mlperf-binloader.

TimJZ commented 3 years ago

@latifisalar @bilgeacun I'm wondering if you have any updates on this issue? Thanks!