[Open] TimJZ opened this issue 3 years ago
Hi @TimJZ,
1) In DLRM, you can decompose an embedding table of size 9994222 x 64 as follows:
9994222 < 200 x 200 x 250 (tt_p_shapes)
64 = 4 x 4 x 4 (tt_q_shapes)
Hence the shapes of the three tensor cores would be: (1, 200, 4, R1), (R1, 200, 4, R2), and (R2, 250, 4, 1).
Specifying tt_p_shapes and tt_q_shapes is optional, as the library will find values for them automatically. You do need to specify the R1 and R2 values as tt_ranks. In the paper, we set the ranks to 8, 16, 32, and 64 one at a time (with R1 = R2), i.e. [8,8], [16,16], etc. Can you try these arguments? (A constructor sketch follows right after this list.)
2) No additional changes are needed.
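To make point 1 concrete, here is a minimal sketch of constructing such a table, assuming the positional argument order (num_embeddings, embedding_dim, tt_ranks, tt_p_shapes, tt_q_shapes) and the keyword arguments that appear in the patch later in this thread:

from tt_embeddings_ops import TTEmbeddingBag

# 9994222 rows, 64-dim embeddings, TT ranks R1 = R2 = 8.
# tt_p_shapes / tt_q_shapes are left as None so the library chooses a
# factorization automatically (e.g. 200 x 200 x 250 and 4 x 4 x 4).
tt_emb = TTEmbeddingBag(
    9994222,        # num_embeddings
    64,             # embedding_dim
    [8, 8],         # tt_ranks: [R1, R2]
    None,           # tt_p_shapes (optional)
    None,           # tt_q_shapes (optional)
    sparse=False,
    weight_dist="approx-normal",
    use_cache=False,
)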
Thank you very much for your response! I've tried it with the parameters you mentioned, but I'm still getting the same error. I'm thinking this is a version-specific error. Could you please tell me which version of DLRM you were using when testing TT-Embedding? Thanks!
For the latest version of @facebookresearch/DLRM (1302c71624fa9dbe7f0c75fea719d5e58d33e059), this patch made it work for me:
+from tt_embeddings_ops import TTEmbeddingBag
+
 # from torchviz import make_dot
 # import torch.nn.functional as Functional
 # from torch.nn.parameter import Parameter
@@ -243,7 +247,14 @@ class DLRM_Net(nn.Module):
             n = ln[i]
             # construct embedding operator
-            if self.qr_flag and n > self.qr_threshold:
+            if True:
+                EE = TTEmbeddingBag(n, m, [8,8],
+                                    None, None,
+                                    sparse=False,
+                                    weight_dist="approx-normal",
+                                    use_cache=False)
+            # construct embedding operator
+            elif self.qr_flag and n > self.qr_threshold:
                 EE = QREmbeddingBag(
                     n,
                     m,
@@ -407,14 +418,24 @@ class DLRM_Net(nn.Module):
             # We are using EmbeddingBag, which implicitly uses sum operator.
             # The embeddings are represented as tall matrices, with sum
             # happening vertically across 0 axis, resulting in a row vector
-            # E = emb_l[k]
+            E = emb_l[k]
             if v_W_l[k] is not None:
                 per_sample_weights = v_W_l[k].gather(0, sparse_index_group_batch)
             else:
                 per_sample_weights = None
-            if self.quantize_emb:
+            if (isinstance(E, TTEmbeddingBag)):
+                l = sparse_index_group_batch.shape[0]
+                ll = torch.empty(1, dtype=torch.long)
+                ll[0] = l
+                if (sparse_offset_group_batch.is_cuda):
+                    ll = ll.to(torch.device("cuda"))
+                sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
+
+                V = E(sparse_index_group_batch, sparse_offset)
+                ly.append(V)
+            elif self.quantize_emb:
                 s1 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 s2 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 print("quantized emb sizes:", s1, s2)
And I ran it with a command like this:
python dlrm_s_pytorch.py --use-gpu --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.05 --mini-batch-size=128 --print-freq=1024 --print-time --test-freq=102400 --test-num-workers=16
Note that this makes all embeddings TTEmbedding; you can make only some of them TT by changing the if True statement above.
Could you try this and see if it works for you?
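The apply_emb hunk above appends the total number of indices as one extra offset before calling the TT embedding. Here is a standalone sketch of that conversion (the helper name is mine, not part of the patch), which builds the extra entry on the offsets' own device:

import torch

def to_tt_offsets(indices: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # nn.EmbeddingBag-style offsets have one entry per bag; the patched
    # TTEmbeddingBag call expects one extra trailing entry equal to the
    # total number of indices in the batch.
    total = torch.tensor([indices.shape[0]], dtype=offsets.dtype, device=offsets.device)
    return torch.cat((offsets, total), dim=0)

# e.g. inside apply_emb:
#   sparse_offset = to_tt_offsets(sparse_index_group_batch, sparse_offset_group_batch)
#   V = E(sparse_index_group_batch, sparse_offset)

Creating the extra offset on offsets.device, rather than a hard-coded torch.device("cuda") (which resolves to cuda:0), would also avoid the "Received cuda:1 and cuda:0" mismatch reported further down when DLRM's parallel_forward places tables on different GPUs.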
Hi, I have been facing the "RuntimeError: CUDA error: an illegal memory access was encountered" error as well. I am trying to train the DLRM model on the Terabyte dataset. I have made the changes you mentioned in the previous post, but I am facing the same error in the sequential_forward function.
I have tried it with PyTorch 1.8.0 with CUDA 11.0 and 10.2; both resulted in the same error:
Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch_ttemb.py", line 142, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/home/salar/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch_ttemb.py", line 529, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch_ttemb.py", line 601, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch_ttemb.py", line 430, in apply_emb
    ll = ll.to(d)
RuntimeError: CUDA error: an illegal memory access was encountered
Here is the command I used:
python3 dlrm_s_pytorch_ttemb.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --processed-data-file=/data4/salar/terabyte/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --print-freq=1024 --print-time --test-mini-batch-size=4096 --test-num-workers=16 --use-gpu --test-freq=10240 --memory-map --data-sub-sample-rate=0.875 --raw-data-file=/data4/salar/terabyte/day --mini-batch-size=2048 --mlperf-logging
While printing the contents of sparse_index_group_batch, sparse_offset_group_batch, and the embedding output to investigate, I observed that the error occurs when the current batch has many zero values in the sparse_index_group_batch tensor. Not sure if that is related, but I wanted to mention it in case it helps. (A generic index-range check is sketched after this comment.)
I would really appreciate it if you could help me figure out what the issue might be.
Thanks
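As a generic aside (not something stated in this thread): out-of-range indices are a common cause of illegal memory accesses in embedding lookups, so a cheap CPU-side check before the TT lookup can rule that out. The helper name and placement are illustrative only:

import torch

def check_indices_in_range(indices: torch.Tensor, num_embeddings: int) -> None:
    # Raise a readable error instead of a CUDA illegal memory access
    # if any index falls outside the table the embedding was built with.
    lo, hi = indices.min().item(), indices.max().item()
    assert lo >= 0 and hi < num_embeddings, (lo, hi, num_embeddings)

# e.g. before the E(sparse_index_group_batch, sparse_offset) call,
# passing the row count the table was constructed with.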
Update: the illegal memory access was being caused by the smaller embedding tables, which have a very low number of entries. By applying TTEmbedding only to the bigger tables (a sketch of that change follows after this comment), I was able to get through the apply_emb function. However, I am now facing a new issue when calling torch.cuda.synchronize():
Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1900, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1549, in run
    current_time = time_wrap(use_gpu)
  File "dlrm_s_pytorch_ttemb.py", line 122, in time_wrap
    torch.cuda.synchronize()
  File "/home/salar/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 402, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
Would it be possible to point me to how to reproduce the exact results reported in the paper for the Terabyte dataset?
Thanks
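Following up on the update above: the patch earlier in the thread converts every table (the if True: guard), and its author already noted that only some tables can be made TT by changing that guard. A standalone sketch of doing so with a size cutoff; the threshold value and helper name are illustrative, not from the patch:

import torch.nn as nn
from tt_embeddings_ops import TTEmbeddingBag

TT_ROW_THRESHOLD = 1_000_000  # illustrative cutoff, tune per model

def create_emb_op(n: int, m: int) -> nn.Module:
    # Large tables get the TT factorization; small ones stay dense, which
    # sidesteps the illegal memory access seen when tiny tables were TT.
    if n >= TT_ROW_THRESHOLD:
        return TTEmbeddingBag(n, m, [8, 8],
                              None, None,
                              sparse=False,
                              weight_dist="approx-normal",
                              use_cache=False)
    return nn.EmbeddingBag(n, m, mode="sum", sparse=True)

The isinstance(E, TTEmbeddingBag) branch in the patched apply_emb already dispatches correctly for such a mixed setup, so no further change should be needed there.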
Thank you very much for your reply! I've tried it with the patch applied and got the following error:
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1887, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1566, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 534, in forward
    return self.parallel_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 688, in parallel_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 432, in apply_emb
    sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0
Could you please give me some insight into what might be going wrong?
I'm using PyTorch 1.6.0a0+9907a3e and CUDA 11.0.167.
Since PyTorch 1.6 has no appropriate API for float gpuAtomicAdd(&cache_optimizer_state[idx], g_avg_square) (line 1711 in tt_embeddings_cuda.cu), I've also used the void version of the function and assigned the calculated value to old_sum_square_grads after the call. I don't think this is the source of the error, though.
Thanks!
Received cuda:1 and cuda:0
@TimJZ it looks like you are using two devices. We have only tested DLRM on a single GPU so far; it should fit on a single device with 16 GB of memory when training DLRM with the Terabyte and Kaggle datasets. Can you try running on a single device (i.e. by setting export CUDA_VISIBLE_DEVICES=0)?
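A quick sanity check (a minimal sketch, not DLRM-specific) that the process really sees a single device after exporting CUDA_VISIBLE_DEVICES=0:

import torch

# With CUDA_VISIBLE_DEVICES=0 set in the environment before launching Python,
# exactly one device should be visible, and everything placed on "cuda" will
# be reported as cuda:0 inside the process.
print(torch.cuda.device_count())      # expect 1
print(torch.cuda.get_device_name(0))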
I've tried it on a single GPU, but I'm consistently getting an illegal memory access error after the for loop runs 6 times:
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1888, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1567, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 532, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 604, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 431, in apply_emb
    ll = ll.to(torch.device("cuda"))
RuntimeError: CUDA error: an illegal memory access was encountered
The GPU I'm using is a Tesla V100-SXM2 with 32 GB of memory.
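An aside (an assumption on my part, not something from the thread): CUDA reports illegal memory accesses asynchronously, so the line in the traceback (ll = ll.to(torch.device("cuda"))) is not necessarily the operation that actually faulted. Forcing synchronous kernel launches usually surfaces the error at the real call site; one way, assuming it runs before any CUDA work, is:

import os

# Must be set before the first CUDA call in the process; kernels then launch
# synchronously and the failing launch raises at the line that caused it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"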
If you have --mlperf-logging in your arguments, remove it. I was facing the same issue, and it seems to be caused by enabling MLPerf logging.
I actually did not use mlperf-logging, but thanks for the feedback! I'm wondering if it's because I was using the mlperf-binloader.
@latifisalar @bilgeacun I'm wondering if you have any updates regarding this issue? Thanks!
Hi, I'm currently trying to integrate TT-Embedding into the original DLRM code base, and I've successfully reproduced the results shown in the README. However, I'm not quite sure what the essential changes are.
Right now I'm replacing the original EmbeddingBag (within create_emb in the dlrm_s_pytorch.py file) in DLRM with TTEmbeddingBag, but I have trouble figuring out the correct parameters for it. The parameters I'm using right now are:
I left tt_p_shapes and tt_q_shapes blank, since each layer's embedding dimension and number of embeddings are different. The paper mentions that the TT ranks used were [8, 16, 32, 64], but I wasn't able to use that parameter, since it fails the assertion len(self.tt_p_shapes) <= 4. Therefore I used the same parameters as the example ([12, 14]), and that results in a CUDA illegal memory access error at line 174 in tt_embedding_ops. The full error message is attached below:
1) I'm thinking this is caused by incorrect parameters, and I'm wondering if anyone could help me out here. 2) I'm also wondering if there are any additional changes that need to be made to DLRM other than replacing the EmbeddingBag.
Thanks!