igor17400 opened this issue 3 months ago
Hi, I noticed that if the learning rate is too big, there might be `nan` values returned in the forward pass. However, I ran the NRMS model on both MINDlarge and Adressa 1-2 weeks ago, and I didn't have any issues myself. Someone else faced a similar problem with an earlier version of the code and the MANNeR model here.
Thank you for your response @andreeaiana!
I just finished collecting the prints I mentioned before. I added them inside the `validation_step` as follows:
def validation_step(self, batch: RecommendationBatch, batch_idx: int):
loss, preds, targets, cand_news_size, _, _, _, _, _, _, _ = self.model_step(batch)
print("********* loss *********")
print(loss.size())
print(loss)
print("********* preds *********")
print(preds.size())
print(preds)
print("********* targets *********")
print(targets.size())
print(targets)
I tested this for the NRMS model. After 2 epochs, during the validation part of the 2nd epoch, it outputs the same error message:
IndexError: index 4607182419821563448 is out of bounds for dimension 0 with size 36
And as I suspected, the tensor values are indeed coming out as `nan`, as can be seen below:
Regarding the learning rate you mentioned, I'm using the default configuration from the file `nrms.yaml`. That is,
optimizer:
_target_: torch.optim.Adam
_partial_: true
lr: 0.0001
I'll add some print statements during the training part of epoch 2 to see what happens with the `preds` and `loss`.
Apparently some predictions are being returned as `nan`. I've added the print statements inside the `model_step` method to better visualize what happens during training. Here is one example:
########### targets ###########
torch.Size([50])
tensor([0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
########### batch[batch_cand] ###########
torch.Size([50])
tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7], device='cuda:0')
########### user_ids ###########
tensor([38572, 68013, 60944, 5008, 58917, 88430, 30919, 56731],
device='cuda:0')
tensor([38572, 68013, 60944, 5008, 58917, 88430, 30919, 56731],
device='cuda:0')
########### cand_news_ids ###########
torch.Size([50])
tensor([31602, 62391, 50135, 59893, 38783, 50675, 24423, 62360, 24111, 49180,
48019, 63970, 33619, 48046, 32544, 44422, 38263, 44290, 7419, 62563,
43102, 20678, 33885, 58114, 30172, 51398, 27845, 39115, 25764, 41178,
34876, 59673, 51048, 287, 45266, 55689, 35729, 55689, 59981, 7809,
32544, 7319, 41020, 50675, 31947, 43432, 43432, 43432, 43432, 55689],
device='cuda:0')
########### loss ###########
torch.Size([])
tensor(nan, device='cuda:0', grad_fn=<DivBackward1>)
########### preds ###########
torch.Size([80])
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
dtype=torch.float16, grad_fn=<CatBackward0>)
########### y_true ###########
torch.Size([8, 5])
tensor([[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.]], device='cuda:0')
########### targets ###########
torch.Size([40])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 1., 0., 0.], device='cuda:0')
########### batch[batch_cand] ###########
torch.Size([40])
tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7], device='cuda:0')
########### user_ids ###########
tensor([28439, 40880, 54316, 40815, 32693, 44860, 23931, 14864],
device='cuda:0')
tensor([28439, 40880, 54316, 40815, 32693, 44860, 23931, 14864],
device='cuda:0')
########### cand_news_ids ###########
torch.Size([40])
tensor([35729, 59981, 59981, 59981, 59981, 40839, 40839, 58363, 40839, 64851,
40318, 32791, 48759, 63550, 13801, 12042, 56598, 35729, 35172, 64542,
13930, 45270, 55204, 13930, 55689, 57651, 57651, 49685, 57651, 57651,
37660, 22417, 14029, 17117, 36261, 8643, 23508, 63958, 64968, 10913],
device='cuda:0')
I'm struggling to understand why this is happening 🧐
The same error happened with the MINS model as well, on the 2nd epoch.
I found this issue regarding `nn.MultiheadAttention`. Maybe that's the problem?
@andreeaiana, could you please check if our library versions match? Alternatively, you can send me your list of versions and I can compare them, whichever is best.
Result of `python --version`: Python 3.9.19
Result of `conda list > package_list.txt`: packages_list.txt
@igor17400 sure, I apologize for the slow reply.
`python --version`: Python 3.9.16
`conda list > package_list.txt`: package_list.txt

@andreeaiana no worries! Thank you very much, I'll let you know of any progress.
@andreeaiana here are my findings so far.
NRMS
As I mentioned before, there apparently is some bug with `nn.MultiheadAttention`. I therefore decided to run a test and create a new module called `custom_transformer.py`, following the comment. You can better understand what I did by looking at this file: https://github.com/andreeaiana/newsreclib/blob/b7e3357b9247a2efbb57766b5383b8a442d3f531/newsreclib/models/components/encoders/user/nrms.py
After that modification, I was able to successfully train NRMS for all epochs (10 in total); however, I don't know if this was a coincidence.
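Roughly, the replacement is just a plain multi-head self-attention written with explicit matmul/softmax instead of `nn.MultiheadAttention`. A simplified sketch of the idea (not the exact contents of `custom_transformer.py`; names and details here are illustrative):

```python
import torch
import torch.nn as nn


class CustomMultiheadSelfAttention(nn.Module):
    """Sketch of a drop-in multi-head self-attention that avoids nn.MultiheadAttention."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, embed_dim)
        b, s, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # reshape to (batch_size, num_heads, seq_len, head_dim)
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product attention, computed explicitly
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = self.dropout(attn) @ v
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out)
```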
MINER
This case is a bit different, because it doesn't use `nn.MultiheadAttention` and yet it still computes `nan` vectors. Let me show what I've found so far.
Apparently the `news_encoder` is returning a vector of `nan` values. I added print statements for all variables in the `forward` function, as shown below:
if torch.isnan(scores).any():
print("******* forward *********")
print("------- batch[x_hist] --------")
print(batch["x_hist"])
print("------- hist_news_vector --------")
print(hist_news_vector.size())
print(hist_news_vector)
print("------- batch[x_cand] --------")
print(batch["x_cand"])
print("------- cand_news_vector --------")
print(cand_news_vector.size())
print(cand_news_vector)
print("------- hist_categ_vector --------")
print(hist_categ_vector.size())
print(hist_categ_vector)
print("------- cand_categ_vector --------")
print(cand_categ_vector.size())
print(cand_categ_vector)
print("------- categ_bias_agg --------")
print(categ_bias_agg.size())
print(categ_bias_agg)
print("------- user_vector --------")
print(user_vector.size())
print(user_vector)
print("------- scores --------")
print(scores.size())
print(scores)
print("**********************************")
In addition, I would just like to highlight the following code logic:
hist_news_vector = self.news_encoder(batch["x_hist"])
hist_news_vector_agg, mask_hist = to_dense_batch(
hist_news_vector, batch["batch_hist"]
)
Then, at some point I received the warning: UserWarning: Encountered nan values in tensor. Will be removed. warnings.warn(*args, **kwargs) # noqa: B028
And I checked the prints as shown below:
------- batch[x_hist] --------
{'news_ids': tensor([30160, 24917, 30680, 10359, 36312, 21685, 57967, 24374, 40163, 17968,
29276, 61055, 31599, 33203, 62931, 41777, 17825, 19769, 5642, 59546,
7158, 51942, 54624, 51221, 63049, 477, 26799, 46866, 30727, 3259,
52551, 46795, 37509, 36754, 27922, 27140, 2735, 53494, 1267, 15253,
36053, 4166, 10919, 50635, 43142, 43623, 54469, 22570, 6523, 23571,
21977, 33707, 45729, 10059, 41997, 64408, 4593, 40716, 250, 5978,
63229, 9101, 63123, 42274, 16781, 51667, 35601, 34930, 50, 46811,
20344, 24691, 15253, 58030, 24298, 60991, 25450, 14349, 10470, 46039,
29730, 719, 2203, 31191, 20216, 16233, 6233, 64503, 9653, 17799,
30974, 42281, 46513, 44396, 5978, 13925, 40716, 23653, 9803, 60184,
61342, 42620, 46267, 52551, 62058, 23958, 28257, 15676],
device='cuda:0'), 'title': {'input_ids': tensor([[ 0, 6179, 598, ..., 1, 1, 1],
[ 0, 2264, 294, ..., 1, 1, 1],
[ 0, 1749, 9899, ..., 1, 1, 1],
...,
Look at how the `hist_news_vector` is being printed:
------- hist_news_vector --------
torch.Size([108, 256])
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
dtype=torch.float16, grad_fn=<NativeDropoutBackward0>)
Apparently there is something going on with the `user_encoder`, but in the case of MINER it uses the `PolyAttention` module, so I'm wondering why this is happening.
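To narrow down where the `nan` first appears without sprinkling prints everywhere, another option I'm considering is registering forward hooks on all submodules. Just a generic debugging sketch, not code from the repository:

```python
import torch
import torch.nn as nn


def register_nan_hooks(model: nn.Module):
    """Print the name of any module whose forward output contains nan."""

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and torch.isnan(out).any():
                    print(f"nan in output of: {name}")
                    break
        return hook

    # Keep the handles so the hooks can be removed later with handle.remove().
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```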
MANNeR
I just noticed that it uses `MHSAAddAtt`, which in turn uses `nn.MultiheadAttention`. I'll check whether a replacement similar to the one made for NRMS solves the bug.
Apparently the error, at least in the case of MINS, comes from the `news_encoder` when passing the `input_ids` and `attention_mask` to the PLM model. In the case I'm testing, that is `roberta-base`.
The prints I added were the following:
if self.encode_text:
text_vectors = [
encoder(news[name]) for name, encoder in self.text_encoders.items()
]
if torch.isnan(text_vectors[0]).any() or torch.isnan(text_vectors[1]).any():
print("@@@@@@ self.text_encoders.items() @@@@@@")
print(self.text_encoders.items())
print("********")
for name, encoder in self.text_encoders.items():
print("------- encoder --------")
print(encoder)
print("------- name --------")
print(name)
print("------- news[name] --------")
print(news[name].keys())
print(news[name])
torch.save(news[name]["input_ids"], f"{name}_input_ids.pth")
torch.save(
news[name]["attention_mask"], f"{name}_attention_mask.pth"
)
print("------- encoder(news[name]) --------")
print(encoder(news[name]))
print("---------------")
print("********")
Here is one example of `nan` values during the encoding:
********
------- encoder --------
PLM(
(plm_model): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0-11): 12 x RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): RobertaPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(multihead_attention): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(additive_attention): AdditiveAttention(
(linear): Linear(in_features=768, out_features=200, bias=True)
)
(dropout): Dropout(p=0.2, inplace=False)
)
------- name --------
abstract
------- news[name] --------
dict_keys(['input_ids', 'attention_mask'])
{'input_ids': tensor([[ 0, 37545, 2839, ..., 1, 1, 1],
[ 0, 970, 18, ..., 1, 1, 1],
[ 0, 133, 15091, ..., 1, 1, 1],
...,
[ 0, 133, 5474, ..., 1, 1, 1],
[ 0, 16035, 48032, ..., 1, 1, 1],
[ 0, 20861, 4121, ..., 1, 1, 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]], device='cuda:0')}
------- encoder(news[name]) --------
tensor([[-0.0431, 0.2167, 0.0773, ..., 0.2761, 0.1248, -0.0325],
[-0.0584, 0.2065, 0.0724, ..., 0.2703, 0.1244, -0.0341],
[-0.1138, 0.2350, 0.0594, ..., 0.3142, 0.1550, -0.0509],
...,
[-0.0830, 0.2133, 0.0742, ..., 0.2913, 0.1406, -0.0444],
[-0.0589, 0.2003, 0.0781, ..., 0.3198, 0.1290, -0.0375],
[-0.0709, 0.2073, 0.0741, ..., 0.3025, 0.1254, -0.0388]],
device='cuda:0', dtype=torch.float16, grad_fn=<SqueezeBackward1>)
---------------
------- encoder --------
PLM(
(plm_model): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0-11): 12 x RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): RobertaPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(multihead_attention): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(additive_attention): AdditiveAttention(
(linear): Linear(in_features=768, out_features=200, bias=True)
)
(dropout): Dropout(p=0.2, inplace=False)
)
------- name --------
title
------- news[name] --------
dict_keys(['input_ids', 'attention_mask'])
{'input_ids': tensor([[ 0, 37545, 2839, ..., 1, 1, 1],
[ 0, 7608, 5105, ..., 1, 1, 1],
[ 0, 510, 43992, ..., 1, 1, 1],
...,
[ 0, 673, 10188, ..., 1, 1, 1],
[ 0, 16035, 48032, ..., 1, 1, 1],
[ 0, 6407, 740, ..., 1, 1, 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]], device='cuda:0')}
------- encoder(news[name]) --------
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
dtype=torch.float16, grad_fn=<SqueezeBackward1>)
---------------
********
As can be seen, the encoding for the `title` returns an all-`nan` vector, but I don't know why this is happening.
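Since I saved the failing `input_ids` and `attention_mask` with `torch.save` above, one way to replay this batch outside the training loop and check whether precision is the culprit would be something like the following (simplified: it loads `roberta-base` directly via `transformers` instead of going through the repository's `PLM` wrapper, so it only approximates the real encoder):

```python
import torch
from transformers import AutoModel

# Reload the batch that produced nan (file names match the torch.save calls above).
input_ids = torch.load("title_input_ids.pth")
attention_mask = torch.load("title_attention_mask.pth")

model = AutoModel.from_pretrained("roberta-base").cuda().eval()
with torch.no_grad():
    # Full-precision pass
    out_fp32 = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Half-precision pass via autocast, mimicking precision: 16 training
    with torch.autocast("cuda", dtype=torch.float16):
        out_fp16 = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

print("fp32 contains nan:", torch.isnan(out_fp32).any().item())
print("fp16 contains nan:", torch.isnan(out_fp16).any().item())
```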
> @andreeaiana here are my findings so far.
> NRMS
> As I mentioned before, there apparently is some bug with `nn.MultiheadAttention`. I therefore decided to run a test and create a new module called `custom_transformer.py`, following the comment. You can better understand what I did by looking at this file: https://github.com/andreeaiana/newsreclib/blob/b7e3357b9247a2efbb57766b5383b8a442d3f531/newsreclib/models/components/encoders/user/nrms.py
> After that modification, I was able to successfully train NRMS for all epochs (10 in total); however, I don't know if this was a coincidence.
> MANNeR
> I just noticed that it uses `MHSAAddAtt`, which in turn uses `nn.MultiheadAttention`. I'll check whether a replacement similar to the one made for NRMS solves the bug.
I think if your `custom_transformer.py` fixes the `nn.MultiheadAttention` bug, we can replace the call to `nn.MultiheadAttention` with our custom implementation everywhere. I'll try to have a closer look at all the changes towards the end of this week (very sorry about this, super packed schedule at the moment) and try running some of the models again myself with an updated conda environment.
Hi @andreeaiana, sorry for the late response.
I think the substitution with the new module might have been a coincidence, but it's hard to say. This behavior seems random, depending on the samples selected during the epochs; the reason I say this is that I honestly didn't identify any pattern. I'm running the MINS model, which has previously encountered errors, and so far no error has been thrown 🧐
I just received the error again, even after changing the line https://github.com/andreeaiana/newsreclib/blob/7c21715ea3d03622aabc883224613293c0f88dd2/newsreclib/models/components/encoders/user/mins.py#L44 to the new MultiheadAttention declared locally.
MINS model:
warnings.warn(*args, **kwargs) # noqa: B028
Epoch 3: 61%|▌| 9397/15529 [44:41<29:10, 3.50it/s, v_num=e96d, val/loss=5.440, val/loss_best=5.430, val/auc=0.468, val/mrr=0.172, val/ndcg@10=0.20
I ran some more tests, and I believe the error might be associated with the precision setting: https://github.com/andreeaiana/newsreclib/blob/7c21715ea3d03622aabc883224613293c0f88dd2/configs/trainer/default.yaml#L12
I was receiving the warning `warnings.warn(*args, **kwargs) # noqa: B028` when running MINER; after I changed to `precision: 32`, it apparently works, although it takes longer to train the model. I believe that using `precision: bf16-true` should fix this issue. However, in the case of MINS, GRU apparently doesn't accept this precision yet: https://github.com/pytorch/pytorch/issues/116763
Maybe we can then switch to `precision: bf16-true` for all the cases where it works. Until it is also supported for GRU, an intermediate solution would be to train with `precision: 32` whenever `precision: 16` results in the aforementioned errors.
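For context on why `precision: 16` misbehaves while `bf16-true` shouldn't: float16 has a much smaller representable range than bfloat16, which keeps float32's exponent range, so large intermediate values (e.g. attention logits) can overflow to `inf` and then propagate as `nan`. A quick check of the ranges, just for illustration:

```python
import torch

# float16 overflows much earlier than bfloat16; bfloat16 trades precision
# for float32's exponent range, which avoids these overflows.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float32).max)   # ~3.40e38
```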
Hi @andreeaiana, I'm running some tests to evaluate whether this solution is stable, and so far I have been able to train NRMS for 10 epochs with different parameters without any errors using `precision: bf16-true`. I'll run some tests with other models to see if this solution holds.
In addition, I would like to ask you another question, if possible: is it possible to use pre-trained embeddings with MANNeR? I'm asking because it seems that the code only supports `PLM`.
> Hi @andreeaiana, I'm running some tests to evaluate whether this solution is stable, and so far I have been able to train NRMS for 10 epochs with different parameters without any errors using `precision: bf16-true`. I'll run some tests with other models to see if this solution holds.

Great, thanks a lot!

> In addition, I would like to ask you another question, if possible: is it possible to use pre-trained embeddings with MANNeR? I'm asking because it seems that the code only supports `PLM`.
Similar to MINER, MANNeR was designed from the start to work with PLMs, so I implemented it as such in NewsRecLib. I think changing the code to work also with pre-trained embeddings should be quite straightforward, but I'm not sure how these would affect the model's performance. I expect that MANNeR would still work well with pre-trained embeddings, but some extra experiments would be needed to confirm this.
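For instance, the PLM-based text encoder could be swapped for something roughly along these lines (purely an illustrative sketch, not the current NewsRecLib API; the pre-trained embedding matrix would have to be loaded separately):

```python
import torch
import torch.nn as nn


class PretrainedEmbeddingEncoder(nn.Module):
    """Illustrative text encoder backed by pre-trained word embeddings (e.g. GloVe)
    instead of a PLM; not part of the current NewsRecLib API."""

    def __init__(self, pretrained_embeddings: torch.Tensor, freeze: bool = False):
        super().__init__()
        # pretrained_embeddings: (vocab_size, embed_dim), loaded elsewhere
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=freeze)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, embed_dim); the downstream
        # attention/pooling layers would aggregate this into a news vector.
        return self.embedding(token_ids)
```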
Hello, I'm using bfloat16 with GRU. Has this been solved? Which PyTorch version supports bfloat16 for GRU?
Hi @Lr-2002,
Support for `BFloat16` in GRU is a PyTorch issue. According to this discussion, if you build PyTorch from commit 9504182 it should work.
Hi, when executing some models such as NRMS and MINER, after a few epochs—2 for NRMS and 5 for MINER—I am encountering the following error message:
From my investigations, it seems that the loss might be exploding and the forward pass is returning `nan` vectors. For instance, before receiving the error, I get warning messages like:
I am using the default configuration files:
I plan to add some print statements to better evaluate the bug. Has anyone else faced this issue?