KarolDuracz opened this issue 2 weeks ago
I know this isn't the point of the repo, but I see that someone else (https://github.com/EurekaLabsAI/mlp/pull/18) has also tried to play with this model, so I'm putting this here for comparison.
In line 204 of https://github.com/EurekaLabsAI/mlp/blob/master/mlp_pytorch.py I added an SGD optimizer so I could toggle between the two for this test:
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-4)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```
In the large picture I tested only context lengths from 1 to 6, changing only the value in line 193. The other settings are the same as the basic setup: embedding_size = 48, hidden_size = 512, learning_rate = 7e-4, batch_size = 128, num_steps = 50000. On the top is SGD + tanh with context_length 1-6; on the bottom, AdamW + tanh with context_length 1-6. For context_length = 3 there is no data here, because that test is in the first post.
```python
context_length = 3 # if 3 tokens predict the 4th, this is a 4-gram model
```
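To make the effect of context_length concrete, here is a minimal sketch of the shapes involved (my own simplified MLP, assumed shapes and vocab size, not the repo's exact code): with context_length = T, the model concatenates T token embeddings and predicts the next token, i.e. a (T+1)-gram model.

```python
import torch

vocab_size, embedding_size, hidden_size = 27, 48, 512  # vocab_size is an assumption
context_length = 3  # 3 tokens predict the 4th -> a 4-gram model

emb = torch.nn.Embedding(vocab_size, embedding_size)
fc1 = torch.nn.Linear(context_length * embedding_size, hidden_size)
fc2 = torch.nn.Linear(hidden_size, vocab_size)

idx = torch.randint(0, vocab_size, (128, context_length))  # a batch of 128 contexts
x = emb(idx).view(128, -1)        # concatenate the T embeddings: (128, T*48)
logits = fc2(torch.tanh(fc1(x)))  # (128, vocab_size)
print(x.shape, logits.shape)
```

Growing context_length only widens the first linear layer's input, which is why the sweep from 1 to 6 is cheap to run.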
Then I changed embedding_size in line 194 from 48 to 96 while keeping hidden_size = 512; on the right side, embedding_size = 96 and hidden_size = 1024. So first I doubled emb_size and then hidden_size to see what happens. There is only a chart for the second test, but samples for both.
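For a rough sense of how much capacity each doubling adds, here is my own back-of-the-envelope parameter count for a plain 3-layer MLP of this shape (assumed vocab size 27, ignoring any extra layers the repo's model may have):

```python
# embedding table + fc1 (weights + bias) + fc2 (weights + bias)
def n_params(vocab_size, emb, hidden, context_length):
    return (vocab_size * emb
            + (context_length * emb) * hidden + hidden
            + hidden * vocab_size + vocab_size)

base = n_params(27, 48, 512, 3)    # basic setup
emb2 = n_params(27, 96, 512, 3)    # embedding_size x2
hid2 = n_params(27, 96, 1024, 3)   # then hidden_size x2 as well
print(base, emb2, hid2)
```

Under these assumptions, doubling the embedding size roughly doubles the parameter count (the first linear layer dominates), and doubling the hidden size on top of that roughly doubles it again.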
These plots come from the following data collection code:
```python
# the arr_* lists are initialized once, before the training loop
arr_loss = []
arr_val = []
arr_var = []
arr_std = []
arr_mean = []
arr_grad_var = []
arr_grad_std = []
arr_grad_mean = []

with timer:
    # get the next batch of training data
    inputs, targets = next(train_data_iter)
    # forward pass (calculate the loss)
    logits, loss = model(inputs, targets)
    # backward pass (calculate the gradients)
    loss.backward()
    # train_loss and val_loss come from the periodic evaluation in the loop
    arr_loss.append(train_loss)
    arr_val.append(val_loss)
    # per-parameter statistics of the weights and their gradients
    for p in model.parameters():
        arr_var.append(p.data.var().item())
        arr_std.append(p.data.std().item())
        arr_mean.append(p.data.mean().item())
        arr_grad_var.append(p.grad.var().item())
        arr_grad_std.append(p.grad.std().item())
        arr_grad_mean.append(p.grad.mean().item())
```
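For reference, a minimal sketch (my own plotting code, not from the repo) of how such arrays can be turned into loss curves with matplotlib; the values here are dummy stand-ins for the collected data:

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# dummy data standing in for the arrays collected during training
arr_loss = [3.2, 2.5, 2.1, 1.9]
arr_val = [3.3, 2.7, 2.4, 2.3]

fig, ax = plt.subplots()
ax.plot(arr_loss, label="train loss")
ax.plot(arr_val, label="val loss")
ax.set_xlabel("step")
ax.set_ylabel("loss")
ax.legend()
fig.savefig("loss_curves.png")
```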
Maybe there is something useful in this spam. Regards.
Hi. Sorry for the spam, but I'm trying to understand deeply how this works: where the loss comes from, why it behaves the way it does, etc. Two of the things that improved ML were optimizers and activation functions. At first I tried looking for something in the datasets and parameters, the random number generator, and the std, var, and mean of logits.data and logits.grad. Then, just to see how the result changes, I swapped AdamW for SGD and tanh for ReLU. I also looked at other activation functions like RReLU and Hardshrink, but the basic comparison is SGD vs. AdamW.
This is just to see that AdamW is better.
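To isolate the optimizer effect from the model, here is a toy comparison of my own (a tiny linear regression, unrelated to the repo's MLP) using the same two optimizer configurations. Note that which one wins depends heavily on the learning rate and the problem; on this convex toy task, SGD with lr=0.1 converges much faster than AdamW with lr=7e-4:

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 10)
w_true = torch.randn(10, 1)
y = X @ w_true  # noiseless linear targets

def train(opt_name):
    model = torch.nn.Linear(10, 1)
    if opt_name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=7e-4, weight_decay=1e-4)
    for _ in range(200):
        loss = torch.nn.functional.mse_loss(model(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("SGD final loss:  ", train("sgd"))
print("AdamW final loss:", train("adamw"))
```

This is why single-number comparisons like "AdamW is better" need the caveat that both optimizers were run at their own (not jointly tuned) learning rates.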
I was wondering whether the number of 'EOT' and 'a' characters in the dataset could have any significance for optimization here.
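Counting those frequencies is cheap to check. A minimal sketch (my own code, with a hypothetical stand-in dataset; in a names-style character dataset the newline typically plays the role of the EOT token):

```python
from collections import Counter

# hypothetical stand-in for the real dataset: newline-separated names
text = "emma\nolivia\nava\nisabella\n"
counts = Counter(text)
print("EOT (newline):", counts["\n"])
print("'a':", counts["a"])
```

If 'a' dominates the character distribution, the model's unigram prior (and hence the very first steps of the loss curve) will reflect that imbalance.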
SGD + tanh vs. AdamW + tanh
SGD + ReLU vs. AdamW + ReLU
Like I said, I put this here because I want to dive deep into this and learn more about it. Sorry for the spam.