EurekaLabsAI / mlp

The Multilayer Perceptron Language Model

Try to beat val loss 2.06 in basic setup #22

Open KarolDuracz opened 2 weeks ago

KarolDuracz commented 2 weeks ago

Hi. Sorry for the spam, but I'm trying to understand deeply how this works: where the loss comes from, why it takes the value it does, and so on. Some of the things that historically improved ML are optimizers and activation functions. At first I looked at the dataset and the parameters, at the random number generator, and at the std, var, and mean of logits.data and logits.grad. Then, just to see what the result looks like, I changed AdamW to SGD and Tanh to ReLU. I also looked at other activations like RReLU and Hardshrink, but the basic comparison is SGD vs. AdamW.
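
For reference, this is roughly how those logits statistics can be pulled out inside the training loop (a minimal sketch, assuming the `model(inputs, targets)` call from mlp_pytorch.py that returns `(logits, loss)`; note that `logits` is a non-leaf tensor, so `retain_grad()` is needed before the backward pass, otherwise `logits.grad` stays `None`):

```python
# minimal sketch of how the logits statistics can be collected
# (assumes the training loop objects from mlp_pytorch.py: model, train_data_iter)
inputs, targets = next(train_data_iter)
logits, loss = model(inputs, targets)
logits.retain_grad()  # logits is a non-leaf tensor, so .grad is None unless we ask for it
loss.backward()

print("logits.data: mean %.4f std %.4f var %.4f"
      % (logits.data.mean().item(), logits.data.std().item(), logits.data.var().item()))
print("logits.grad: mean %.6f std %.6f var %.6f"
      % (logits.grad.mean().item(), logits.grad.std().item(), logits.grad.var().item()))
```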

This is just to see that AdamW is better.
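
The activation swap itself is a one-line change in the forward pass. A minimal sketch of the idea (the layer names here are mine, not the exact ones from mlp_pytorch.py):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    # minimal sketch of the model body, only to show where the activation swap happens;
    # the layer names (wte, fc1, fc2) are my own, not copied from mlp_pytorch.py
    def __init__(self, vocab_size, context_length, embedding_size, hidden_size):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, embedding_size)
        self.fc1 = nn.Linear(context_length * embedding_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, idx):
        emb = self.wte(idx).view(idx.size(0), -1)  # (B, T) -> (B, T*embedding_size)
        h = torch.tanh(self.fc1(emb))              # original activation
        # h = torch.relu(self.fc1(emb))            # the swap I tested
        return self.fc2(h)                         # logits
```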

I was wondering whether the number of 'EOT' and 'a' tokens in the datasets could have any significance for optimization here (attached plot: data_p).
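
To check that, the counts can be read straight from the training file. A sketch, assuming the plain-text names data used by the repo (one name per line, with the newline playing the role of the 'EOT' token) and a data/train.txt path:

```python
# quick sanity check of the character counts in the training data
# (assumes a plain-text file at data/train.txt; adjust the path if needed)
from collections import Counter

with open("data/train.txt", "r") as f:
    text = f.read()

counts = Counter(text)
print("newline ('EOT'):", counts["\n"])
print("'a':", counts["a"])
print("most common:", counts.most_common(10))
```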

SGD + tanh vs. AdamW + tanh (attached plot: first_test)

SGD + ReLU vs. AdamW + ReLU (attached plot: second test)

Like I said, I put this here because I want to dive deep into it and learn more about it. Sorry for the spam.

KarolDuracz commented 2 weeks ago

I know that's not the point, but I see that someone else also tried to play with this model (https://github.com/EurekaLabsAI/mlp/pull/18), so I'm putting this here for comparison.

At line 204 of https://github.com/EurekaLabsAI/mlp/blob/master/mlp_pytorch.py I added this SGD optimizer to run the test:

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-4)  # repo default
#optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # swapped in for the SGD runs

And in the large picture below I tested only context lengths from 1 to 6, changing only the value in line 193. The other settings are the same as the basic setup: embedding_size = 48, hidden_size = 512, learning_rate = 7e-4, batch_size = 128, num_steps = 50000. On the top is SGD + tanh with context_length 1-6; on the bottom, AdamW + tanh with context_length 1-6. For context_length = 3 there is no data, because that test is in the first post.

context_length = 3 # if 3 tokens predict the 4th, this is a 4-gram model

(attached plot: all_context_length_1_to_6)
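
For the sweep itself I just edited line 193 and re-ran the script by hand. As a sketch, with a hypothetical train_one_config() wrapper around the existing training loop (the script hard-codes its hyperparameters, so no such function exists in the repo), it would look like this:

```python
# sketch of the sweep I ran manually; train_one_config is a hypothetical wrapper
# around the training loop in mlp_pytorch.py that takes the hyperparameters as
# arguments and returns the final validation loss
results = {}
for context_length in range(1, 7):  # 1..6, as in the plots above
    results[context_length] = train_one_config(
        context_length=context_length,
        embedding_size=48,
        hidden_size=512,
        learning_rate=7e-4,
        batch_size=128,
        num_steps=50000,
    )

for cl, val in sorted(results.items()):
    print(f"context_length={cl}: val loss {val:.4f}")
```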

Then I changed embedding_size = 48 in line 194 to 96, keeping hidden_size = 512; on the right side, embedding_size = 96 and hidden_size = 1024. So first I multiplied embedding_size by 2, then hidden_size by 2, to see what happens. There's only a chart for the second test, but samples for both.

(attached plot: emb_size 96 and hidden_size change to 1024)
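
For scale, this is roughly how the parameter count moves with those two changes. A sketch that assumes the usual layout of this MLP (embedding table of vocab_size x embedding_size, one hidden layer over the concatenated context embeddings, an output layer back to the vocabulary, with biases) and vocab_size = 27 for the names data:

```python
# rough parameter count for the configs compared above
# (assumes: Embedding(vocab, emb) + Linear(context*emb, hidden) + Linear(hidden, vocab), with biases)
def n_params(vocab_size=27, context_length=3, embedding_size=48, hidden_size=512):
    emb = vocab_size * embedding_size
    fc1 = context_length * embedding_size * hidden_size + hidden_size
    fc2 = hidden_size * vocab_size + vocab_size
    return emb + fc1 + fc2

for e, h in [(48, 512), (96, 512), (96, 1024)]:
    print(f"emb={e:3d} hidden={h:4d} -> {n_params(embedding_size=e, hidden_size=h):,} params")
```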

These plots are generated from this data:

arr_loss = []  # training loss per logged step
arr_val = []   # validation loss per logged step
arr_var = []   # variance of each parameter tensor
arr_std = []   # std of each parameter tensor
arr_mean = []  # mean of each parameter tensor

arr_grad_var = []   # variance of each parameter's gradient
arr_grad_std = []   # std of each parameter's gradient
arr_grad_mean = []  # mean of each parameter's gradient

with timer:
    # get the next batch of training data
    inputs, targets = next(train_data_iter)
    # forward pass (calculate the loss)
    logits, loss = model(inputs, targets)
    # backward pass (calculate the gradients)
    loss.backward()

    # train_loss and val_loss come from the periodic evaluation elsewhere in the script
    arr_loss.append(train_loss)
    arr_val.append(val_loss)

    # statistics of every parameter tensor and its gradient
    for p in model.parameters():
        arr_var.append(p.data.var().item())
        arr_std.append(p.data.std().item())
        arr_mean.append(p.data.mean().item())
        arr_grad_var.append(p.grad.var().item())
        arr_grad_std.append(p.grad.std().item())
        arr_grad_mean.append(p.grad.mean().item())
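
And the plotting side, roughly (a matplotlib sketch; the per-parameter stats are appended once per parameter per step, so they're interleaved across parameters, but the overall trend is still visible):

```python
# rough version of how the collected arrays can be plotted
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(arr_loss, label="train loss")
ax1.plot(arr_val, label="val loss")
ax1.set_xlabel("logged step")
ax1.set_ylabel("loss")
ax1.legend()

# gradient statistics, appended per parameter per step (interleaved)
ax2.plot(arr_grad_std, label="grad std (all params)")
ax2.plot(arr_grad_mean, label="grad mean (all params)")
ax2.set_xlabel("append index")
ax2.legend()

plt.tight_layout()
plt.show()
```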

Maybe there is something useful in this spam. Regards.