Bihaqo / t3f

Tensor Train decomposition on TensorFlow
https://t3f.readthedocs.io/en/latest/index.html
MIT License

What is the difference between max_tt_rank and tt_rank? #218

Open miladdona opened 3 years ago

miladdona commented 3 years ago

Hi guys,

I have a simple model and I want to apply the t3f library to a dense layer of the model with shape (4536, 100). There are different possible combinations, but I want to use this one, [[2, 2, 2, 567], [2, 2, 5, 5]], and define the rank as 10.

Wtt = t3f.to_tt_matrix(W, shape=[[2, 2, 2, 567], [2, 2, 5, 5]], max_tt_rank=10)
tt_layer = t3f.nn.KerasDense(input_dims=[2, 2, 2, 567], output_dims=[2, 2, 5, 5], tt_rank=10, activation='relu')

But after running I get this error: ValueError: Layer weight shape (1, 2, 2, 20) not compatible with provided weight shape (1, 2, 2, 4)

I think this is related to max_tt_rank in the first statement and tt_rank in the second statement. I want to know what the difference between them is and how I can control this.

Thanks.

Bihaqo commented 3 years ago

Hi

A few things:

1) I had reasons to name max_tt_rank and tt_rank differently, but now that you've questioned it, I realise those reasons were never convincing enough and you're totally right: they should have the same name (tt_rank).

2) You've hit a frequent problem common to many TT codebases: your TT-rank is bigger than the theoretical maximal useful TT-rank. The TT-rank is actually a list; when you define it with the number 10, it gets silently converted into the list (1, 10, 10, 10, 1) for you (the list has 5 elements because your underlying tensor is 4-dimensional; it always has 1 as the first and last element). The second of those TT-ranks is redundantly big. You can change the code to

tt_layer = t3f.nn.KerasDense(input_dims=[2, 2, 2, 567], output_dims=[2, 2, 5, 5], tt_rank=(1, 4, 10, 10, 1), activation='relu')

and I believe it should work.

3) Actually, I wouldn't recommend using such an imbalanced tensor shape. Very likely you would be better off padding your input size 4536 to e.g. 5000 and then using input_dims = (10, 10, 10, 5) or something like this (see the sketch at the end of this comment). This would also fix your previous problem: with a more balanced shape, the TT-rank 10 should work out of the box.

4) Also note that the TT-layer might be sensitive to the order of inputs and outputs, i.e. it might work a lot worse if you shuffle your output dimensions. This is not a problem if the layer is in the middle of an MLP (because the surrounding dense layers can provide features in any order that is useful for your TT-layer), but it might be problematic if the TT-layer is the last layer, since the order of outputs would be defined by the (arbitrary) order of your labels. TL;DR: if this is the last layer in your network, I would also try adding another dense layer of size 100 x 100 on top of it.
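A minimal sketch of what point 3 could look like, assuming the padding is done with a Keras Lambda layer in front of the TT-layer (the padded size 5000, the Lambda-based padding, and the final Dense layer are illustrative choices following the suggestions above, not something prescribed by t3f):

import tensorflow as tf
import t3f

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(4536,)),
    # Zero-pad the features from 4536 to 5000 so the input factors into balanced dims.
    tf.keras.layers.Lambda(lambda x: tf.pad(x, [[0, 0], [0, 5000 - 4536]])),
    # Balanced factorization: 10 * 10 * 10 * 5 = 5000 inputs, 2 * 2 * 5 * 5 = 100 outputs.
    t3f.nn.KerasDense(input_dims=[10, 10, 10, 5], output_dims=[2, 2, 5, 5],
                      tt_rank=10, activation='relu'),
    # Optional extra 100 x 100 dense layer on top, as suggested in point 4.
    tf.keras.layers.Dense(100),
])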

miladdona commented 3 years ago

Thanks. Is there a way to find the list of TT-ranks? I mean, how did you find tt_rank=(1, 4, 10, 10, 1)? Did you find it by running the code, or from some equations and relations?

Bihaqo commented 3 years ago

So the idea is that if your input dims are [a1, a2, a3] and your output dims are [b1, b2, b3], then your TT-ranks should be smaller than np.minimum([1, a1*b1, a1*b1*a2*b2, a1*b1*a2*b2*a3*b3], [a1*b1*a2*b2*a3*b3, a2*b2*a3*b3, a3*b3, 1]).

In this case it's np.minimum([1, 4, 16, 160, 453600], [453600, 113400, 28350, 2835, 1]) = [1, 4, 16, 160, 1].
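To make the arithmetic concrete, here is a small NumPy sketch (just the computation above, not a t3f API call) that computes the maximal TT-ranks for this shape and clips a requested rank of 10 to them:

import numpy as np

input_dims = np.array([2, 2, 2, 567])
output_dims = np.array([2, 2, 5, 5])
sizes = input_dims * output_dims                              # [4, 4, 10, 2835]
left = np.concatenate([[1], np.cumprod(sizes)])               # [1, 4, 16, 160, 453600]
right = np.concatenate([np.cumprod(sizes[::-1])[::-1], [1]])  # [453600, 113400, 28350, 2835, 1]
max_ranks = np.minimum(left, right)                           # [1, 4, 16, 160, 1]
print(np.minimum(max_ranks, 10))                              # [1, 4, 10, 10, 1], the tt_rank suggested above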