andreaschandra opened this issue 3 years ago
import torch.nn as nn

num_vocab = 8479
emb_size = 768
hid_size = 512
num_layers = 1

TextEncoder = nn.Sequential(
    nn.Embedding(num_vocab, emb_size),
    nn.LSTM(emb_size, hid_size, num_layers=num_layers, batch_first=True)
)
The output shape is [batch_size, seq_length, hid_size]; for example, the TextEncoder output is torch.Size([2, 14, 512]). This config gives 9,135,104 trainable parameters.
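Note that nn.LSTM returns a tuple (output, (h_n, c_n)), so the Sequential's result has to be unpacked to get that shape. A minimal check with a dummy batch (the batch of 2 sequences of length 14 is just an example):

import torch

tokens = torch.randint(0, num_vocab, (2, 14))   # dummy batch: 2 sequences of length 14
output, (h_n, c_n) = TextEncoder(tokens)        # nn.LSTM returns (output, (h_n, c_n))
print(output.shape)                             # torch.Size([2, 14, 512])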
@alamhanz Are we still using Keras? The output of MobileNetV2 in Keras will be [None, 7, 7, 1280], i.e. [batch_size, img_height, img_width, channels].
My suggestions: with [None, 14, 512] as the text output, how about reshaping it into [None, 1, 14, 512]? On the image side I get [None, 4, 4, 256], which I then reshape so the output becomes [None, 1, 16, 256]. The input to multi_modal_network can then be [None, 1, 16, 512+256] --> [None, 1, 16, 768]. What do you think?
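A rough PyTorch sketch of the reshapes being proposed (this assumes the text sequence length is padded to 16 so the two branches line up along that axis before concatenating; the tensor names are illustrative):

import torch

batch = 2
text_out = torch.rand(batch, 16, 512)           # LSTM output, assuming seq length padded to 16
img_out = torch.rand(batch, 4, 4, 256)          # image branch output (channels-last, as in Keras)

text_4d = text_out.reshape(batch, 1, 16, 512)   # [None, 1, 16, 512]
img_4d = img_out.reshape(batch, 1, 16, 256)     # [None, 1, 16, 256]

fused = torch.cat([text_4d, img_4d], dim=-1)    # [None, 1, 16, 768], input to multi_modal_network
print(fused.shape)                              # torch.Size([2, 1, 16, 768])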
@andreaschandra
Anyway, trainable params: 7,467,008
I actually have two options. Option 1:
and option 2:
Or maybe, instead of [None, 1, 16, 256], you want to keep it as [None, 1, 16, 512]?
Ahh @alamhanz, I will rewrite the image model in PyTorch.
The [None, 1, 14, 512] shape would correspond to an image laid out as [batch, height, width, channel]. I first thought the image model output would be [batch, channel, height, width], so that the text has 1 channel, the height equals the sequence length, and the hidden size is the width.
@alamhanz oh no, issue 1 didn't work. Or, if you want to make the channel 1: if you permute the channel to axis 1, I can repeat it to [None, n_channel, sequence_length, hidden_size].
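A minimal illustration of that repeat idea on the text tensor (the batch size, n_channel value, and tensor names here are just assumptions for the example):

import torch

batch, seq_len, hid_size, n_channel = 2, 14, 512, 3
text_out = torch.rand(batch, seq_len, hid_size)   # LSTM output [batch, seq, hidden]

text_4d = text_out.unsqueeze(1)                   # add a channel axis -> [batch, 1, seq, hidden]
text_rep = text_4d.repeat(1, n_channel, 1, 1)     # repeat along the channel axis
print(text_rep.shape)                             # torch.Size([2, 3, 14, 512])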
@alamhanz let's move to 256 to make a smaller model; once we start underfitting, we can set the parameters higher.
"the text has 1 channel, and height would be equal to sequence length, and hidden size would be width"
Actually, in my opinion you can do something like this, but with a sequence length x hidden size layout the sequence length must equal a square number like 16 or 25. The other problem: if we create 1 channel with your setup like that, will the Conv2D in multi_modal_network make sense for text, or do you plan to use Conv1D instead? Do you get what I mean? The reason I suggested setting up the sequence length as the channel and (1, hidden size) as (height, width) was to make the Conv2D still make sense after concatenating with the reshaped images (in my opinion). What do you think?
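For reference, a small sketch of the two ways a convolution could sit on top of the text features being discussed, Conv2D on a 1-channel [seq_len, hid_size] map versus Conv1D along the sequence (the layer sizes here are illustrative, not from the thread):

import torch
import torch.nn as nn

batch, seq_len, hid_size = 2, 25, 256
text_out = torch.rand(batch, seq_len, hid_size)

# Conv2D view: treat the text as a 1-channel "image" of size seq_len x hid_size
conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
y2d = conv2d(text_out.unsqueeze(1))            # [2, 1, 25, 256] -> [2, 8, 25, 256]

# Conv1D view: treat the hidden size as channels and convolve along the sequence
conv1d = nn.Conv1d(in_channels=hid_size, out_channels=8, kernel_size=3, padding=1)
y1d = conv1d(text_out.transpose(1, 2))         # [2, 256, 25] -> [2, 8, 25]

print(y2d.shape, y1d.shape)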
I get it... I need to calculate the max sequence length first, then.
@alamhanz let's set the fixed length to 25.
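Fixing the length means every tokenized text gets padded or truncated to 25 before it reaches the encoder; a minimal helper for that (the function name and pad_id=0 are assumptions, not from the notebook):

import torch

def pad_or_truncate(token_ids, max_len=25, pad_id=0):
    # clip long sequences and right-pad short ones to the fixed length
    token_ids = token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

batch = torch.tensor([pad_or_truncate([5, 42, 7]), pad_or_truncate(list(range(1, 30)))])
print(batch.shape)  # torch.Size([2, 25])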
num_vocab = 8476
Setting up TextEncoder(num_vocab, 512, 256, 1) gives 5,128,192 trainable parameters.
TextEncoder output: torch.Size([2, 1, 25, 256])
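The updated class itself isn't pasted in the thread; given the constructor call and the [2, 1, 25, 256] output, it could look roughly like this (a sketch, assuming the channel axis is added with unsqueeze):

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, num_vocab, emb_size, hid_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(num_vocab, emb_size)
        self.lstm = nn.LSTM(emb_size, hid_size, num_layers=num_layers, batch_first=True)

    def forward(self, tokens):
        emb = self.embedding(tokens)     # [batch, seq_len, emb_size]
        out, _ = self.lstm(emb)          # [batch, seq_len, hid_size]
        return out.unsqueeze(1)          # [batch, 1, seq_len, hid_size]

encoder = TextEncoder(8476, 512, 256, 1)
tokens = torch.randint(0, 8476, (2, 25))              # 2 sequences of fixed length 25
print(encoder(tokens).shape)                          # torch.Size([2, 1, 25, 256])
print(sum(p.numel() for p in encoder.parameters()))   # 5,128,192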
The ImageEncoder would be like this:

import torch
import torch.nn as nn
from torchvision import models

mobilenet = models.mobilenet_v2()
backbone = mobilenet.features
model = nn.Sequential(
    backbone,
    nn.Conv2d(in_channels=1280, out_channels=256, kernel_size=(3, 3))
)

With input img = torch.rand(1, 3, 224, 224), the output size is torch.Size([1, 256, 5, 5]).
Trainable parameters: 5,173,248
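With the image encoder at [batch, 256, 5, 5] and the text encoder at [batch, 1, 25, 256], one way the concatenation scenario could line the two up is to flatten the 5x5 map into 25 positions (a sketch; the exact fusion used in the notebook isn't shown in the thread):

import torch

img_feat = torch.rand(2, 256, 5, 5)              # ImageEncoder output for a batch of 2
text_feat = torch.rand(2, 1, 25, 256)            # TextEncoder output

img_4d = img_feat.flatten(2).permute(0, 2, 1).unsqueeze(1)   # [2, 1, 25, 256], same layout as text
fused = torch.cat([text_feat, img_4d], dim=1)                # stacked as 2 channels: [2, 2, 25, 256]
print(fused.shape)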
Scenario Concatenation:

One version of the contrastive loss:

euclidean_distance = F.pairwise_distance(features_1, features_2)
loss_contrastive = torch.mean(batch_label_c * torch.pow(euclidean_distance, 2) +
                              (1 - batch_label_c) * torch.pow(torch.clamp(0.5 - euclidean_distance, min=0), 2))

The version from the reference (https://github.com/hadikazemi/Machine-Learning/blob/master/PyTorch/tutorial/simese_cnn.py#L137):

euclidean_distance = F.pairwise_distance(features_1, features_2)
loss_contrastive = torch.mean((1 - batch_label_c) * torch.pow(euclidean_distance, 2) +
                              batch_label_c * torch.pow(torch.clamp(2 - euclidean_distance, min=0.0), 2))

Note that the two versions use opposite label conventions and different margins (0.5 vs 2.0).
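Packaged as a reusable module following the referenced formulation (the label convention, batch_label_c = 1 for dissimilar pairs, and the default margin of 2.0 are carried over from the reference snippet, not decisions from this thread):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, margin=2.0):
        super().__init__()
        self.margin = margin

    def forward(self, features_1, features_2, batch_label_c):
        # batch_label_c: 1.0 for dissimilar pairs, 0.0 for similar pairs (reference convention)
        euclidean_distance = F.pairwise_distance(features_1, features_2)
        return torch.mean(
            (1 - batch_label_c) * torch.pow(euclidean_distance, 2)
            + batch_label_c * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2)
        )

criterion = ContrastiveLoss(margin=2.0)
loss = criterion(torch.rand(4, 128), torch.rand(4, 128), torch.tensor([0., 1., 1., 0.]))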
Total params:
Trainable params: 11,582,542
@alamhanz check C41_training.ipynb. The architecture changed dramatically and it works with batch size 64. The first batch is skipped; it's overfitting:
epoch: 1 | time: 5.2s
train loss: 0.80 | train accuracy: 59.38
val loss: 1.29 | val accuracy: 53.12
epoch: 2 | time: 2.3s
train loss: 0.77 | train accuracy: 71.88
val loss: 16.75 | val accuracy: 65.62
epoch: 3 | time: 2.3s
train loss: 1.87 | train accuracy: 62.50
val loss: 2.35 | val accuracy: 59.38
epoch: 4 | time: 2.3s
train loss: 3.87 | train accuracy: 68.75
val loss: 1.01 | val accuracy: 59.38
epoch: 5 | time: 2.4s
train loss: 0.45 | train accuracy: 71.88
val loss: 1.13 | val accuracy: 62.50
epoch: 6 | time: 2.3s
train loss: 0.56 | train accuracy: 68.75
val loss: 1.56 | val accuracy: 65.62
epoch: 7 | time: 2.3s
train loss: 0.25 | train accuracy: 81.25
val loss: 2.08 | val accuracy: 62.50
epoch: 8 | time: 2.3s
train loss: 0.14 | train accuracy: 84.38
val loss: 1.51 | val accuracy: 59.38
epoch: 9 | time: 2.4s
train loss: 0.11 | train accuracy: 84.38
val loss: 1.66 | val accuracy: 59.38
epoch: 10 | time: 2.3s
train loss: 0.09 | train accuracy: 90.62
val loss: 2.03 | val accuracy: 59.38
epoch: 40 | time: 395.8s
train loss: 0.28 | train accuracy: 86.39
val loss: 1.09 | val accuracy: 69.79
epoch: 41 | time: 393.4s
train loss: 0.29 | train accuracy: 85.51
val loss: 1.11 | val accuracy: 70.10
epoch: 42 | time: 396.4s
train loss: 0.29 | train accuracy: 86.80
val loss: 1.09 | val accuracy: 70.79
epoch: 43 | time: 396.5s
train loss: 0.29 | train accuracy: 84.71
val loss: 1.16 | val accuracy: 69.17
epoch: 44 | time: 397.7s
train loss: 0.27 | train accuracy: 86.17
val loss: 1.06 | val accuracy: 69.98
epoch: 45 | time: 392.0s
train loss: 0.27 | train accuracy: 86.84
val loss: 1.09 | val accuracy: 70.60
epoch: 46 | time: 393.7s
train loss: 0.26 | train accuracy: 87.49
val loss: 1.12 | val accuracy: 70.60
epoch: 47 | time: 393.4s
train loss: 0.26 | train accuracy: 87.40
val loss: 1.16 | val accuracy: 69.98
epoch: 48 | time: 389.3s
train loss: 0.27 | train accuracy: 88.18
val loss: 1.07 | val accuracy: 71.33
epoch: 49 | time: 398.5s
train loss: 0.26 | train accuracy: 86.50
val loss: 1.07 | val accuracy: 69.56
epoch: 50 | time: 395.3s
train loss: 0.25 | train accuracy: 88.29
val loss: 1.08 | val accuracy: 70.68
Image similarity
Multi modal