Closed by weicheng113 2 years ago
Hi Cheng,
Thanks for your question! After reading it a few times I see what you mean. I'm going to refer to the channel dimension as conv_dim since it seems to fit your naming system.
So we want to preserve the structure of ts_feature_value_dim (i.e. not just append conv_dims to it) so that each feature in this axis refers to one variable. This could be heart rate, if we are looking at the original set in F, or it could be the output of a previous pointwise layer, i.e. a new variable which has been informed by a variety of the original variables. For example, one such variable could be weighted towards indicating "lung health", while another could be primarily concerned with kidney function. These features would be represented in the Zt component of the ts_feature_value_dim.
If we were to just append each conv_channel to the ts_feature_value_dim, then we would effectively be treating each conv output channel as a new feature, so the conv part of the model wouldn't be aware that some of the outputs of the previous conv_channels are related to one another in a structured way. The problem with this is that as we stack the convolutions on top of one another, we would lose the complex temporal processing that comes from having multiple conv channels which can extract different temporal signals using the same kernel dimensions on the same variable.
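To make the grouping concrete, here is a minimal PyTorch sketch (the names F, C, T are illustrative, not the repo's actual variables). A grouped Conv1d with groups=F keeps each feature's C channels together: channels belonging to the same variable can be mixed, while different variables stay separate.

```python
import torch
import torch.nn as nn

F, C, T = 4, 3, 10            # features, conv channels per feature, timesteps
x = torch.randn(1, F * C, T)  # channels laid out as (feature, channel) blocks

# groups=F: each feature's C input channels map only to its own C output channels
grouped = nn.Conv1d(F * C, F * C, kernel_size=3, groups=F, padding=1)
y = grouped(x)
print(y.shape)  # torch.Size([1, 12, 10])
```

Appending each conv channel to ts_feature_value_dim would instead correspond to a plain (ungrouped) convolution over F*C undifferentiated features, discarding the per-variable structure.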
I don't know if that makes sense? I'm happy to go back and give some further explanation on my thoughts if it doesn't.

The difficulty with that approach (as you correctly point out!) is that we only have one value for the pointwise outputs when they are first formed. So we have repeated it to fit the conv dimension so that the sizing works for the next conv layer. The model will need to work out that there is no useful signal in the conv dimension for those features and just focus on creating useful signals from the temporal dimension. In all future layers, however, the model can combine information from a variety of previous conv channels to form the next layer of processing on that feature.

As a side note, there may have been a way to handle the newly added pointwise outputs in a special way, such that they were treated as having a channel dimension of 1 while the others had a dimension of conv_dim. I didn't go down that route, but perhaps it would have been a more "ideal" way of handling it, so that the model doesn't need to do any extra work. There may have been a better reason at the time than "it's more faff and it doesn't work neatly with the library I'm using", but if there was, I've forgotten it. In any case, I think the model should easily be able to handle a few repeated values in the conv channels.
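The repetition step can be sketched like this (a minimal illustration, assuming a single pointwise value per timestep is broadcast across an assumed conv_dim; these names are not the repo's actual API):

```python
import torch

conv_dim, T = 13, 10
point_out = torch.randn(1, 1, T)                # one pointwise value per timestep
point_rep = point_out.expand(-1, conv_dim, -1)  # repeat across the conv dimension

# Every channel is identical, so the sizing works for the next grouped conv
# layer, but there is no extra signal across this dimension for the model
# to find on these features.
print(point_rep.shape)  # torch.Size([1, 13, 10])
```

Using expand rather than repeat avoids copying memory, since the repeated values are read-only here.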
Let me know if I addressed the whole of your question? Looking at Figure 3 in my paper might also help, as it shows the flow of the dimensions through each layer. I appreciate that it is really complicated to keep track of all the dimensions, even for me as the one who designed it, so I'm impressed you followed everything through as you did.
Emma :)
Thanks a lot for the detailed explanation, Emma. I will need some time to digest your comments and also re-read the relevant part of the paper. Sometimes the code appears more concrete to me than the paper, but the paper gives a high-level understanding. Thanks.
Hi Emma,
With your comments, Figure 3, re-reading of the paper, and confirming with the code, I feel I have a good understanding of temp_pointwise now. I summarize my understanding below. The description of Figure 3 below is a bit dry (I wrote it against your Figure 3). If you have time, please help check my understanding. Any suggestions are highly appreciated. I wrote it with the LyX tool, which is easier for formulas.
Thanks, Cheng
That’s great! I’ve gone through it carefully and it is all correct; you have really understood it well. As a quick test, maybe you can tell me what happens to the mask features (the second channel in the original data which indicates how recently the measurement was taken)?
Thanks a lot for your time, Emma. I am not sure if I understand your test.
a. The mask is paired with its corresponding feature. A grouped kernel of size 2*kernel_size takes a dot product with each feature-mask pair across time_range = kernel_size (from the past up to the current timepoint t).
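Step (a) can be sketched as follows (an illustrative PyTorch snippet, assuming each of F features arrives paired with its mask as two channels; the names are not from the repo):

```python
import torch
import torch.nn as nn

F, T, kernel_size = 5, 20, 4
x = torch.randn(1, 2 * F, T)  # channels laid out as (value, mask) per feature

# groups=F: each output sees only its own feature's (value, mask) pair,
# so each output is a dot product over 2 * kernel_size weights.
conv = nn.Conv1d(2 * F, F, kernel_size=kernel_size, groups=F)
y = conv(x)
print(y.shape)  # torch.Size([1, 5, 17])
```

With no padding, each output timepoint depends only on the kernel_size timepoints up to and including it, matching the "past up till current timepoint t" description (causal behaviour would additionally require left padding).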
b. I have not looked in detail at the calculation of the mask decay fields in the preprocessing part (my classmate is working on that part). I originally thought it was an indicator of whether a field has a measurement value (or presence). But I can see there are more details about it in the paper and the corresponding preprocessing code.
If I did not understand your question correctly, please let me know. Thanks.
a. Yes, you’re right. Sorry I wasn’t clear! I meant the test was to see that you’d understood that the mask features only appear as a second channel in layer one. They disappear after that (i.e. they are not propagated forwards as skip connections, unlike the feature values themselves). This is mostly because the added pointwise features don’t have mask variables.
Thanks, Emma. My understanding is that in the first layer, the groups effect C^n is applied to feature-mask pairs. From the second layer onwards, the grouping is applied to the convolution_channels (13 = 12 conv + 1 skip connection in the 2nd layer) of each normal feature, and to the repeated values of each pointwise feature. The feature-mask pair in the first layer can be regarded as a special pair of channels for a feature.
Ok, maybe it could be incorporated into the second layer onwards (for example, 13 becomes 14 = 12 conv + 1 skip connection + 1 mask field in the 2nd layer), with each pointwise feature repeated 14 times in the 2nd layer (but, as you point out, there is no corresponding mask field there to match).
Yes exactly. You’ve understood it very thoroughly!
Thank you very much for your time and guidance, Emma. I am very lucky to have confirmation and feedback from the paper author.
Hi Emma,
I have some conceptual questions regarding the temp_pointwise implementation. I marked 3 steps in the following source code for my questions. The comments are my understanding, and there are 4 lines below extracted from your source code.
At step 3 (X_combined), my understanding of the reason for concatenating temp_skip and X_point_rep along ts_feature_value_dim is that X_point_rep contains a representation at the ts_feature_value_dim level. If so, why not do the following instead: flatten temp_skip so that it can be concatenated with point_output at the ts_feature_value_dim level?
I actually have difficulty understanding the reasoning behind repeating each point_size value (1+temp_kernals) times at step 2 (X_point_rep). The only reason I can think of is to match the dimension of temp_skip. But with the repetition, won't next_X contain (1+temp_kernals) repeated values at dim=1, which will not add information for the network? Asking about source code in text is a bit difficult; I am not sure if I state my question clearly.
Thanks in advance for your time and help, Cheng