1edv / evolution

This repository contains the code for our manuscript, 'The evolution, evolvability, and engineering of gene regulatory DNA'.

The size of convolution kernel #3

Closed · lightcone1 closed this issue 2 years ago

lightcone1 commented 2 years ago

Hi, I have a question about the model. In the function cnn_model() of 2_train_gpu_only_model.py, why is the shape of the first convolution kernel (1, 30, ...) different from the others (30, 1, ...)? I know that in TF the first dim of the filter shape is filter_height and the second is filter_width, but the output of the first conv layer has shape (1, 1, 110, 256) (batch_size, height, width, depth), and the height of the output is less than the height of the convolution kernel. I want to know whether my understanding is correct, or why it was designed like this. Thanks.
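
For example, here is a minimal shape check that reproduces what I see (the padding mode and filter counts are my assumptions based on the output shape; this is not the repo's exact code):

```python
import tensorflow as tf

# One batch of one-hot DNA input: (batch_size, height, width, channels)
x = tf.random.normal([1, 1, 110, 4])

# First kernel: (filter_height=1, filter_width=30, in_channels=4, out_channels=256)
w1 = tf.random.normal([1, 30, 4, 256])
y1 = tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME')
print(y1.shape)  # (1, 1, 110, 256)

# Later kernels: (filter_height=30, filter_width=1, ...)
w2 = tf.random.normal([30, 1, 256, 256])
y2 = tf.nn.conv2d(y1, w2, strides=[1, 1, 1, 1], padding='SAME')
print(y2.shape)  # (1, 1, 110, 256) -- output height 1 is less than kernel height 30
```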

1edv commented 2 years ago

Thank you so much for your question! Your understanding is absolutely correct.

The only reason the “convolutional model” (called the "gpu model" in the repo) was designed this way is historical. When we started this multi-year project, we first developed and experimented with multiple (unconventional) model architectures. The “convolutional model” was a very early model with excellent prediction performance that was used to design some of our validation experiments in Figure 2 (you will notice that it uses the old tf.Session() API).

The atypical kernel shapes, as you correctly point out, follow from the input to the “convolutional model”. In our early explorations, we were considering 'layering' additional information onto each position in the input, which is why each batch of input data has a singleton dimension (shape: [batch_size, 1, 110, 4]). We also experimented with different ways of combining the forward and reverse strands, all of which changed the kernel shapes over the course of our exploration. The “convolutional model” described in this study is simply the 'snapshot' of the architecture that we happened to be using when we designed some of our early validation experiments. So unfortunately, there is no deeper 'logic' behind the kernel shapes in the “convolutional model” architecture, except that it happened to perform well on the expression prediction task.

We still needed to describe this (inefficient) “convolutional model” in our manuscript to accurately communicate, for posterity, the original model used to design the validation experiments noted above; we could not change its architecture (to make it more efficient, for instance) after those experiments had concluded. Even though the “convolutional model” predicts expression well (Fig. 1a, Extended Data Fig. 1, etc.), if one were to implement a convolutional model for this task again, we would recommend designing it differently.
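
For illustration only (this is a hypothetical sketch, not the original training code, and the one_hot helper is made up for this example), the singleton dimension simply reflects how each one-hot sequence was shaped before batching:

```python
import numpy as np

# Hypothetical helper for this example: one-hot encode a 110-bp sequence
def one_hot(seq, alphabet='ACGT'):
    idx = {c: i for i, c in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for i, c in enumerate(seq):
        out[i, idx[c]] = 1.0
    return out  # shape: (110, 4)

x = one_hot('A' * 110)      # (110, 4)
x = x[np.newaxis, ...]      # (1, 110, 4) -- the singleton 'height' left room to
                            # 'layer' extra per-position information later
batch = np.stack([x] * 32)  # (32, 1, 110, 4): the "convolutional model" input shape
print(batch.shape)
```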

In the process of developing an optimal model architecture for this task by iterating and experimenting, we made many changes and additions to the architecture, which led to the eventual “transformer model” (referred to as the "tpu model" in the repo). The “transformer model” was used for all of the analyses in the study, including the re-analysis of the Figure 2 sections where the “convolutional model” was originally used to design experiments (now part of Extended Data Figure 3 of the final published manuscript; the findings from these sections were also validated by experiments, as noted in the manuscript).

Fortunately, we were able to remove the singleton dimension from the “transformer model's” input and simplify other aspects of the code and architecture (the kernels here have the canonical shapes one would expect), significantly reducing the number of parameters. Thus, even though the “convolutional model” and “transformer model” have comparable prediction performance, as shown in our manuscript, we recommend that future readers use the “transformer model” for their work. The transformer model will also be more readily compatible with future versions of TensorFlow, since it does not require the tf.Session() API.
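
As a rough sketch of what we mean (the layer sizes here are placeholders, not the actual "transformer model" code), dropping the singleton dimension lets one use canonical 1-D convolutions in modern Keras, with no tf.Session():

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input is simply (sequence_length, channels); no singleton dimension
inputs = tf.keras.Input(shape=(110, 4))
x = layers.Conv1D(filters=256, kernel_size=30, padding='same',
                  activation='relu')(inputs)  # canonical kernel shape
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1)(x)                  # predicted expression
model = tf.keras.Model(inputs, outputs)
model.summary()
```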

Thank you for bringing this to our attention. We will improve the documentation for the “convolutional model” and point readers to the “transformer model” directly in the code as well.