Hi, thank you for sharing your awesome work. I have difficulty understanding the intuition behind the use of the rc_Conv1D block in the attention-based model, since it is new to me. If I understand correctly, you are trying to mimic the double-stranded structure of DNA, with a forward and a reverse strand, right? I have tried it on a dataset of protein sequence to function (just modifying the function that generates the one-hot vectors), and the architecture also works very well (figure: validation-set performance during training).
So my question is: what exactly does this rc_Conv1D block do in cases other than DNA sequences? Thank you. Ai
Hi Ai
The idea of that function is to incorporate the reverse-complement DNA strand, since strand orientation is arbitrary and transcription factors bind in either orientation. For a variety of reasons, it is more efficient to reverse-complement the convolutional filters rather than the input DNA sequences, so that layer uses reverse-complemented filters to do a 1D convolution. The reverse complement works by flipping the filters along both the length and base axes. (The bases have to be in a specific order so that the flip actually represents complementation: ACGT works, since TGCA is its complement, but ATGC does not, since CGTA is not the complement of ATGC.)
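For intuition, here is a minimal sketch of what such a layer might look like (a hypothetical RCConv1D written in TensorFlow/Keras for illustration; it assumes one-hot DNA input with channels ordered A, C, G, T, and it is not the repo's actual implementation):

```python
import tensorflow as tf

class RCConv1D(tf.keras.layers.Layer):
    """Sketch of a reverse-complement-aware Conv1D (hypothetical).

    Assumes one-hot DNA input of shape (batch, length, 4) with the
    channel order A, C, G, T, so flipping the channel axis maps each
    base to its complement (A<->T, C<->G).
    """
    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.kernel_size = kernel_size

    def build(self, input_shape):
        in_channels = int(input_shape[-1])
        self.kernel = self.add_weight(
            name="kernel",
            shape=(self.kernel_size, in_channels, self.filters),
            initializer="glorot_uniform",
        )

    def call(self, x):
        # Scan the forward strand with the learned filters.
        fwd = tf.nn.conv1d(x, self.kernel, stride=1, padding="SAME")
        # Reverse-complement the *filters* instead of the input:
        # flip along the length axis (0) and the base axis (1).
        rc_kernel = tf.reverse(self.kernel, axis=[0, 1])
        rev = tf.nn.conv1d(x, rc_kernel, stride=1, padding="SAME")
        # Stack both strand views along the channel axis.
        return tf.concat([fwd, rev], axis=-1)
```

Concatenating the two strand views doubles the output channels; other ways of combining them (e.g., taking a max over strands) are also used in the literature.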
Are your protein sequences a one-hot encoding of amino acids, e.g., a 20×L matrix (L = sequence length; 20 channels, one per amino acid)?
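(For concreteness, a minimal sketch of that kind of encoding, assuming the 20 canonical amino acids in one-letter alphabetical order; the names here are illustrative:)

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids, one-letter codes
AA_TO_IDX = {aa: i for i, aa in enumerate(AAS)}

def one_hot_protein(seq):
    """Encode a protein sequence as a 20 x L one-hot matrix."""
    mat = np.zeros((len(AAS), len(seq)))
    for j, aa in enumerate(seq):
        mat[AA_TO_IDX[aa], j] = 1.0
    return mat

print(one_hot_protein("MKT").shape)  # (20, 3)
```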
If so, I cannot explain why the rc_Conv1D would help as it really doesn't make sense for that application. There is no logical meaning to a reverse complement of a protein sequence. Have you tried the same model, but without the rc_Conv1d? If not, perhaps it is the rest of the model that is helping.
Thank you for your considerate response. And yes, you're right: if I replace it with a normal Conv1D, it works just as well. Anyway, thank you for sharing the architecture; it learns very robustly.
Ai
Thanks so much for your question @vuhongai! I am just adding a couple points to @Carldeboer's excellent answer:
For the regulatory sequence->function model in the paper, the rc_Conv1d block was originally included because it was often present in prior work on related problems. 'Towards a Better Understanding of Reverse-Complement Equivariance for Deep Learning Models in Regulatory Genomics' does a great job of conveying the intuition behind why one might want such reverse-complement-aware models, and it thoroughly benchmarks various existing strategies (this Twitter thread summarizes the paper's findings). Thus, the primary reason rc_Conv1d is used in our model architecture is historical: the model performs well without rc_Conv1d, even for the regulatory sequence->function prediction task described in the paper.
It is great to hear that the model architecture learns protein sequence->function robustly! I agree with @Carldeboer's suggestion that it is likely the rest of the model that is helping. (If there are more details available for the protein sequence->function task, like the function label (e.g., GFP expression) and the length of the input protein sequence (e.g., ~238 AAs), we can try to think more deeply about why the architecture is able to learn robustly for your specific application.) I found 'Learning protein fitness models from evolutionary and assay-labeled data' to be a great starting point for building intuition on the protein sequence->function task. It wonderfully demonstrates how simple linear regression models on site-specific amino acid features often perform quite well on this task. I quote a paragraph I liked from their discussion below:
One may initially be puzzled as to how linear models with only site-specific features can generalize at all to mutations not seen at train time, let alone surpass nonlinear models in this task. However, as detailed in the Methods, such generalization can emerge from a particular form of l2-regularized linear regression with one-hot encoded features. In effect, through the regularization, the model learns about the importance of each position, even though each amino acid at each position has its own parameter. Thus, if the effects of different mutations at the same position are in the same direction, the regularized linear models can do a reasonable job of generalizing in such a manner.
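To make the quoted idea concrete, here is a minimal sketch of such a model, using toy random data and scikit-learn's Ridge as the l2-regularized linear regression (everything here is illustrative, not the paper's code):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in data: N variants of a length-L protein, integer-encoded
# over the 20 amino acids, each with a (random, placeholder) fitness label.
rng = np.random.default_rng(0)
N, L = 200, 10
seqs = rng.integers(0, 20, size=(N, L))
fitness = rng.normal(size=N)

def one_hot_flat(seqs, n_states=20):
    # Each sequence becomes an (L * 20)-dim vector of site-specific features.
    return np.eye(n_states)[seqs].reshape(len(seqs), -1)

X = one_hot_flat(seqs)
# Ridge regression = l2-regularized linear regression; with one-hot
# site-specific features, the penalty is what lets the model generalize
# across different mutations at the same position.
model = Ridge(alpha=1.0).fit(X, fitness)
print(model.predict(X[:5]))
```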
Best, Eeshit
Thank you for the clarification and the suggestions; I will look into them. Best, Ai