Celestial-Bai / INHERIT


Status of linear layer in pre-trained models #2

Closed gpetho closed 1 year ago

gpetho commented 1 year ago

Dear colleagues, This is really a question rather than an actual issue as far as I can see, but I would very much appreciate it if you could help me with it. (The question is more fundamentally related to DNABERT, I think, since the pretrained models that can be downloaded from the DNABERT repository have the same structure.)

Could you please clarify the status of the linear layer in your two pre-trained models (i.e. in bac_pretrained_model and pha_pretrained_model, without loading the state dict from INHERIT.pt)? The two pre-trained models you are using in INHERIT look like this:

================================================================================
Layer (type:depth-idx)                                  Param #
================================================================================
BertForSequenceClassification                           --
├─BertModel: 1-1                                        --
│    └─BertEmbeddings: 2-1                              --
│    │    └─Embedding: 3-1                              (12,003,840)
│    │    └─Embedding: 3-2                              (393,216)
│    │    └─Embedding: 3-3                              (1,536)
│    │    └─LayerNorm: 3-4                              (1,536)
│    │    └─Dropout: 3-5                                --
│    └─BertEncoder: 2-2                                 --
│    │    └─ModuleList: 3-6                             (85,054,464)
│    └─BertPooler: 2-3                                  --
│    │    └─Linear: 3-7                                 (590,592)
│    │    └─Tanh: 3-8                                   --
├─Dropout: 1-2                                          --
├─Linear: 1-3                                           (1,538)
================================================================================
Total params: 98,046,722
Trainable params: 0
Non-trainable params: 98,046,722
================================================================================

The linear layer at the top of the network (i.e. the bottom of this chart) is a 768 by 2 matrix, and the whole pre-trained model is a BertForSequenceClassification. The structure is exactly the same in DNABERT's pretrained models, except that the vocabulary there is different (no N bases, as explained in your appendix). However, as you explained in your paper, what was in fact pretrained was not a sequence classification model but rather a masked language model (i.e. BertForMaskedLM), which should have a BertOnlyMLMHead layer on top.

Is my understanding correct that you removed the BertOnlyMLMHead layer, retained the pretrained BertModel structure, and added a linear layer on top which has not been pretrained, but is rather just randomly initialized? If this is the case, where in the code does this replacement of the masked model head by the linear layer happen?

It would seem that the linear layer is not pretrained: I tried using your bac and pha pretrained models on their own, separately, feeding the output of their linear layers through a softmax to classify a sample phage genome, and neither model seems to be able to classify the sequence; both return probabilities of around 0.5 for all of its segments.
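For reference, the check I did was roughly along these lines (just a sketch: bacbert here is the BertForSequenceClassification loaded from bac_pretrained_model, and input_ids/attention_mask are assumed to come from the DNABERT k-mer tokenizer for a single 500 bp segment):

import torch

bacbert.eval()
with torch.no_grad():
    logits = bacbert(input_ids=input_ids, attention_mask=attention_mask)[0]
    probs = torch.softmax(logits, dim=-1)
print(probs)  # both classes come out close to 0.5 for every segment I tried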

An unrelated question that came to my mind while examining your solution: is there any particular motivation for using these linear layers with just 2 output values in your combined model? They seem to represent a very narrow information bottleneck. Concatenating the 768-dimensional outputs of the two BERT models directly and putting a 1536 by 1 dense layer with sigmoid activation on top would seem like a more straightforward solution than first drastically reducing the dimensionality of each output from 768 to 2, concatenating the two 2-dimensional output vectors, and then adding a 4 by 1 dense layer on top.

Celestial-Bai commented 1 year ago

Hi gpetho,

Thank you so much for asking; I'm happy to answer this question. As you can see, a DNABERT model (hereinafter referred to as BERT) can be divided into the following parts: the embedding layer, the encoder layers, and the task-specific layers. In BertForMaskedLM, the task-specific layer is only for MLM. When Huggingface Transformers reads this pre-trained model, it only transfers the weights of the embedding layer and the encoder layers; the task-specific layers (BertPooler and the classifier) are initialized randomly.

I actually had the same concern as your second question while doing the experiments, but there are two reasons for this design. From the actual results, truth be told, classifying directly with the representations of the two [CLS] tokens did not achieve very good results and was sensitive to the learning rate. I did try it, but the results were not really good, so I gave it up. On the other hand, BertForSequenceClassification has a BertPooler layer, which contains a Linear layer and a Tanh() activation function before the linear classifier, so it is not the same as using a linear layer directly. Based on this, we use this structure. I hope this helps.

DNABERT is actually an extended version of the BertModel from Huggingface Transformers. In the future I plan to optimize the code so that we can call the latest version of Huggingface Transformers directly, to make INHERIT easier to use.
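To illustrate the split between the shared layers and the task-specific heads mentioned above (this sketch uses the stock Huggingface classes with a default config, so the exact module names may differ slightly from the DNABERT-extended version):

from transformers import BertConfig, BertForMaskedLM, BertForSequenceClassification

config = BertConfig(num_labels=2)  # illustrative default config, not the DNABERT one
mlm_model = BertForMaskedLM(config)
cls_model = BertForSequenceClassification(config)

# Both models contain the same BertModel (embeddings + encoder); only the heads differ.
print([name for name, _ in mlm_model.named_children()])  # e.g. ['bert', 'cls']  (BertOnlyMLMHead)
print([name for name, _ in cls_model.named_children()])  # e.g. ['bert', 'dropout', 'classifier']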

gpetho commented 1 year ago

Thank you very much for your fast and informative reply.

When Huggingface Transformers reads this pre-trained model, it only transfers the weights of the embedding layer and the encoder layer. For task-specific layers (BertPooler and Classifier), they are initialized randomly.

I see, thank you for clarifying this. So the saved pretrained model only contains the configuration and weights of the BertModel, and when I run something like this, a randomly initialized Linear layer is added on top:

from transformers import BertForSequenceClassification  # or the DNABERT-bundled transformers package

bacbert = BertForSequenceClassification.from_pretrained(bac_bert_dir)
phabert = BertForSequenceClassification.from_pretrained(pha_bert_dir)

Actually, I have checked now, and the BertPooler layers (or rather the weights inside the dense layer within BertPooler) are in fact saved and loaded along with the embedding and encoder layer. It is only the Linear layer on the very top of the BertForSequenceClassification model that is randomly initialized when the pretrained model is loaded.
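A check along these lines shows it (assuming the usual Huggingface layout with a pytorch_model.bin inside the model directory; bac_bert_dir is the same path as above):

import os
import torch

state_dict = torch.load(os.path.join(bac_bert_dir, "pytorch_model.bin"), map_location="cpu")
print(any("pooler" in key for key in state_dict))      # True: the BertPooler weights are saved
print(any("classifier" in key for key in state_dict))  # False: the top Linear layer is not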

On the other hand, BertForSequenceClassification will have a BertPooler layer that contains a Linear layer and a Tanh( ) activation function, and then go through the linear classifier. Therefore, it is not the same as using a linear layer directly.

Sorry, I can't really follow how this is relevant to the information bottleneck issue. What I was trying to ask was why you are using a 4-dimensional vector as the input to your final classifier layer rather than the 1536-dimensional encoded representation from the BERT models. I believe it is relatively unimportant whether you use the BertEncoder's output for the [CLS] token, which is a 768-dimensional vector, or whether you use the output of the additional dense layer on top of it, i.e. the output of the BertPooler layer of the same dimensionality. This shouldn't matter much, since you will be concatenating the output vectors of the two BERT models and putting a dense classifier layer on top of the concatenated vector anyway. (In other words, the BertPooler layer isn't doing anything particularly interesting in terms of getting the correct classification, it's just a simple dense layer.)
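Just to be explicit about the two candidate representations I mean (using standard Huggingface attribute names; outputs is assumed to be the result of a BertModel forward pass with return_dict=True):

cls_token_repr = outputs.last_hidden_state[:, 0, :]  # raw [CLS] vector straight from the BertEncoder
pooled_repr = outputs.pooler_output                  # the same vector after BertPooler's Linear + Tanh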

My actual concern is only about the dimensionality of the vector that the classification decision is based on. What your architecture basically does is have the two BERT models each output a classification decision for the input sequence, and then use the small final regressor layer to weight the two decisions. Of course this description is not quite accurate: the two models don't output a single feature each (e.g. 1 for bac and 0 or -1, depending on the activation function, for pha) but rather two features each, so this is slightly more information than would be strictly necessary for a binary classification decision, but not much more. On the other hand, if you used the encoding of the whole input sequence as the input to the final joint classifier layer instead (regardless of whether that encoding comes from the BertEncoder or from the BertPooler), that would be a lot more information to work with. I mean roughly like this:

class Baseline_IHT(torch.nn.Module):
    def __init__(self, ...):
        ...
        # single dense layer over the two concatenated 768-dim pooler outputs
        self.regressor = torch.nn.Linear(1536, 1)

    def forward(self, input_ids, token_type_ids, attention_mask):
        # encode the same input with both the bacterium- and the phage-pretrained BERT
        bac_bert_output = self.bacbert.bert(input_ids=input_ids, token_type_ids=token_type_ids,
                                            attention_mask=attention_mask, return_dict=True)
        pha_bert_output = self.phabert.bert(input_ids=input_ids, token_type_ids=token_type_ids,
                                            attention_mask=attention_mask, return_dict=True)
        # concatenate the two 768-dim pooler outputs into a 1536-dim vector and classify
        combined_output = torch.cat((bac_bert_output.pooler_output, pha_bert_output.pooler_output), dim=1)
        out = self.regressor(combined_output)
        ...

In fact, I would add one or two extra dense layers before the final classifier layer myself, e.g. a 1536 by 256 layer with relu or something like that.
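Something like this, just to illustrate (the 256 is arbitrary):

import torch

deeper_head = torch.nn.Sequential(
    torch.nn.Linear(1536, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
    torch.nn.Sigmoid(),
)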

But I understand that you have experimented with similar arrangements, and found that the prediction results were not very satisfactory. I find this very puzzling. I understand that you have fine-tuned the whole composite model including the pretrained models, but have you by any chance tried freezing the weights of the two BERT models and training just the final linear layers and the regressor layer?
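By freezing I mean roughly this (attribute names as in the sketch above; the optimizer and learning rate are just placeholders):

import torch

# model is the combined two-BERT module from the sketch above
for p in model.bacbert.parameters():
    p.requires_grad = False
for p in model.phabert.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)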

Come to think of it, I was wondering about two more things related to the pretraining of the two models. I'm sorry for abusing this issue to ask these questions, but it's easier this way. When pretraining on the training sequences, were they simply passed to the model as-is, or did you manipulate them in various ways, for example reversing them or selecting segment windows randomly from a longer sequence? So if you had a 5 kb long training sequence, did you simply split it into 10 segments in a linear way (first segment bases 1 to 500, second segment bases 501 to 1000, etc.), or did you do things like reversing the sequence string and training on that as well, or selecting shifted windows (e.g. bases 1 to 500, then 101 to 600), or perhaps selecting random segment windows (e.g. bases 1358 to 1857)? This would be a very straightforward way of increasing the amount of training data for the phage sequences in particular, since you mentioned the problem of relatively sparse data in that context in the paper. Sorry if you have covered these questions in the paper; I can't recall.
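To make the window options concrete, I mean something like this (segment size and stride are just examples):

import random

def linear_segments(seq, size=500):
    # non-overlapping windows: bases 1-500, 501-1000, ...
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def shifted_segments(seq, size=500, stride=100):
    # overlapping windows shifted by `stride`: bases 1-500, 101-600, ...
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]

def random_segment(seq, size=500):
    # one randomly placed window, e.g. bases 1358-1857
    start = random.randint(0, len(seq) - size)
    return seq[start:start + size]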

Celestial-Bai commented 1 year ago

We have tried all the things you mentioned, including freezing the two BERT models; this is called linear probing. Linear probing does not perform as well as fine-tuning in most cases, and we got similar experimental results. INHERIT is a huge model and each training run has a large computational cost, so we did not use the sliding-window strategy, but used the same strategies as previous tools like Seeker and DeepVirFinder.

gpetho commented 1 year ago

Thank you very much for this information as well.