Freshwind-Bioinformatics / TABR-BERT

TABR-BERT: an Accurate and Robust BERT-based Transfer Learning Model for TCR-pMHC Interaction Prediction

Doubts about pre-training the pMHC BERT model #2

Closed xuanwuji closed 3 months ago

xuanwuji commented 4 months ago

Hi! Thank you for your exciting work! I had some doubts when I looked at the training data of the pMHC pre-trained model. I noticed that you use both BA and EL labels for a regression task (NSP). Why not just use EL's 0/1 labels for a classification task? What do you think about this point of view? Can you answer my doubts? Thanks a lot!

JiaweiZhang1997 commented 3 months ago

Sorry for not seeing your question in time!

Do you mean: why am I using EL and BA data instead of just EL data? Because more data gives more prior information to our embedding model, and some previous models (MHCflurry, NetMHCpan) provide a way to normalize the labels of BA data to 0-1 (EL = 1 - log_50000(BA)).

If that doesn't answer your question, feel free to ask! ^.^
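For readers following along, the normalization mentioned above (1 - log_50000(BA), the MHCflurry/NetMHCpan convention) can be sketched in a few lines. This is a minimal illustration, not code from the TABR-BERT repository; the function name and the clamping range are assumptions.

```python
import math

def ba_to_el_like(ba_nm, max_ba=50000.0):
    """Map an IC50 binding-affinity value (in nM) onto [0, 1] via
    1 - log_50000(BA). Low IC50 (strong binder) -> score near 1;
    IC50 at the 50000 nM cap (weak/non-binder) -> score 0.
    The input is clamped to [1, max_ba] so the score stays in [0, 1]."""
    ba = min(max(ba_nm, 1.0), max_ba)
    return 1.0 - math.log(ba) / math.log(max_ba)

print(ba_to_el_like(1.0))      # -> 1.0 (strongest binder on this scale)
print(ba_to_el_like(50000.0))  # -> 0.0 (weakest)
print(ba_to_el_like(500.0))    # somewhere in between
```

After this transform, BA-derived labels live on the same 0-1 scale as EL hit/decoy labels, which is what lets both data sources feed one regression target.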

xuanwuji commented 3 months ago

Thanks for the answer! I think I understand the motivation of supplementing the data volume from your explanation, but I am curious that there are a large number of binary labels (0, 1) and only a small proportion of continuous labels (1 - log_50000(BA)). In my view, binary labels (0, 1) are more suitable for classification tasks, and continuous labels (1 - log_50000(BA)) are more suitable for regression tasks. I understand you have to use MSELoss if your data contains continuous labels. So is the reason you used MSELoss simply that continuous labels were added to the training data? In other words, have you ever tried using a binary loss such as BCELoss/CrossEntropyLoss for the NSP task, like the original BERT?

By the way, can you speak Chinese? You can answer in Chinese for convenience if you like. ∩ˍ∩

My knowledge here is shallow; I hope you can enlighten me!

JiaweiZhang1997 commented 3 months ago

"So is the reason you used MSELoss simply that continuous labels were added to the training data?" - Yes, MSELoss is used because the labels in the training data are continuous.

"In other words, have you ever tried using a binary loss (such as BCELoss/CrossEntropyLoss) for the NSP task, like the original BERT?" - There may be a historical reason here. In our previous research (on the pMHC binding prediction problem), we already used both BA and EL data for training (on pMHC binding prediction, using the two kinds of data together often gives better results). So in TABR-BERT we simply carried this idea over and did not train with EL data alone.

- In addition, EL data has far more negative examples (label 0) than BA data; the ratio of positives to negatives in BA data is more reasonable and better suited for training. Perhaps training the pMHC embedding model on EL data alone would work even better, but we have not experimented much in that direction. -。-
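To make the loss-choice discussion concrete: once BA labels are mapped to 0-1, a single MSE objective covers both the binary EL targets and the continuous BA-derived targets in one batch. The sketch below is a hypothetical illustration with made-up numbers, not the actual TABR-BERT training code.

```python
def mse_loss(preds, targets):
    """Mean squared error over a batch. The same formula applies whether
    a target is a binary EL label (0/1) or a continuous BA-derived
    label (1 - log_50000(BA)), which is why MSE suits the mixed data."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical mixed batch: two EL examples (binary targets)
# and one BA example (continuous target).
targets = [1.0, 0.0, 0.64]
preds = [0.9, 0.2, 0.5]
print(mse_loss(preds, targets))  # one scalar loss for the whole batch
```

A BCE-style loss would also accept soft targets mathematically, but as noted above, the MSE-on-mixed-data setup was carried over from the authors' earlier pMHC binding-prediction work rather than chosen by ablation.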

xuanwuji commented 3 months ago

Thank you very much for your explanation! I benefited a lot!