TencentAI4S / tfold

open source code for Tencent tFold

Could you please help me with boosting the performance? #9

Open bzhousd opened 1 month ago

bzhousd commented 1 month ago

Hi tfold team,

I applied tFold-Ag to the AbDB dataset. It performs well on some proteins, but I had difficulty predicting others. Upon inspecting the predicted structures, I noticed that the distance between two residues in the antigen chain is too large. Does tFold-Ag offer a setting similar to the number of recycling steps in AlphaFold-Multimer? That would let me run additional recycling steps and refine the structures further.

Another potential issue in my predictions is related to the multiple sequence alignment (MSA). Since MMseqs2 is too slow, I opted to use the ColabFold MSA server with its default settings to generate an a3m file. However, I’m uncertain whether the ColabFold MSA process matches the options used by MMseqs2 in tFold.

Thanks

wufandi commented 1 month ago

Hi, Thank you for your interest in our work.

'Upon inspecting the predicted structures, I noticed that the distance between two residues in the antigen chain is too large.' We rarely saw this in our previous tests. I think it might be caused by too few homologous sequences in the antigen's MSA, which happens for some antigens such as certain viruses. You could search a larger database when building the MSA, which might improve the model's performance.

Indeed, we are internally training a model with a similar number of recycling steps as in AlphaFold-Multimer, but the model is not yet fully trained, so we don't have more information to provide.

You can use the ColabFold Server. We have not made any specific customizations to the search options on mmseqs.

bzhousd commented 1 month ago

Thank you for your response. We have identified approximately 100 proteins from AbDB that AlphaFold2 predicts well (with an average DockQ score of 0.98). However, I have not obtained similar results using tFold. For instance, consider the following example: the distance between the last two amino acids in the antigen is quite large, measuring 7.2 Å. Additionally, the location of the interface appears to be incorrect.
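A gap like the 7.2 Å one described above can be located automatically. The sketch below is not part of tFold; it assumes the distance is measured between Cα atoms of consecutive residues, that the prediction is a standard fixed-column PDB file, and that 4.2 Å is a reasonable cutoff (consecutive Cα atoms joined by a trans peptide bond are normally about 3.8 Å apart). The function names are hypothetical.

```python
import math

def parse_ca_atoms(pdb_text, chain_id):
    """Return [(residue_number, (x, y, z)), ...] for the CA atoms of one chain."""
    atoms = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name in columns 13-16, chain ID in
        # column 22, residue number in columns 23-26, x/y/z in columns 31-54.
        if len(line) < 54 or not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        res_num = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((res_num, xyz))
    return atoms

def find_chain_breaks(ca_atoms, max_dist=4.2):
    """List (res_a, res_b, distance) for neighbouring CA atoms farther apart
    than max_dist. The 4.2 A default is an assumed cutoff, not a tFold value."""
    breaks = []
    for (res_a, xyz_a), (res_b, xyz_b) in zip(ca_atoms, ca_atoms[1:]):
        dist = math.dist(xyz_a, xyz_b)
        if dist > max_dist:
            breaks.append((res_a, res_b, round(dist, 2)))
    return breaks
```

Calling `find_chain_breaks(parse_ca_atoms(open(path).read(), "A"))` on a predicted structure would list every pair of neighbouring antigen residues that is stretched apart. The same check flags residues that are missing from an experimental structure, since the Cα atoms on either side of a gap are also far apart.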

To investigate further, I examined the relationship between the number of sequences in the multiple sequence alignment (MSA) and the DockQ score (as shown in the figure below). The quality of the interface is compromised when the MSA search fails to find sufficient homologous sequences. In ColabFold, the sequence database used is UniRef100. While this is a substantial database, we also have a local ColabFold MSA server installed, but I don't know how to use it to get more sequences into the MSA.

In case I didn't apply the model correctly, I wonder if you could run the model on your side for some AbDB proteins ('3H3P','4WHT','6BZY','4ZTO','4GAJ','4HZL','2B1H','6D0X','4G6A','6DB7','5EA0','3SKJ','4HS6','3MLW','4XGZ','4GAG','4XAW','1EJO','5U3O','4TUK','4XCF','1P4B','4TQE','3FFD','6B5M','6FY3','5WNA','4ZFO','4HPO','4XH2','5CIL','4YDV','2AP2','6AXK','1UWX','4N8C','5CIN','5U3L','1MVU','2H1P','3H0T','4HPY','4P3C','4N0Y').
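For anyone who wants to reproduce this kind of MSA-depth comparison, the following is a minimal sketch for counting the sequences in an a3m file. It assumes the first entry is the ungapped query, that lowercase letters mark insertions relative to the query (the usual a3m convention), and that lines starting with `#` can be skipped as comments. The 50% coverage cutoff and the function names are illustrative choices, not settings taken from tFold or ColabFold.

```python
def read_a3m(a3m_text):
    """Return [(header, sequence), ...] from a3m-formatted text."""
    entries = []
    header, chunks = None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines (assumed layout)
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

def msa_depth(a3m_text, min_coverage=0.5):
    """Return (total sequences, sequences covering >= min_coverage of the query)."""
    entries = read_a3m(a3m_text)
    if not entries or not entries[0][1]:
        return 0, 0
    query_len = len(entries[0][1])  # first entry is assumed to be the query
    covered = 0
    for _, seq in entries:
        # Lowercase letters are insertions; only uppercase letters and '-'
        # occupy query columns.
        aligned = [c for c in seq if not c.islower()]
        n_residues = sum(1 for c in aligned if c != "-")
        if n_residues / query_len >= min_coverage:
            covered += 1
    return len(entries), covered
```

Running `msa_depth` on each antigen's a3m file and pairing the result with the DockQ score of the matching prediction gives the data for a depth-versus-quality plot. Counting only sequences above a coverage cutoff avoids giving credit to short fragments that align to a small part of the antigen.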

Thanks

[Two attached images: the example predicted structure and the plot of MSA size against DockQ score]

wufandi commented 1 month ago

Thank you for your thorough research.

Firstly, comparing the performance of AlphaFold-Multimer and tFold on AbDB is not entirely fair, as this dataset is included in the training set of AlphaFold-Multimer and exists in the template library (of course, this is similar for tFold, but tFold does not use templates).

Besides, I looked at the cases you provided. These antigens are mostly peptides, which often do not have enough homologous sequences to build a useful MSA. In addition, tFold-Ag was not trained on peptide-antibody complexes: during the data preprocessing stage, sequences shorter than 40 residues were removed. This is another reason why tFold does not perform well on this part of the test set.

We removed this part of the data because we believe that the pre-trained structure model used to extract antigen monomer features (in tFold-Ag, this is AF2) cannot extract peptide features well. The model tends to memorize the structures of peptides rather than learn the mapping from sequence to structure. We will include peptide data for training in future versions.
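Given the length cutoff mentioned above, it can help to flag short antigen chains before running a prediction. This is a minimal sketch, assuming the input is a multi-record FASTA text and that the cutoff of 40 counts residues; the function names are hypothetical and not part of the tFold code.

```python
MIN_TRAIN_LEN = 40  # cutoff quoted in this thread; assumed to count residues

def read_fasta(fasta_text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-separated token of the header as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def flag_short_chains(fasta_text, min_len=MIN_TRAIN_LEN):
    """Return {name: length} for chains shorter than min_len."""
    return {
        name: len(seq)
        for name, seq in read_fasta(fasta_text).items()
        if len(seq) < min_len
    }
```

Any chain returned by `flag_short_chains` falls outside the length range the model was trained on, so poor predictions for those targets are expected rather than a sign of a setup mistake.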

wufandi commented 1 month ago

Regarding the construction of sequence databases and MSAs, for user convenience, both the code we provide and our server (https://drug.ai.tencent.com/en) use the fastest tool currently available, mmseqs2, to search UniRef and ColabDB. However, for some antigens, especially viruses (such as SARS-CoV-2 RBD), this will miss many MSAs. I suggest you use the same database and homologous sequence search tool as AlphaFold2 to construct MSAs.

bzhousd commented 1 month ago

Thank you for the prompt reply. I have approximately 200 in-house antibody-antigen complexes, but after applying tFold-Ag to them I didn’t obtain correct results. Unfortunately, I cannot share the in-house data due to confidentiality reasons, so I turned to AbDB to find comparable examples to discuss.

Upon reviewing the multiple sequence alignment (MSA) of our in-house data, I noticed that each MSA contains more than 100 sequences. However, one significant issue with our data is the presence of missing residues. This could potentially contribute to low DockQ scores.

wufandi commented 1 month ago

Sorry to hear that. We will continue to improve the performance of our model, and a new model is already in training.