1. From what I understood from the Nat. Biotech paper, the algorithm combines co-evolutionary information with sparse XL-MS data embedded in the pair representation (which in turn updates the MSA representation, biasing the retrieval of co-evolutionary information from the MSA). No template structures from the PDB are used to train the neural network, unlike AlphaFold, which can also take PDB structures as input. Is that correct?
Correct for AlphaLink (Nat. Biotech, monomer); for AlphaLink2 (this repository, multimer) we used templates.
Where can I find the templates used for MSA when the templates option is checked?
You can find the templates in the prediction folder (ls prediction). For example:
prediction/alphalink_colab_17cb8/pdb70.m8 contains the template hits.
prediction/alphalink_colab_17cb8/templates_101 contains the corresponding .cif files.
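To look at them directly from the notebook, you can add a code cell, for example (the alphalink_colab_17cb8 folder name comes from one particular run; yours will differ):

! head prediction/alphalink_colab_17cb8/pdb70.m8
! ls prediction/alphalink_colab_17cb8/templates_101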
What is the default Neff value it uses for the MSA when templates are checked?
We don't subsample the MSAs here, so the Neff is unchanged and corresponds to whatever MMseqs2 gives us.
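For reference, Neff is usually estimated by clustering the alignment at a sequence-identity cutoff (commonly 62% or 80%) and summing the inverse cluster sizes. A minimal sketch (my own illustration, not AlphaLink2 code) that estimates Neff from an A3M file with an 80% identity cutoff:

import numpy as np

def read_a3m(path):
    # Return aligned sequences from an A3M file, dropping lowercase insertion columns.
    seqs, current = [], []
    for line in open(path):
        line = line.strip()
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
            current = []
        elif line:
            current.append("".join(c for c in line if not c.islower()))
    if current:
        seqs.append("".join(current))
    return seqs

def neff(seqs, identity_cutoff=0.8):
    # Neff = sum over sequences of 1 / (cluster size at the identity cutoff).
    # Gap-gap matches count as identical here, a common simplification.
    msa = np.array([list(s) for s in seqs])
    weights = []
    for row in msa:
        identity = (msa == row).mean(axis=1)  # fraction of identical columns per sequence
        weights.append(1.0 / (identity >= identity_cutoff).sum())
    return float(sum(weights))

print(neff(read_a3m("prediction/example.a3m")))  # path is a placeholder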
In case I use only a single sequence, will it use the sequence with the highest amount of overlap with my input sequence?
I haven't checked the single sequence mode, it's part of the original notebook. AFAICT it currently doesn't do anything. What it probably should do is just skip MSA generation and use only the input sequence (so no MSA information at all). I should remove this option.
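If you want to approximate that behaviour yourself, you can hand the pipeline an alignment that contains only the query, i.e. no co-evolutionary information at all. A tiny hypothetical helper (not part of the notebook) that writes such a single-sequence A3M:

def write_single_sequence_a3m(sequence, path, name="query"):
    # An "MSA" with only the query row carries no co-evolutionary signal.
    with open(path, "w") as handle:
        handle.write(f">{name}\n{sequence}\n")

write_single_sequence_a3m("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "single_sequence.a3m")  # example sequence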
Also, in case the templates option is not checked, what MSA database is it using?
The templates shouldn't affect which MSA databases are used. The MSA databases are UniRef and MGnify/BFD.
3. Is it possible to generate different models from the same run by using the same cross-links list, e.g. by changing the num_ensembles or other parameters? I tried but only one model has been generated.
Yes, by increasing "times".
4. Is it possible to gradually change the number of effective sequences (Neff) used for the MSA in the notebook, so as to see the effect this has on the overall prediction?
No, not at the moment. Keep in mind that MSA subsampling has a much stronger effect in multimer since you affect both the monomer and the interface predictions.
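If you want to experiment with this outside the notebook, one simple approach is to randomly subsample rows of the A3M before feature generation, which lowers Neff roughly in proportion. A minimal sketch (my own illustration, not AlphaLink2 code; file paths are placeholders):

import random

def subsample_a3m(in_path, out_path, keep_fraction=0.1, seed=0):
    # Read (header, sequence) records from the A3M file.
    records, header, seq = [], None, []
    for line in open(in_path):
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line.rstrip("\n"), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))

    # Always keep the query (first record); randomly keep a fraction of the rest.
    random.seed(seed)
    query, rest = records[0], records[1:]
    kept = [query] + random.sample(rest, int(len(rest) * keep_fraction))

    with open(out_path, "w") as out:
        for hdr, s in kept:
            out.write(f"{hdr}\n{s}\n")

subsample_a3m("prediction/example.a3m", "example_subsampled.a3m", keep_fraction=0.2)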
5. Is increasing the number of network recycling iterations a good way to generate better models in general?
Increasing recycling will generally improve performance. It seems to help most for low Neff targets.
More on the local installation performance: how much time does a prediction usually take on a machine without the latest GPU hardware? I am interested because it would be nice to get one, so if you can share your experience, that would be tremendously useful.
That depends very much on your target size and your parameters. Prediction time scales linearly with the number of recycling iterations and (in theory) cubically with your target size. Predicting Cullin4 in the paper took around 16 hours, if I remember correctly. I can't speak to older hardware; we make heavy use of recent hardware (A100) with flash-attention and bfloat16. Even one generation earlier (V100) will probably be slower by a factor of 4 and will limit the target size considerably (you could try fp16).
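As a rough back-of-the-envelope estimate, you can extrapolate from a single measured run using those scaling assumptions (linear in recycling iterations, roughly cubic in total sequence length). A small illustration; the numbers are placeholders, not benchmarks:

def estimated_runtime_hours(baseline_hours, baseline_length, baseline_recycles,
                            target_length, target_recycles):
    # Assumes runtime ~ recycles * length^3, extrapolated from one measured run.
    return (baseline_hours
            * (target_recycles / baseline_recycles)
            * (target_length / baseline_length) ** 3)

# Example: a 500-residue complex measured at 1 hour with 3 recycling iterations,
# extrapolated to a 1000-residue complex with 6 recycling iterations.
print(estimated_runtime_hours(1.0, 500, 3, 1000, 6))  # -> 16.0 hours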
@lhatsk thanks a lot for the responses to all my questions.
The templates and features are not included in the download by default. You would need to download them yourself. This is possible by creating a new code cell (+ Code) and executing the following code snippet:
from google.colab import files  # needed for files.download() in Colab
! tar cvfz inputs.tar.gz {output_dir}
files.download('inputs.tar.gz')
Thanks a lot for the clarifications and the great help.
Thanks a lot for developing such a promising tool for modeling previously unsolved protein structures! I have several questions regarding the notebook and a few questions about performance after local installation:
1) From what I understood from the Nat. Biotech paper, the algorithm combines co-evolutionary information with sparse XL-MS data embedded in the pair representation (which in turn updates the MSA representation, biasing the retrieval of co-evolutionary information from the MSA). No template structures from the PDB are used to train the neural network, unlike AlphaFold, which can also take PDB structures as input. Is that correct?
2) AlphaLink2 in the notebook uses MMseqs2 to build the multiple sequence alignment (same as AlphaFold). Where can I find the templates used when the templates option is checked? What is the default Neff value it uses for the MSA when templates are checked? In case I use only a single sequence, will it use the sequence with the highest amount of overlap with my input sequence? Also, in case the templates option is not checked, which MSA database is it using?
3) Is it possible to generate different models from the same run using the same cross-links list, e.g. by changing num_ensembles or other parameters? I tried, but only one model was generated.
4) Is it possible to gradually change the number of effective sequences (Neff) used for the MSA in the notebook, so as to see the effect this has on the overall prediction? I have one test case in which AlphaFold already predicts the complex accurately, and when plotting the cross-links onto the AlphaFold structure, all of them are already satisfied. AlphaLink2 also generates an accurate model with 100% cross-link satisfaction (here I guess the cross-links did not add much bias). In another case, both predict a wrong complex (a more difficult scenario in which the binder does not bind a large flat surface but a rather small area, in particular a loop). Plotting the cross-links onto the X-ray model gives only about 50% satisfaction (my XL search was filtered at 1% FDR and manually inspected to check for good MS/MS spectra). Anyway, I tried removing all the overlength cross-links (even though I've read the software can handle up to 50% false positives using simulated restraints) and using only the ones within the 25 Å distance for which the algorithm already has predefined network weights, but there is not much difference, even if the cross-links are satisfied. So in this case I was thinking of reducing the weight of the MSA so as to increase the cross-link bias and maybe get the correct model.
5) Is increasing the number of network recycling iterations a good way to generate better models in general?
More on the local installation performance: how much time does a prediction usually take on a machine without the latest GPU hardware? I am interested because it would be nice to get one, so if you can share your experience, that would be tremendously useful.
Thanks in advance and, again, thanks for developing the tool.