Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International

Questions about notebook and performances #13

Closed alessiodiianni closed 11 months ago

alessiodiianni commented 1 year ago

Thanks a lot for developing such a promising tool for modeling previously unsolved protein structures! I have several questions about the notebook and a few about performance after local installation:

1. From what I understood from the Nat. Biotech paper, the algorithm combines co-evolutionary information with sparse XL-MS data embedded in the pair representation (which in turn updates the MSA representation so as to bias the retrieval of co-evolutionary information from the MSA). No template PDB structures are used to train the neural network, unlike AlphaFold, which can also take PDB structures as input. Is that correct?
2. AlphaLink2 in the notebook uses MMseqs2 to build the multiple sequence alignment (same as AlphaFold). Where can I find the templates used for the MSA when the templates option is checked? What is the default Neff value it uses for the MSA when templates is checked? If I use only a single sequence, will it use the sequence with the highest overlap with my input sequence? Also, if the templates option is not checked, which MSA database does it use?
3. Is it possible to generate different models from the same run using the same cross-link list, e.g. by changing num_ensembles or other parameters? I tried, but only one model was generated.
4. Is it possible to gradually change the number of effective sequences (Neff) used for the MSA in the notebook, to see the effect this has on the overall prediction? I have one test case in which AlphaFold already predicts the complex accurately, and when I plot the cross-links onto the AlphaFold structure, all of them are already satisfied. AlphaLink2 also generates an accurate model with 100% cross-link satisfaction (here I guess the cross-links did not add much bias).
   In another case, both predict a wrong complex (a more difficult scenario in which the binder is bound not on a large flat surface but on a quite small area, in particular on a loop). Plotting the cross-links onto the X-ray model gives about 50% satisfaction (my XL search was filtered at 1% FDR and manually inspected for good MS/MS spectra). I tried removing all the overlength cross-links (even though I've read that the software can handle up to 50% false positives with simulated restraints) and using only those within the 25 Å distance for which the algorithm already has predefined network weights, but it makes little difference, even if the cross-links are satisfied. So in this case I was thinking of reducing the weight of the MSA to increase the cross-link bias and maybe get the correct model.
5. Is increasing the number of network reiterations a good way to generate a better model in general?
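
For readers who want to reproduce the "cross-link satisfaction" check described above, the following is a minimal sketch, not part of AlphaLink2. It assumes Cα coordinates have already been extracted from a model into a dictionary keyed by (chain, residue number), and it uses the 25 Å cutoff mentioned in the question. The function name, the data layout, and the toy coordinates are all hypothetical.

```python
import math

CUTOFF_ANGSTROM = 25.0  # threshold quoted in the thread; adjust for your cross-linker


def crosslink_satisfaction(ca_coords, crosslinks, cutoff=CUTOFF_ANGSTROM):
    """Return (fraction of satisfied links, list of overlength links).

    ca_coords:  dict mapping (chain_id, residue_number) -> (x, y, z) in Angstrom
    crosslinks: list of ((chain_id, residue_number), (chain_id, residue_number))
    """
    satisfied = 0
    overlength = []
    for a, b in crosslinks:
        d = math.dist(ca_coords[a], ca_coords[b])  # Euclidean Ca-Ca distance
        if d <= cutoff:
            satisfied += 1
        else:
            overlength.append((a, b, d))
    fraction = satisfied / len(crosslinks) if crosslinks else 0.0
    return fraction, overlength


# Toy coordinates (made up for illustration, not from a real structure).
toy_coords = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("B", 5): (10.0, 0.0, 0.0),
    ("A", 50): (0.0, 30.0, 0.0),
    ("B", 80): (0.0, 0.0, 40.0),
}
toy_links = [(("A", 10), ("B", 5)), (("A", 50), ("B", 80))]

fraction, overlength = crosslink_satisfaction(toy_coords, toy_links)
print(f"satisfied: {fraction:.0%}, overlength: {len(overlength)}")
```

Running the same function on the predicted model and on the X-ray model gives a like-for-like comparison of which links each structure violates.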

More on local installation performance: how long does a prediction usually take on a machine without the latest GPU hardware? I am asking because it would be nice to get one, so if you can share your experience that would be tremendously useful.

Thanks in advance and again, thanks for developing the tool

lhatsk commented 1 year ago
1. From what I understood from the Nat. Biotech paper, the algorithm combines co-evolutionary information with sparse XL-MS data embedded in the pair representation (which in turn updates the MSA representation so as to bias the retrieval of co-evolutionary information from the MSA). No template PDB structures are used to train the neural network, unlike AlphaFold, which can also take PDB structures as input. Is that correct?

Correct for AlphaLink (Nat. Biotech, monomer). For AlphaLink2 (this repository, multimer), we used templates.

Where can I find the templates used for MSA when the templates option is checked?

You can find the templates in the prediction folder.

ls prediction

For example:

prediction/alphalink_colab_17cb8/pdb70.m8 contains the templates.

prediction/alphalink_colab_17cb8/templates_101 contains the .cif.
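
To list the template hits programmatically, a minimal sketch follows. It assumes that pdb70.m8 is a tab-separated file in the BLAST-style tabular ("m8") layout with 12 columns and that the second column holds the template identifier; this should be checked against the actual file. The function name and the example rows are hypothetical.

```python
import csv
import io

# Assumed column order of the BLAST-style tabular (m8) format.
M8_COLUMNS = [
    "query", "target", "pident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bitscore",
]


def read_m8(handle):
    """Parse m8-style rows from an open text handle into a list of dicts."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(M8_COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(M8_COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up example rows; real use would be open(".../pdb70.m8") instead.
example = (
    "101\t1abc_A\t45.2\t200\t100\t5\t1\t200\t3\t205\t1e-50\t180\n"
    "101\t2xyz_B\t30.0\t150\t90\t8\t10\t160\t1\t150\t1e-20\t95\n"
)
hits = read_m8(io.StringIO(example))
templates = [h["target"] for h in hits]
print(templates)
```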

What is the default Neff value it uses for MSA when templates is checked?

We don't subsample the MSAs here, so the Neff is unchanged and corresponds to whatever MMseqs2 gives us.

In case I use only a single sequence, will it use the sequence with the highest amount of overlap with my input sequence?

I haven't checked the single sequence mode; it's part of the original notebook. AFAICT it currently doesn't do anything. What it probably should do is skip MSA generation and use only the input sequence (so no MSA information at all). I should remove this option.

Also, in case the templates option is not checked, what MSA database is it using?

The templates shouldn't affect the MSA database that is used. The MSA databases are UniRef and MGnify/BFD.

3. Is it possible to generate different models from the same run by using the same cross-links list, e.g. by changing the num_ensembles or other parameters? I tried but only one model has been generated.

Yes, by increasing "times".

4. Is it possible to gradually change the number of effective sequences (Neff) used for the MSA in the notebook, to see the effect this has on the overall prediction?

No, not at the moment. Keep in mind that MSA subsampling has a much stronger effect in multimer since you affect both the monomer and the interface predictions.

5. Is increasing the number of network reiterations a good way to generate a better model in general?

Increasing recycling will generally improve performance. It seems to help most for low Neff targets.

More on local installation performance: how long does a prediction usually take on a machine without the latest GPU hardware? I am asking because it would be nice to get one, so if you can share your experience that would be tremendously useful.

That depends very much on your target size and your parameters. Prediction time scales linearly with the number of recycling iterations and (in theory) cubically with your target size. Predicting Cullin4 in the paper took around 16 hours, if I remember correctly. I cannot speak for older hardware. We rely heavily on recent hardware (A100) with flash-attention and bfloat16. Even one generation earlier (V100s) will probably be slower by a factor of 4 and will limit the target size considerably (you could try fp16).
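
The scaling rule above can be turned into a rough back-of-the-envelope estimate. The sketch below only encodes the stated assumptions (linear in recycling iterations, cubic in total residue count) relative to a reference run you have timed yourself; it ignores hardware, memory limits, and fixed overheads, and the function name is hypothetical.

```python
def relative_runtime(n_residues, n_recycles, ref_residues, ref_recycles):
    """Estimated runtime as a multiple of a reference run.

    Assumes runtime ~ recycles * residues**3, the theoretical scaling
    quoted in the thread. Real timings will deviate from this.
    """
    return (n_recycles / ref_recycles) * (n_residues / ref_residues) ** 3


# Doubling the target size at the same recycling count: about 8x longer.
print(relative_runtime(2000, 3, 1000, 3))
# Doubling the recycling count at the same size: about 2x longer.
print(relative_runtime(1000, 6, 1000, 3))
```

Multiplying the result by the measured wall-clock time of the reference run gives a first guess for a larger target on the same hardware.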

alessiodiianni commented 1 year ago

@lhatsk thanks a lot for the responses to all my questions.

  1. Thanks for the clarification. So the templates checkbox refers to the PDB templates (as shown in the attached screenshot)?
  2. From the notebook, I can download a prediction folder which contains the .pdb files of the predictions, JSON files with the pLDDT and pTM scores for the model and the chain ID mapping, and HTML files for the plots generated in the notebook (PAE, pLDDT). OK, so changing the MSA mode from MMseqs2 to single sequence basically deactivates the MSA step (I think it is still a good thing to keep, to evaluate performance as I am trying to do now). I don't have any information about the MSA templates (the pdb70.m8 file, I guess) or the PDB templates in .cif format. Could you help me out with that? Thanks a lot for the explanation of the MSA question; so Neff varies depending on each query sequence you feed the notebook.
  3. Thanks for this, I will try it out.
  4. I see, but I would like to see whether giving more weight to my cross-links helps in generating a better solution.
  5. Thanks for this as well, it is a very good indication.

Looking forward to your responses and to trying it again in the coming days.

lhatsk commented 1 year ago
  1. use_templates includes templates from the pdb70
  2. The Neff is the number of effective sequences in an MSA. It varies with the input sequence and the databases being used. In addition, since AlphaFold2/AlphaLink2 limits the number of sequences in an MSA (due to memory limitations), each MSA is randomly subsampled, which also changes the Neff. MSA masking is another source of non-determinism that can affect the Neff.
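
To illustrate how random subsampling changes the Neff, here is a minimal sketch. It is not the AlphaLink2 or AlphaFold2 implementation. It assumes one common definition of Neff (each sequence weighted by the inverse of the number of sequences at or above 80% identity to it), and it assumes the query is the first row and is always kept when subsampling. The threshold, function names, and toy alignment are illustrative assumptions.

```python
import random


def identity(a, b):
    """Fraction of identical residues over positions where neither row has a gap."""
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    return sum(1 for x, y in aligned if x == y) / len(aligned)


def neff(msa, threshold=0.8):
    """Sum of 1 / (number of neighbours at >= threshold identity), self included."""
    total = 0.0
    for s in msa:
        neighbours = sum(1 for t in msa if identity(s, t) >= threshold)
        total += 1.0 / max(neighbours, 1)  # guard against all-gap rows
    return total


def subsample(msa, max_seqs, seed=0):
    """Keep the query (first row, an assumption) plus a random subset of the rest."""
    if len(msa) <= max_seqs:
        return list(msa)
    rng = random.Random(seed)
    return [msa[0]] + rng.sample(msa[1:], max_seqs - 1)


# Toy alignment: three near-identical rows and one unrelated row.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
full_neff = neff(toy_msa)
small = subsample(toy_msa, 2, seed=0)
print(full_neff, neff(small))
```

Running `subsample` with different seeds on a real MSA shows how much the Neff can move between otherwise identical predictions.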

The templates and features are not included in the download by default. You would need to download them yourself. This is possible by creating a new code cell (+ Code) and executing the following code snippet:

! tar cvfz inputs.tar.gz {output_dir}
from google.colab import files  # only needed if `files` is not already imported in the notebook
files.download('inputs.tar.gz')

alessiodiianni commented 1 year ago

Thanks a lot for the clarifications and the great help.