Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International

Clarification of inputs and outputs. #2

Open mcale6 opened 1 year ago

mcale6 commented 1 year ago

Thanks a lot for updating the code, as the flash-attention installation didn't work for me. :(

There is also a version conflict between protobuf and tensorboardX, so I needed to run

pip install protobuf==3.20.0

and set:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

Then the prediction works, with output like this:

...
Inference time: 118.5829746130621
Model 3 Crosslink satisfaction: 0.000 Model confidence: 0.843
....
Model 9 Crosslink satisfaction: 0.000 Model confidence: 0.832
plddts {'AlphaLink2_AlphaLink-Multimer_SDA_v3.pt_78663_0.843': '0.8924474'}
ptms {'AlphaLink2_AlphaLink-Multimer_SDA_v3.pt_78663_0.843': '0.8425047'}

What is a good Crosslink satisfaction? How many crosslinks are needed to improve the satisfaction and what can I infer from this "score"?

The output folder:

....
AlphaLink2_99982_0.722.pdb
AlphaLink2_AlphaLink-Multimer_SDA_v3.pt_78663_0.843_best.pdb
AlphaLink2_AlphaLink-Multimer_SDA_v3.pt_78663_0.843_outputs.pkl.gz
...

With --save_raw_output, only the best model is saved, but I would like to save all models. The code seems intended to save all models, so there might be a bug somewhere. I think the filename of the pickle file might be mixed up.

lhatsk commented 1 year ago

Hi, Cool that you are using AlphaLink2! Thanks for pointing out these issues!

Do you still have a trace of what went wrong with flash-attention?

The protobuf issue is still puzzling to me. I have locally protobuf==3.20.3 and it's working fine. Something seems to overwrite it later, a colleague of mine had the same issue just now. Will investigate.

Regarding crosslink satisfaction: Which crosslinker are you using? What's the expected distance cutoff? By default, I used 25 Å for SDA. I have now added an option, --cutoff, to change the default. Maybe that resolves it? Keep in mind that satisfaction is a binary metric and quite harsh. Maybe we should just remove it. I would suggest looking at the distances before and after integrating crosslinks to assess the effect. The model confidence looks good! What do you get for AlphaFold-Multimer v3 without crosslinks?

Good crosslink satisfaction depends on your FDR, on the agreement of the crosslinks with the co-evolutionary information, and on whether the crosslinks cover multiple states. How many crosslinks do you have and what is the FDR? In general, more crosslinks will increase the effect, but we have seen large effects even with a single crosslink.
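To make the "binary and quite harsh" point concrete, here is a minimal sketch of a cutoff-based satisfaction score next to the raw distances. It is not AlphaLink2's code: the function name, the dictionary layout, the toy coordinates, and the choice of measuring Cα-Cα distances are all assumptions made for illustration.

```python
import math

def crosslink_satisfaction(ca_coords, crosslinks, cutoff=25.0):
    """Fraction of crosslinks whose distance is within `cutoff` (in Å).

    ca_coords:  dict mapping (chain, residue_number) -> (x, y, z)
                (assumed here to be C-alpha positions)
    crosslinks: list of ((chain, resnum), (chain, resnum)) pairs
    """
    if not crosslinks:
        return 0.0
    satisfied = sum(
        1 for a, b in crosslinks
        if math.dist(ca_coords[a], ca_coords[b]) <= cutoff
    )
    return satisfied / len(crosslinks)

# Hypothetical toy data: two inter-chain links, before and after adding crosslinks.
links = [(("A", 10), ("B", 20)), (("A", 40), ("B", 55))]
baseline = {
    ("A", 10): (0.0, 0.0, 0.0), ("B", 20): (30.0, 0.0, 0.0),   # 30 Å apart
    ("A", 40): (0.0, 10.0, 0.0), ("B", 55): (0.0, 38.0, 0.0),  # 28 Å apart
}
with_links = {
    ("A", 10): (0.0, 0.0, 0.0), ("B", 20): (22.0, 0.0, 0.0),   # 22 Å apart
    ("A", 40): (0.0, 10.0, 0.0), ("B", 55): (0.0, 36.0, 0.0),  # 26 Å apart
}

for name, coords in (("baseline", baseline), ("with crosslinks", with_links)):
    dists = [math.dist(coords[a], coords[b]) for a, b in links]
    score = crosslink_satisfaction(coords, links)
    print(f"{name}: distances={dists} satisfaction={score:.3f}")
```

In this toy case the second link moves from 28 Å to 26 Å, which is an improvement, but it still counts as unsatisfied at a 25 Å cutoff. That is why comparing the distances themselves can be more informative than the score.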

I have now moved saving the raw output inside the loop, so every model will be saved.

mcale6 commented 1 year ago

Thank you for your response.

flash-attention problem: I raised an issue here (https://github.com/HazyResearch/flash-attention/issues/280). There seems to be a problem when compiling the PyTorch and NVIDIA libraries.

choosing of cross-links: I use Xwalk (trypsin digestion, max distance 25 Å) for inter-residue pairs (Lys, Ser, Thr, and Tyr) on a complex previously predicted with standard AF-m. From the output, I take these residues to generate the crosslink file for AlphaLink2 input (I only got 2). I chose an FDR of 20%, but it is not clear to me how to define this. I guess one could just perturb it to get different sampling.

I get similarly high confidence as with the standard run, but not the same structure. Some thoughts: let's say the max distance is 17 Å for a crosslinked pair of residues in the true complex. In the prediction, I could have a higher or lower distance (let's say by at least 5 Å). Is this difference penalised, something like in AlphaLink1, where restraints were implemented?

Some more general questions: Do you think that, by defining only inter-residue crosslinks, the satisfaction score will depend more/only on the paired information in the MSA? Can I hypothesize that the crosslinks will put an emphasis on which co-evolutionary information in the MSA to focus on? Does it matter when the outer product between the pair and MSA representations is done (in AF-m they switched it to the beginning)?

lhatsk commented 1 year ago

> choosing of cross-links: I use Xwalk (trypsin digestion, max distance 25 Å) for inter-residue pairs (Lys, Ser, Thr, and Tyr) on a complex previously predicted with standard AF-m. From the output, I take these residues to generate the crosslink file for AlphaLink2 input (I only got 2). I chose an FDR of 20%, but it is not clear to me how to define this. I guess one could just perturb it to get different sampling.

Interesting, so the links that were previously satisfied in the AF-m prediction are no longer satisfied? How far off are the crosslinks? What are you hoping for? My guess is that with such a high model confidence, the influence will be small. I would also reduce the FDR to 0.05 and maybe increase recycling to 20. We use 3 iterations by default; your v3 run will likely have had 20.

> I get similar high confidence as with the standard run but not the same structure.

Does the prediction make sense? Variability might stem from different MSA subsamples (if your MSAs are large enough) and the additional training, also other types of models. Does the AF-m model_1 also look different?

> Some thoughts: let's say the max distance is 17 Å for a crosslinked pair of residues in the true complex. In the prediction, I could have a higher or lower distance (let's say by at least 5 Å). Is this difference penalised, something like in AlphaLink1, where restraints were implemented?

No, it wouldn't be penalised. At the moment it's only cutoff-based, so both predictions would be perfectly fine. We will provide a model with distograms in the future to allow more flexibility.

> Some more general questions: Do you think that, by defining only inter-residue crosslinks, the satisfaction score will depend more/only on the paired information in the MSA?

Sorry, I am not quite sure I understand. The satisfaction depends on the agreement of co-evolutionary and crosslinking information in the end. For inter-residue crosslinks, this would be the paired information.

> Can I hypothesize that the crosslinks will put an emphasis on which co-evolutionary information in the MSA to focus on?

Yes, since they bias the retrieval.

> Does it matter when the outer product between the pair and MSA representations is done (in AF-m they switched it to the beginning)?

I don't think it does because in the worst case the information exchange is only delayed by one layer, but maybe worth checking out.

sami-chaaban commented 1 year ago

> The protobuf issue is still puzzling to me. I have locally protobuf==3.20.3 and it's working fine. Something seems to overwrite it later, a colleague of mine had the same issue just now. Will investigate.

I am also having this issue, unfortunately.

lhatsk commented 1 year ago

I see that Uni-Core is overwriting the version. I changed the order of installation now.

pip install protobuf==3.20.1

should fix your problem. I still need to check if there is now a conflict with the two pytorch versions side-by-side

Edit: I updated the instructions. Works fine for me.
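Since the version is apparently overwritten by a later install step, it can help to check what is actually active in the environment after all installs have finished. The sketch below uses only the standard library. The helper name is made up for illustration, and the package names are assumed to match what pip reports.

```python
import os
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Packages involved in the conflict discussed in this thread.
for pkg in ("protobuf", "tensorboardX"):
    print(f"{pkg}: {installed_version(pkg)}")

# The workaround variable mentioned earlier in the thread; None means it is unset.
print("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION:",
      os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"))
```

Running this once after the whole installation shows whether a later package replaced the pinned protobuf version.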

mcale6 commented 1 year ago

Yes, I use many recycling steps and a high FDR. I played around with the crosslinks and I get different docking sites. I expect this, as my target can have many possible interaction sites, which I also get from AF-m. My goal would be to sample just one conformation for a given set of crosslinks, but I guess the variability of the sampled MSA is counterproductive in my case. Is the MSA subsampled by default? Also, on a side note: for complexes, the Neff does not correlate with prediction accuracy as it does for monomers (3.8 in https://www.biorxiv.org/content/10.1101/2023.05.16.541055v1.full).

Cross-linking information is related to the pairing information (that was what I thought).

Incorporating distograms would be awesome :)

lhatsk commented 1 year ago

> Yes, I use many recycling steps and a high FDR. I played around with the crosslinks and I get different docking sites. I expect this, as my target can have many possible interaction sites, which I also get from AF-m. My goal would be to sample just one conformation for a given set of crosslinks, but I guess the variability of the sampled MSA is counterproductive in my case. Is the MSA subsampled by default?

How large are your MSAs? AlphaFold always subsamples MSAs exceeding max_msa_clusters: https://github.com/Rappsilber-Laboratory/AlphaLink2/blob/main/unifold/config.py#L648C12-L648C12 https://github.com/deepmind/alphafold/blob/main/alphafold/model/config.py#L231

The rest is aggregated in the ExtraMSAStack. This has to be done to constrain memory. You could try increasing max_msa_clusters to see if it helps and still fits in memory. I haven't experimented much with it to contain the variance, though.

I believe that, because of the way we trained AlphaLink2, the anchoring effect of crosslinks that we observed in AlphaLink1 is less pronounced. I hope to increase it in the future. Still, the sampling should be more focused, but this will also depend on the number of links. Maybe going up to 30 Å in your Xwalk simulation adds additional support.
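The subsampling described above can be sketched as follows. This is a simplified illustration of the idea (keep the query, randomly pick cluster centres up to max_msa_clusters, send the remainder to the extra MSA) and not the actual AlphaFold or Uni-Fold code. The function name and the toy sequences are hypothetical.

```python
import random

def sample_msa(msa, max_msa_clusters, seed=None):
    """Split an MSA into cluster centres and extra sequences.

    The first row (the query) is always kept. The remaining rows are
    shuffled, and the first `max_msa_clusters - 1` of them become cluster
    centres. Everything else is returned as the extra MSA.
    """
    rng = random.Random(seed)
    query, rest = msa[0], list(msa[1:])
    rng.shuffle(rest)
    keep = max(max_msa_clusters - 1, 0)
    return [query] + rest[:keep], rest[keep:]

# Hypothetical toy MSA: one query plus nine homologues.
msa = ["QUERY"] + [f"HOMOLOG_{i}" for i in range(1, 10)]

for seed in (0, 1):
    clusters, extra = sample_msa(msa, max_msa_clusters=4, seed=seed)
    print(f"seed={seed} clusters={clusters} n_extra={len(extra)}")
```

Different seeds can select different cluster centres, which is one source of run-to-run variability. A larger max_msa_clusters leaves fewer sequences to chance, at the cost of memory.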

> Also, on a side note: for complexes, the Neff does not correlate with prediction accuracy as it does for monomers (3.8 in https://www.biorxiv.org/content/10.1101/2023.05.16.541055v1.full).

Thanks for pointing this out!

> Cross-linking information is related to the pairing information (that was what I thought).

Yes, absolutely!

mcale6 commented 1 year ago

It exceeds max_msa_clusters. The clustering of the sequences should yield very similar or identical results across different runs of the AlphaFold pipeline. I had thought that AlphaLink2 would subsample to Neff=10, but this seems not to be the case, which is good for me.

Yes, I will try shuffling and changing the distances in the Xwalk simulation.

Thanks :)

lhatsk commented 1 year ago

No, we currently do not subsample the MSAs in AlphaLink2.

mcale6 commented 1 year ago

protobuf still seems to give an error for me. After some trial and error, the following worked without setting export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (which I previously needed with the suggested standard installation):

Use a conda env with Python 3.10.

Replace the PyTorch installation with:

conda install pytorch=2.0 pytorch-cuda=11.8 -c pytorch -c nvidia

For Uni-Core, use the latest release:

pip install https://github.com/dptech-corp/Uni-Core/releases/download/0.0.3/unicore-0.0.1+cu118torch2.0.0-cp310-cp310-linux_x86_64.whl

Remove protobuf==3.20.1 from the pip install line, leaving:

pip install absl-py==1.0.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.9 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.3.25 ml-collections==0.1.0 numpy==1.23.3 pandas scipy tensorflow-cpu

lhatsk commented 1 year ago

Thanks! I will test it. We may be able to remove that pip altogether and let AlphaFold handle the dependencies if the newer protobuf is no longer an issue.

lhatsk commented 1 year ago

I updated the instructions once more. They are much simpler now. Everything is working smoothly on my end. Thanks for your input and patience! I will look into the flash-attention issue. It looks like the CUDA packages are runtime-only.

lhatsk commented 1 year ago

I managed to compile and install flash-attention within the conda environment with the following packages:

conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc cuda-cudart-dev libcusolver-dev libcublas-dev libcufft-dev libcusparse-dev

I compiled it directly on the node with the A100.

You might also need conda install pytorch-cuda=11.8 -c pytorch -c nvidia. I installed it at the beginning but not sure if it is actually required. Will try it once more from scratch.

Then set CUDA_HOME to your conda environment, i.e., install flash-attention with:

CUDA_HOME=YOUR_PATH/conda/envs/alphalink python setup.py build -j 8 install
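Before starting the fairly long build, it may be worth confirming that the directory used as CUDA_HOME actually contains the compiler. The sketch below only checks for a bin/nvcc file under that directory. The helper name is made up, and the bin/nvcc location is an assumption about how the conda cuda-nvcc package lays out its files.

```python
import os
from pathlib import Path

def find_nvcc(cuda_home):
    """Return the path to nvcc under `cuda_home`/bin, or None if it is missing."""
    if not cuda_home:
        return None
    candidate = Path(cuda_home) / "bin" / "nvcc"
    return candidate if candidate.is_file() else None

# Reads CUDA_HOME from the current environment; it is unset in most shells by default.
cuda_home = os.environ.get("CUDA_HOME")
nvcc = find_nvcc(cuda_home)
if nvcc is None:
    print(f"No nvcc found under CUDA_HOME={cuda_home!r}")
else:
    print(f"Found nvcc at {nvcc}")
```

If nvcc is not found, the environment probably has only the runtime CUDA packages, and the conda install line above is the place to start.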