Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International

MSA Subsampling Feature #27

Open camillapaleari opened 4 months ago

camillapaleari commented 4 months ago

Hello there! Thanks a lot for this very useful tool! I'm an active user of both AlphaLink and AlphaLink2, and I appreciate all the work that has gone into them.

I have a question regarding MSA subsampling. In AlphaLink1, there is an input flag --neff that allows subsampling the number of effective sequences. This feature is handy for me because, without it, the crosslinks are not taken into consideration.

Is there a similar option implemented in AlphaLink2? If not, would it be possible to add this feature, or could you provide guidance on how to modify the code to achieve this?

Thank you very much for your help!

decortja commented 4 months ago

I am also interested in this, and more broadly in supplying my own, precomputed MSAs. Uni-Fold has the use_precomputed_msas flag for this purpose. Is it possible to add this flag in the AL2 code? Or is there already a way to do this that I have overlooked?

lhatsk commented 4 months ago

Hi,

We haven't used MSA subsampling in AlphaLink2 because it was mostly unnecessary during training, and because it has a much bigger impact in the multimer setting, where it affects both the monomers and the interface. Something I have used (only manually) is removing / zeroing out the columns in the MSA where I have crosslinks (see https://github.com/Rappsilber-Laboratory/AlphaLink2/issues/9#issuecomment-1853663412). This way the effect is much more localized, and more pronounced in the interface, since the information there is much sparser to begin with.
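As a rough illustration of the column-removal idea, here is a minimal NumPy sketch. It is not the code from the linked comment: the function name `mask_crosslinked_columns`, the gap token id of 21, and the choice to leave the query row (row 0) untouched are all assumptions made for this example.

```python
import numpy as np

GAP_ID = 21  # assumption: gap token id in the integer-encoded MSA


def mask_crosslinked_columns(msa, crosslinked_positions, gap_id=GAP_ID):
    """Return a copy of `msa` with the homolog rows gapped out at the given columns.

    msa: integer array of shape (num_sequences, num_residues).
    crosslinked_positions: 0-based residue indices that carry a crosslink.
    Row 0 (assumed to be the query sequence) is left unchanged.
    """
    masked = np.array(msa, copy=True)
    cols = np.asarray(sorted(set(crosslinked_positions)), dtype=int)
    if cols.size:
        masked[1:, cols] = gap_id
    return masked


# Toy MSA: 4 sequences, 5 residues, with crosslinks at residues 1 and 3.
msa = np.arange(20).reshape(4, 5) % 20
masked = mask_crosslinked_columns(msa, [1, 3])
```

Only the crosslinked columns lose their coevolution signal here; every other column keeps the full alignment depth.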

At the moment, there is no option to do Neff subsampling in AlphaLink2, but everything is there to do it with a bit of manual intervention. I will add this option in the next week or so.

Essentially what you need to do is add these imports to AlphaLink2/unifold/dataset.py:

from unifold.data.msa_subsampling import (
    get_eff,
    subsample_msa_sequentially,
)

and then do the subsampling here: https://github.com/Rappsilber-Laboratory/AlphaLink2/blob/ca6d5b3089b91269ec2a1e64b74e2b0d30e55c4b/unifold/dataset.py#L227

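# Target Neff: here a random integer from 1 to 24, or whatever neff you want (assumes numpy is imported as np)
neff = np.random.randint(1, 25)
# Row indices of the sequences to keep
indices = subsample_msa_sequentially(all_chain_features["msa"], neff=neff)
# Apply the same row selection to every per-sequence feature so they stay aligned
all_chain_features["msa"] = all_chain_features["msa"][indices]
all_chain_features["deletion_matrix"] = all_chain_features["deletion_matrix"][indices]
all_chain_features["msa_mask"] = all_chain_features["msa_mask"][indices]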
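For readers unfamiliar with Neff, the following self-contained toy shows one common way to define it and to subsample an MSA sequentially until a target Neff is reached. This is an assumption-laden sketch, not the implementation in `unifold.data.msa_subsampling`: the helper names `neff_of` and `subsample_sequentially`, the 0.8 identity threshold, and the stopping rule are all chosen for illustration and may differ from the repository's code.

```python
import numpy as np


def neff_of(msa, identity_threshold=0.8):
    """Number of effective sequences: each sequence is weighted by
    1 / (number of sequences at or above the identity threshold, itself included)."""
    msa = np.asarray(msa)
    # Pairwise fraction of identical positions, shape (num_seq, num_seq).
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=-1)
    neighbours = (identity >= identity_threshold).sum(axis=1)
    return float((1.0 / neighbours).sum())


def subsample_sequentially(msa, target_neff, identity_threshold=0.8):
    """Keep the first n rows, for the smallest n whose Neff reaches the target.

    Recomputing Neff for every prefix is slow; it is acceptable for a toy only.
    """
    msa = np.asarray(msa)
    for n in range(1, msa.shape[0] + 1):
        if neff_of(msa[:n], identity_threshold) >= target_neff:
            return np.arange(n)
    return np.arange(msa.shape[0])


# Toy MSA: rows 0 and 1 are identical, rows 2 and 3 are unrelated.
toy_msa = np.array([
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9],
    [10, 11, 12, 13, 14],
])
kept = subsample_sequentially(toy_msa, target_neff=2)
```

In the toy, the two identical rows count as one effective sequence between them, so three rows are needed to reach a Neff of 2.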

use_precomputed_msas is on by default. I think you only need to place your MSAs in the respective chain folders of your output directory and then remove the chain.feature.pkl.gz / chain.uniprot.pkl.gz files to recompute them.
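The cleanup step could be scripted along these lines. This is a hedged sketch: the helper names are hypothetical, and it assumes the cached files end in `.feature.pkl.gz` or `.uniprot.pkl.gz` and sit somewhere under the output directory. It defaults to a dry run so you can inspect the list before anything is deleted.

```python
import tempfile
from pathlib import Path

# Assumption: cached feature files carry one of these suffixes.
CACHE_SUFFIXES = (".feature.pkl.gz", ".uniprot.pkl.gz")


def find_cached_features(output_dir):
    """List cached feature files anywhere below `output_dir`."""
    root = Path(output_dir)
    return sorted(p for p in root.rglob("*.pkl.gz") if p.name.endswith(CACHE_SUFFIXES))


def clear_cached_features(output_dir, dry_run=True):
    """Return the cached files found; delete them only when dry_run is False."""
    cached = find_cached_features(output_dir)
    if not dry_run:
        for path in cached:
            path.unlink()
    return cached


# Demonstration on a throwaway directory with a made-up layout.
with tempfile.TemporaryDirectory() as tmp:
    chain_dir = Path(tmp) / "A"
    chain_dir.mkdir()
    (Path(tmp) / "A.feature.pkl.gz").touch()
    (Path(tmp) / "A.uniprot.pkl.gz").touch()
    (chain_dir / "my_alignment.a3m").touch()  # a user-supplied MSA stays untouched

    listed = [p.name for p in clear_cached_features(tmp)]  # dry run
    removed = [p.name for p in clear_cached_features(tmp, dry_run=False)]
    remaining = sorted(p.name for p in Path(tmp).rglob("*") if p.is_file())
```

After the cached files are gone, the next run should rebuild the features from whatever MSAs are in the chain folders, as described above.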

lhatsk commented 3 months ago

I added an option to subsample the MSAs to a given Neff, see the updated README. I added a second option that removes only MSA information at the crosslinked residues. Both can be used in combination.

andreadiianni commented 3 months ago
[Attached screenshot: alphalink_issue_MSA]

Hello, regarding MSA subsampling: with the latest version of the code I get this error when running the command. Could you explain what is causing it? Many thanks in advance for your support and help.

grandrea commented 3 months ago

Ciao, first, it seems your database is missing the pdb_seqres.txt file (or you are not pointing to it correctly), which should have been downloaded when you or your cluster admins installed the AlphaFold databases.

Second, are you sure you are running the latest version of the whole package? The neff and dropout options are not listed as arguments in inference.py, but they are recognised by run_alphalink.sh.

andreadiianni commented 2 months ago

Hello, many thanks Andrea for the support and help. Issue solved :)