Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International

Meaning of "crosslink satisfaction"? #22

Open joshwa-s opened 3 months ago

joshwa-s commented 3 months ago

I am wondering what "crosslink satisfaction" means and how we should interpret it. We have a dataset with several different crosslinkers, so some XLinks have different distance constraints. However, no part of the input takes the distance constraint(s) into account. How, then, is an XLink classified as "satisfied"?

On a slightly separate note, I wonder how the data is integrated for multimer models. If an XLink is possible between multiple different chains, how is that handled? For example, in a protein1 (3 subunits)-protein2 (3 subunits) heterohexamer, a given XLink between protein 1 and 2 can be satisfied by linking prot1chain1 to prot2chain1, prot1chain1 to prot2chain2, and so on. In this case, how is it determined whether an XLink is satisfied?

lhatsk commented 3 months ago

> I am wondering what "crosslink satisfaction" means and how we should interpret it. We have a dataset with several different crosslinkers, so some XLinks have different distance constraints. However, no part of the input takes the distance constraint(s) into account. How, then, is an XLink classified as "satisfied"?

Crosslink satisfaction (at the moment) indicates the number of crosslinked residue pairs in the interface that are below the cutoff (25 Å). The inference script has a --cutoff argument, which defaults to 25 Å because we trained on sulfoSDA. Since the network is optimised for SDA crosslinks, that is what we assume at the moment, even though it will also work ok-ish with DSSO. We also have another network for photoAA (expecting 10 Å). What are the different expected distances in your crosslink set?

> On a slightly separate note, I wonder how the data is integrated for multimer models. If an XLink is possible between multiple different chains, how is that handled? For example, in a protein1 (3 subunits)-protein2 (3 subunits) heterohexamer, a given XLink between protein 1 and 2 can be satisfied by linking prot1chain1 to prot2chain1, prot1chain1 to prot2chain2, and so on.

The crosslinks are duplicated internally, i.e., the same crosslink is applied to every copy of a homomeric chain. We will likely change this behaviour in the future to give more control, i.e., let you specify directly between which chains a crosslink should be applied. That would mean you would need to duplicate the link manually for homomers.
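As a rough illustration of what duplicating a link across chain copies could look like, here is a minimal Python sketch. The function name `expand_crosslink` and the `(chain_a, res_a, chain_b, res_b)` tuple layout are assumptions made for this example; they are not AlphaLink2's input format or internal code.

```python
from itertools import product

def expand_crosslink(res1, res2, chains1, chains2):
    """Return one (chain_a, res1, chain_b, res2) tuple per chain pairing.

    chains1 / chains2 are the chain IDs of all copies of the two
    crosslinked proteins (hypothetical layout, for illustration only).
    """
    pairs = []
    for chain_a, chain_b in product(chains1, chains2):
        if chain_a == chain_b and res1 == res2:
            continue  # skip the degenerate "residue linked to itself" case
        pairs.append((chain_a, res1, chain_b, res2))
    return pairs

# 3 copies of protein 1 and 3 copies of protein 2 give 3 x 3 = 9 pairings
links = expand_crosslink(10, 55, chains1=["A", "B", "C"], chains2=["D", "E", "F"])
print(len(links))
```

For a homo-oligomer, passing the same chain list twice also yields the intra-chain pairings (e.g. A-A), which you may want to filter out if the link is known to be inter-chain.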

> So in this case I also wonder how it is determined if a Xlink is satisfied.

This is not reflected in the crosslink satisfaction at the moment. If you have 4 ambiguous links, the satisfaction would ideally top out at 25%. For now, it is easiest to compute the satisfaction yourself afterwards, based on the prediction (PDB).
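A minimal sketch of such a post-hoc calculation on the predicted PDB is shown below. It rests on several assumptions that are not taken from AlphaLink2 itself: distances are measured between C-alpha atoms, an ambiguous crosslink counts as satisfied if any one of its candidate chain pairings is within the cutoff, and each crosslink can carry its own cutoff (useful when several crosslinkers are mixed). All function names are hypothetical.

```python
import math

def read_ca_coords(pdb_lines):
    """Map (chain_id, residue_number) -> (x, y, z) for C-alpha atoms."""
    coords = {}
    for line in pdb_lines:
        # Fixed-column PDB ATOM record: atom name in columns 13-16
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            coords[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords

def is_satisfied(coords, candidates, cutoff=25.0):
    """True if any candidate (chain_a, res_a, chain_b, res_b) is within cutoff."""
    for chain_a, res_a, chain_b, res_b in candidates:
        a = coords.get((chain_a, res_a))
        b = coords.get((chain_b, res_b))
        if a is not None and b is not None and math.dist(a, b) <= cutoff:
            return True
    return False

def satisfaction(coords, crosslinks):
    """Fraction of crosslinks satisfied; crosslinks is a list of (candidates, cutoff)."""
    if not crosslinks:
        return 0.0
    hits = sum(is_satisfied(coords, cands, cutoff) for cands, cutoff in crosslinks)
    return hits / len(crosslinks)
```

Typical use would be `coords = read_ca_coords(open("prediction.pdb"))` (file name is a placeholder), followed by `satisfaction(coords, crosslinks)`, where each entry lists every chain pairing that could explain the link together with the maximum distance for that crosslinker.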

joshwa-s commented 3 months ago

Thank you for the response. Our XLink dataset includes a number of different crosslinker distances. Most are between 25 and 30 Å, but there are a couple at 40 and 50 Å, respectively. Ideally, each XLink (specified by chain1, chain2, pos1, pos2, FDR) would also take a maximum distance constraint. This way we could use the entire XLink dataset to inform the model building. I am sure this is challenging to implement, and I apologize if it sounds ungrateful. We are very impressed by the prospect of XLink-informed model generation and are most interested in its progression. I simply want to describe our use case for your consideration in future development. Again, many thanks.

lhatsk commented 3 months ago

So for now I would just lump them all in and see what happens. Unfortunately, I think that 40-50 Å crosslinks will not be super helpful for modelling. They are too far off from the evolutionary information. You need to be able to bring the structures reasonably close for other information to take over. We have also seen this with DSSO crosslinks, which sit at roughly 30 Å: they can usually bring the structures closer, but not close enough to build an interface later on.

In principle, we have done the work for supporting more long-range crosslinks. We have a distogram model that can take arbitrary experimental distance restraints (although maxing out at 42 Å right now), but we haven't released it yet because it requires some changes to the codebase and is not fully trained, so it will not perform as well. I will see whether I can release it (as an experimental feature) when I find the time.