Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International
Homo-oligomer prediction? #8

Open heejongkim opened 11 months ago

heejongkim commented 11 months ago

Hi, Thanks for releasing a fantastic package to the scientific community. I just started testing with the example inputs to understand the input requirements and formats.

Here's my primary question: during the test, I got stuck on how to format the input files (FASTA and crosslinking data) for homodimer or homo-oligomer prediction. Have you tried this, or designed the package for this type of case?

And my side question is: for the input, do I have to follow the "A", "B", "C" naming scheme, or can I be flexible on that? I tested a few different ways, but none worked very well.

Thank you very much.

best, heejong

lhatsk commented 11 months ago

Hi Heejong,

We focused on heteromeric assemblies in this release, since homomers pose a different challenge. Nevertheless, you can predict homomers. We cannot distinguish intra- and inter-protein links in this case, therefore you would just define them as self-links:

5 A 15 A 0.1
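
As a rough illustration, such self-link lines could be generated with a few lines of Python. This is only a sketch: it assumes the five columns are residue position, chain, residue position, chain, and a confidence value (taken here to be the link's FDR), which is how the line above reads. The helper name `format_link` and the example links are made up.

```python
def format_link(res1, chain1, res2, chain2, fdr):
    """Return one crosslink line in the space-separated layout shown above."""
    return f"{res1} {chain1} {res2} {chain2} {fdr}"


# Hypothetical homodimer links, all declared as self-links on chain A.
links = [(5, 15, 0.1), (42, 87, 0.05)]
lines = [format_link(a, "A", b, "A", fdr) for a, b, fdr in links]

for line in lines:
    print(line)
```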

If, however, you would like to only include them as inter-chain links, it gets a little more complicated.

One option is to replicate the features. Say you have a homo-dimer: AlphaLink will generate A.feature.pkl.gz and A.uniprot.pkl.gz. You could copy them to B.feature.pkl.gz and B.uniprot.pkl.gz and change chains.txt from "A A" to "A B". Now you can include the inter-chain links as

5 A 15 B 0.1
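
The copy step could be scripted roughly as follows. This is a sketch under assumptions, not part of AlphaLink: it assumes both feature files and chains.txt sit in the same directory and that chains.txt holds the chain IDs on one space-separated line. The function names `replicate_chain` and `write_chains` are hypothetical.

```python
import shutil
from pathlib import Path


def replicate_chain(feature_dir, src="A", dst="B"):
    """Copy the feature files of chain `src` to chain `dst`; return the new paths."""
    feature_dir = Path(feature_dir)
    copied = []
    for suffix in ("feature.pkl.gz", "uniprot.pkl.gz"):
        source = feature_dir / f"{src}.{suffix}"
        target = feature_dir / f"{dst}.{suffix}"
        shutil.copyfile(source, target)
        copied.append(target)
    return copied


def write_chains(feature_dir, chains):
    """Overwrite chains.txt with the chain IDs on one space-separated line (assumed layout)."""
    (Path(feature_dir) / "chains.txt").write_text(" ".join(chains) + "\n")
```

For a homo-dimer, one would call `replicate_chain(feature_dir)` followed by `write_chains(feature_dir, ["A", "B"])`, where `feature_dir` is wherever the generated features were written.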

Or you can just ignore intra-chain links altogether by inserting the following line here: https://github.com/Rappsilber-Laboratory/AlphaLink2/blob/main/unifold/dataset.py#L153

if i == j: continue

Note that you would need to run python setup.py install again afterwards to propagate the changes.

And my side question is: for the input, do I have to follow the "A", "B", "C" naming scheme, or can I be flexible on that? I tested a few different ways, but none worked very well.

At the moment, you would need to adhere to the A,B,C,... naming scheme. Uni-Fold internally maps the sequence in order to A,B,C,... The final mapping can be found in "chain_id_map.json" in the output directory.
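
To check the mapping after a run, the file can simply be loaded and printed. The sketch below makes no assumption about what is stored inside "chain_id_map.json" beyond it being valid JSON; the helper name `load_chain_map` is hypothetical, and the output directory is whatever you passed to the prediction run.

```python
import json
from pathlib import Path


def load_chain_map(output_dir):
    """Load chain_id_map.json from the given output directory and return its contents."""
    with open(Path(output_dir) / "chain_id_map.json") as handle:
        return json.load(handle)
```

Printing the result of `load_chain_map(output_dir)` shows which input sequence ended up under which chain letter.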

What went wrong in your case? What would you prefer: just using the sequence ID from the FASTA? The generic naming scheme makes things easier, especially for homo-multimeric targets.

Hope this helps, Kolja

/edit updated the code snippet to conform with the recent update.

heejongkim commented 11 months ago

Hi Kolja, thanks for the guidance. I will give it a shot and get back to you soon.

The error I ran into was more about the naming scheme in the filename. For example, my file was named Protein1_Protein2.fasta and had the entries >Protein1 and >Protein2. So AlphaLink ended up facing two choices, Protein1.fasta and Protein1_Protein2.fasta, and that might have caused the issue.

best, heejong

lhatsk commented 11 months ago

I fixed the handling of FASTA filenames with multiple underscores, which hopefully also resolves your issue.

heejongkim commented 11 months ago

Awesome. Much appreciated. Will give it a shot!

heejongkim commented 5 months ago

Hi Kolja,

I'm finally circling back to this matter.

I'm actively testing the homodimer situation right now but, in the meantime, I have run into another, more complex situation.

What if you have a 5-subunit complex consisting of a homodimer and a homotrimer that all interact with each other? I've thought about it, but I feel like I may inflate the ambiguous information so much that it hinders the inference. If you have any suggestions for a proper setup for the inference, that would be awesome.

Thank you so much.

best, heejong

lhatsk commented 5 months ago

How many links do you have per interaction? I usually just keep them; the network seems to deal with it fairly well. If the results are bad, remove the homomeric links as suggested here: https://github.com/Rappsilber-Laboratory/AlphaLink2/issues/8#issuecomment-1645414093
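
An alternative to editing the code is to drop the homomeric links from the crosslink file before running. The following is only a sketch of that pre-processing idea, not the code change from the linked comment. It assumes the five-column, space-separated link format used earlier in this thread; the function name `drop_homomeric_links`, the protein labels, and the example links are made up.

```python
def drop_homomeric_links(lines, protein_of):
    """Keep only links between chains that belong to different proteins.

    `protein_of` maps each chain ID to a label for the protein it is a copy of.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _, chain1, _, chain2, _ = fields
        if protein_of[chain1] != protein_of[chain2]:
            kept.append(line.strip())
    return kept


# Hypothetical 5-subunit complex: chains A and B form the homodimer,
# chains C, D and E form the homotrimer.
protein_of = {"A": "P1", "B": "P1", "C": "P2", "D": "P2", "E": "P2"}
links = ["5 A 15 A 0.1", "5 A 15 B 0.1", "20 A 33 C 0.05", "7 C 9 D 0.2"]
filtered = drop_homomeric_links(links, protein_of)
print(filtered)
```

Only the link between chain A and chain C survives here, since every other example link connects copies of the same protein.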