YoshitakaMo / localcolabfold

ColabFold on your local PC
MIT License
custom heterodimer multiple alignment format #259

Open fglaser opened 4 days ago

fglaser commented 4 days ago

Dear all,

I know this topic has been addressed before, but I cannot get it right. I need to merge two custom-made a3m files of a heterodimer to run colabfold_batch in multimer mode.

What I tried is an alignment that starts with

174,157 1,1

00000|sp|P38879|NACA_YEAST_1
MSAIPENANVTVLNKNEKKARELIGKLGLKQIPGIIRVTFRKKDNQIYAIEKPEVFRSAGGNYVVFGEAKVDNFTQKLAAAQQQAQASGIMPSNEDVATKSPEDIQADMQAAAEGSVNAAAEEDDEEGEVDAGDLNKDDIELVVQQTNVSKNQAIKALKAHNGDLVNAIMSLSK
00001|UniRef90_A0A540LHD7/58-200_1
--------EASKQSRSEKKSRKAMLKLGMKPVTGVSRVTIKRTKNILFFISKPDVFKSPnSDTYVIFGEAKIEDLSSQLQ---TQAAQQFRMPDMSSVMGK------------PEISAAAAGAQDEEEEEVDETGVEPRDIDLVMTQAGVSRSKAVKALKTHSGDI---------
00002|UniRef90_A0A540LHD7/250-405_1

and, after 15000 sequences of the first protein, continues with

0000|sp|Q02642|NACB1_YEAST_2
MPIDQEKLAKLQKLSANNKVGGTRRKLNKKAGSSAGANKDDTKLQSQLAKLHAVTIDNVAEANFFKDDGKVMHFNKVGVQVAAQHNTSVFYGLPQEKNLQDLFPGIISQLGPEAIQALSQLAAQMEKHEAKAPADAEKKDEAIPELVEGQTFDADVE
0001|UniRef90_A0A103Y6K5/46-195_2
-KMNVEKLMKMA---GAVRTGGKGSMRRKKKAIHKTTTTDDKRLQSTLKRIGVTAITQIEEVNIFKDE-TVIQFLNPKVQAAIGANTWVVSGSPQTKQLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGIAAaAAAQEDDDEVPELVAG--------
0002|UniRef90_A0A103Y6K5/240-395_2
-KMNVEKLMKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDE-TVIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGTAATTAQEDDYEVPELVAGETFEAAA-
0003|UniRef90_A0A4Y7KDX3/1-145_2
MKMNRDKLMKMA---GAVRTGGKGSVRRKKKAVHKTATTDDKRLQSTLKRVGVNAIPAIEEVNIFKDDS-VIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGIINQLGPDNLDNLRKLAEQFKKQgAGAAAaAAQEDDDDDVPELM----------
0004|UniRef90_A0A4Y7KDX3/145-245_2
--MNIEKLQKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDDV-VIQFQNPKVQASIAANTWVVSGSPQTKIFVQFVDHIL--------------------------------------------------
0005|UniRef90_A0A4Y7KDX3/279-341_2

But I get the following error when running

::::::::::::::
NACA_NACB1.uniref90.mgnify.bfd_small.merged.log
::::::::::::::
2024-09-19 11:27:10,318 Running colabfold 1.5.5 (4e198f5cecc6a808daa6baf7441899e5e76f7b9e)
2024-09-19 11:27:13,934 Running on GPU
2024-09-19 11:27:14,413 Found 5 citations for tools or databases
2024-09-19 11:27:14,413 Query 1/1: NACA_NACB1.uniref90.mgnify.bfd_small.merged (length 174)
2024-09-19 11:27:14,426 Could not get MSA/templates for NACA_NACB1.uniref90.mgnify.bfd_small.merged: list index out of range
Traceback (most recent call last):
  File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1472, in run
    = unserialize_msa(a3m_lines, query_sequence)
  File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1138, in unserialize_msa
    paired_msa[j] += ">" + header_no_faster_split[j] + "\n"
IndexError: list index out of range
2024-09-19 11:27:14,428 Done

Any help will be greatly appreciated,

Fabian
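The traceback above fails while indexing a split header, which suggests that some row was treated as paired although its header had fewer fields than there are chains. The following is a minimal Python sketch of a pre-flight check under assumptions that this thread does not confirm: that the first line has the form `#<lengths><TAB><cardinalities>`, that every row spans the summed chain lengths in match columns (lowercase insertions not counted), and that a row with residues in more than one chain needs one tab-separated header field per chain. The function name `check_complex_a3m` is hypothetical and is not part of ColabFold.

```python
def check_complex_a3m(text):
    """Return a list of problems found in a complex a3m string.

    Assumed layout (not confirmed by the thread): a first line
    '#len1,len2<TAB>card1,card2', then alternating '>' header and
    sequence lines, one sequence per line.
    """
    lines = [ln for ln in text.replace("\x00", "").splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        return ["first line does not start with '#'"]
    fields = lines[0][1:].split("\t")
    if len(fields) != 2:
        return ["first line is not '#<lengths><TAB><cardinalities>'"]
    try:
        lengths = [int(x) for x in fields[0].split(",")]
        cards = [int(x) for x in fields[1].split(",")]
    except ValueError:
        return ["lengths or cardinalities are not integers"]

    problems = []
    if len(lengths) != len(cards):
        problems.append("number of lengths and cardinalities differ")
    total = sum(lengths)
    body = lines[1:]
    for i in range(0, len(body) - 1, 2):
        header, seq = body[i], body[i + 1]
        if not header.startswith(">"):
            problems.append(f"expected a '>' header, got: {header[:30]}")
            continue
        # Lowercase letters are insertions and do not occupy match columns.
        match = [c for c in seq if not c.islower()]
        if len(match) != total:
            problems.append(
                f"{header[:30]}: {len(match)} match columns, expected {total}"
            )
        # Count how many chain segments contain at least one residue.
        start, chains_with_residues = 0, 0
        for length in lengths:
            segment = match[start:start + length]
            if any(c != "-" for c in segment):
                chains_with_residues += 1
            start += length
        n_header_fields = len(header[1:].split("\t"))
        if chains_with_residues > 1 and n_header_fields != len(lengths):
            problems.append(
                f"{header[:30]}: residues in {chains_with_residues} chains "
                f"but {n_header_fields} header field(s)"
            )
    return problems
```

Calling `check_complex_a3m(open("merged.a3m").read())` (with `merged.a3m` standing in for the custom file) returns an empty list when none of these assumed rules is violated. An empty result does not guarantee that colabfold_batch will accept the file.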

YoshitakaMo commented 1 day ago

Do your complex predictions run correctly when you provide a FASTA file as input? If so, the issue may lie in your input a3m file.

Preparing an a3m file for complex prediction by hand is very painful. However, if LocalColabFold runs successfully with a FASTA file as input, MSA files (in a3m format) for the pair and monomers will be generated in subdirectories within the output directory. If these files are present, colabfold_batch will skip obtaining the MSA through the MSA server.

If you remove all files in the output directory except the MSA subdirectories, rerunning colabfold_batch with the same FASTA input file will restart the structure prediction without retrieving the MSAs again.
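The cleanup step described above can be sketched in Python. This is only an illustration: the function name `clear_top_level_files` and its `dry_run` flag are hypothetical and are not part of colabfold_batch. It assumes the MSA files live in subdirectories of the output directory, as described above, and it deletes only regular files at the top level.

```python
from pathlib import Path


def clear_top_level_files(output_dir, dry_run=True):
    """Delete regular files directly inside output_dir, keeping subdirectories.

    Returns the names of the files that were (or, with dry_run=True,
    would be) removed. Subdirectories and their contents are never touched.
    """
    out = Path(output_dir)
    removed = []
    for entry in sorted(out.iterdir()):
        if entry.is_file():
            removed.append(entry.name)
            if not dry_run:
                entry.unlink()
    return removed
```

Running it first with the default `dry_run=True` lists what would be deleted. Passing `dry_run=False` then removes those files, after which colabfold_batch can be rerun with the same FASTA input.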

If you manually modify the monomer and paired MSA files left in those subdirectories, correctly formatted input a3m files will be generated automatically in the output directory (not its subdirectory!), and complex structure prediction will start using them.

Unfortunately, my server is currently under maintenance, so I am unable to provide detailed instructions now.

fglaser commented 1 day ago

Dear Yoshitaka,

Thanks a lot for your kind answer.

As you suggested, I restarted the run with the same input FASTA after deleting the main output but keeping the subdirectories, and it indeed reran without recomputing the MSA. One puzzling thing is that the total number of homologues across all a3m files in the subdirectories is different from, and lower than, the number in the main-directory a3m (which is correct in both runs).
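One way to compare the counts mentioned above is to tally the header lines in every a3m file under the output directory. The sketch below is a rough aid, not a ColabFold utility: the helper names `count_a3m_sequences` and `summarize_a3m_counts` are hypothetical, and it assumes that each sequence has exactly one header line starting with `>` (possibly preceded by a null byte used as a block separator).

```python
from pathlib import Path


def count_a3m_sequences(path):
    """Count header lines (those starting with '>') in one a3m file."""
    with open(path, errors="replace") as handle:
        # Assumption: a block separator, if present, is a leading null byte.
        return sum(1 for line in handle if line.lstrip("\x00").startswith(">"))


def summarize_a3m_counts(output_dir):
    """Map each a3m file under output_dir (recursively) to its sequence count."""
    out = Path(output_dir)
    return {
        p.relative_to(out).as_posix(): count_a3m_sequences(p)
        for p in sorted(out.rglob("*.a3m"))
    }
```

Printing the result of `summarize_a3m_counts("outdir")` (with `outdir` standing in for the actual output directory) shows the per-file totals side by side, which makes it easier to see where the main-directory a3m and the subdirectory files diverge.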

So the process you suggested works, but I honestly don't understand exactly how to manipulate the a3m files in the subdirectories so that my custom alignments are used instead of the ones created by default. There are two a3m files in env/ (uniref.a3m and bfd_mg...a3m) and a pair.a3m in pairgreedy/, and I don't understand exactly how to edit them.

I would be very grateful for more details on how to use my custom heterodimer multiple alignment in the subdirectories, whenever possible.

Thanks a lot again,

Fabian