illegal characters in a3m file which block the inference process

TencentAI4S / tfold

open source code for Tencent tFold

illegal characters in a3m file which block the inference process #5

Open lithces opened 7 months ago

lithces commented 7 months ago

Hi,

I recently tried tFold on an internal sequence, but the run failed.

I did some analysis and the issue is with the MSA step. The generated a3m file contains illegal characters. In my case it is 0x00 at the end of the a3m file.

I know this is harder to debug because I cannot share the sequence. Still, could you give me some hints on how illegal characters could appear during the MSA step?

Thanks, Ruijiang
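A quick way to confirm the report above is to scan the a3m file for 0x00 bytes and check whether they sit only at the end. The following is a minimal sketch, not part of tFold; the function names `find_nul_offsets` and `describe_nuls` are hypothetical and chosen only for illustration.

```python
def find_nul_offsets(data: bytes) -> list[int]:
    """Return the byte offsets of every 0x00 byte in `data`."""
    return [offset for offset, byte in enumerate(data) if byte == 0]


def describe_nuls(data: bytes) -> str:
    """Say whether NUL bytes are absent, only trailing, or inside the content."""
    offsets = find_nul_offsets(data)
    if not offsets:
        return "no NUL bytes"
    # Drop trailing NULs and whitespace; any NUL left over is inside the content.
    content_length = len(data.rstrip(b"\x00\r\n\t "))
    if all(offset >= content_length for offset in offsets):
        return "NUL bytes only at end of file"
    return "NUL bytes inside the file"
```

To run it on a real file, read the bytes first, for example `describe_nuls(open(a3m_path, "rb").read())`, where `a3m_path` is a placeholder for the file written by the MSA step.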

Cloud-Rambler commented 7 months ago

Thanks for your attention. Can you share the steps you used to run the code?

lithces commented 7 months ago

I ran the MSA step as described in the "antibody-antigen complex" section of the README: python projects/tfold_ag/gen_msa.py --fasta_file=examples/fasta.files/myseq.fasta --output_dir=examples/myseq

myseq.fasta contains two antibody chains and one antigen chain.

No error is reported during the MSA step. However, the subsequent predict.py run fails with a key error on a dictionary lookup. The offending key is the 0x00 character at the end of the a3m file generated by the MSA step.

wufandi commented 7 months ago

Thank you for your interest in our work. Regarding your question: myseq.fasta only needs to contain the antigen sequence. tFold-Ag does not need to search for an MSA of the antibody, which brings a significant speed advantage. Additionally, if you know the epitope of the antigen, you can also provide the epitope information, which will bring a significant performance improvement.

bzhousd commented 6 months ago

I have the same issue. It seems that mmseqs uses 0x00 to separate the results for different query sequences.
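If 0x00 does act as a separator between per-query results, as suggested above, the file can be split on that byte to see how many alignment blocks it holds. This is a minimal sketch under that assumption, and it also assumes the a3m content is plain ASCII. `split_a3m_blocks` is an illustrative name, not a function from tFold or MMseqs2.

```python
def split_a3m_blocks(raw: bytes) -> list[str]:
    """Split raw a3m bytes on 0x00 and return the non-empty text blocks."""
    blocks = []
    for chunk in raw.split(b"\x00"):
        # Decode defensively; unexpected bytes become a replacement character.
        text = chunk.decode("ascii", errors="replace").strip()
        if text:
            blocks.append(text)
    return blocks
```

More than one block would indicate that several query sequences were searched. A single block followed by a trailing 0x00 matches the original report, and in that case writing `blocks[0] + "\n"` to a new file gives an a3m without the NUL byte. Whether predict.py then accepts that file is not confirmed in this thread.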

wufandi commented 6 months ago

tFold-Ag currently cannot handle multi-chain antigen or multi-antibody inputs. The sequence used to construct the MSA must be a single-chain antigen.
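Following the advice above, one workaround is to write a FASTA file that holds only the antigen chain before running the MSA step. The sketch below uses a small hand-written FASTA parser. The helper names and the chain IDs `H`, `L`, and `A` are hypothetical, and the header format that gen_msa.py expects is not stated in this thread.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def single_chain_fasta(text: str, chain_id: str) -> str:
    """Return a one-record FASTA string for the record whose ID is `chain_id`."""
    # The record ID is taken to be the first whitespace-delimited token of the header.
    matches = [(h, s) for h, s in parse_fasta(text) if h.split()[:1] == [chain_id]]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one record with ID {chain_id!r}, found {len(matches)}")
    header, sequence = matches[0]
    return f">{header}\n{sequence}\n"
```

For example, given a three-record file with IDs `H`, `L`, and `A`, calling `single_chain_fasta(text, "A")` returns only the `A` record, which can then be saved and passed via `--fasta_file`.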