LinearFold / LinearTurboFold

An end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs

mafft benchmark in LinearTurboFold preprint #3

Closed kad-ecoli closed 3 years ago

kad-ecoli commented 3 years ago

I am afraid the MAFFT benchmark in the LinearTurboFold preprint is not completely fair. Besides its pure sequence alignment mode, MAFFT has two extensions, MAFFT-xinsi and MAFFT-qinsi, which are designed for simultaneous RNA folding and alignment. Since LinearTurboFold also performs simultaneous folding and alignment, it would be fairer to compare it with MAFFT-xinsi and MAFFT-qinsi rather than with default MAFFT without extensions.

By the way, while LinearTurboFold is indeed fast for "flat" MSAs (i.e., MSAs with long sequences but few of them), it does not scale well for "tall" MSAs (i.e., MSAs with short sequences but many of them). For the latter case, even MAFFT-xinsi can significantly outperform LinearTurboFold in terms of run time. This does not undermine the significance of LinearTurboFold; the two approaches just have different advantages for different problems.

All benchmarks in the LinearTurboFold paper evaluate MSA accuracy in terms of secondary structure prediction. It may also be of interest to check the agreement between LinearTurboFold alignments and 3D structure alignments, such as those (shamelessly) generated by RNA-align. In such a benchmark, in terms of the RMSD and TM-score of the alignment, LinearTurboFold does not outperform MAFFT-xinsi, or even pure sequence alignment methods such as clustalo and default MAFFT without extensions.

Despite the statements above, the major conclusions of the LinearTurboFold paper still hold, i.e., it is a fast and accurate program for folding multiple unaligned long sequences. Nonetheless, there is still a lot of room for further improvement.

sizhen commented 3 years ago

Thanks for your comments!

We grouped the benchmark methods used in the paper into three categories: Sankoff-style methods for joint folding and alignment, single-sequence folding, and sequence-level alignment. The third group does not consider any structural information, and MAFFT (without any extension) belongs to it. MAFFT-xinsi and MAFFT-qinsi don't belong to any group because they incorporate structural information but don't predict structures (please let me know if I am wrong). But you are right, we should compare comprehensively with all the benchmark methods in terms of both folding and alignment.

Yes, LinearTurboFold scales to long sequences but not to large sets of sequences. Its scalability with sequence length makes it feasible to fold full-length coronavirus genomes without any constraints. LinearTurboFold is relatively slow for tall MSAs, which is a limitation of the current LinearTurboFold project. LinearTurboFold is an iterative framework, and each iteration involves folding each sequence and aligning all pairs of sequences (O(k^2n)), which helps to refine the accuracy iteratively but also makes it slow. We are considering some tricks to improve the runtime from both the implementation and the algorithm perspectives. Currently, for tall MSAs, our suggestion is to sample the most diverse sequences as a representative subset and run LinearTurboFold on that subset.
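A minimal sketch of the subset suggestion above, in Python. The thread does not say how "most diverse" should be measured, so everything here is an assumption for illustration: the helper names (`kmer_set`, `jaccard_distance`, `pick_diverse`, `pairwise_alignments`) are hypothetical, diversity is approximated by Jaccard distance between k-mer sets, and the subset is chosen by greedy farthest-point selection. It is not part of LinearTurboFold.

```python
def kmer_set(seq, k=3):
    """Set of all length-k substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pick_diverse(seqs, m, k=3):
    """Greedy farthest-point selection: return indices of m diverse sequences.

    Assumption: k-mer Jaccard distance is a crude, alignment-free stand-in
    for sequence diversity. The first sequence is an arbitrary seed.
    """
    if m >= len(seqs):
        return list(range(len(seqs)))
    profiles = [kmer_set(s, k) for s in seqs]
    chosen = [0]
    # min_dist[i] = distance from sequence i to its nearest chosen sequence.
    min_dist = [jaccard_distance(profiles[0], p) for p in profiles]
    min_dist[0] = -1.0  # mark as chosen so it is never picked again
    while len(chosen) < m:
        nxt = max(range(len(seqs)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(profiles):
            min_dist[i] = min(min_dist[i], jaccard_distance(profiles[nxt], p))
        min_dist[nxt] = -1.0
    return chosen


def pairwise_alignments(num_seqs):
    """Number of sequence pairs aligned per iteration: k(k-1)/2."""
    return num_seqs * (num_seqs - 1) // 2
```

Because the number of pairs grows quadratically with the number of sequences, shrinking a set from 100 to 20 sequences cuts the pairs per iteration from `pairwise_alignments(100)` to `pairwise_alignments(20)`, which is where the subset saves time.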

3D structure alignment is a really good and noteworthy point. I saw that you posted an alignment problem in another issue, and I realized that I had made a dumb mistake that affected both alignment and folding results when the input sequences are lowercase. Could this issue have affected your evaluation? I would appreciate it if you could share more information about your benchmark dataset with me.
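For versions affected by the lowercase-input problem described above, one possible workaround is to uppercase the sequences before running the program. The sketch below is an assumption, not something from the repository: `uppercase_fasta` is a hypothetical helper that uppercases sequence lines of FASTA text and leaves header lines (those starting with `>`) untouched.

```python
def uppercase_fasta(text):
    """Return FASTA text with sequence lines uppercased, headers unchanged."""
    lines = text.splitlines()
    if not lines:
        return ""
    out = []
    for line in lines:
        # Header lines keep their original case; only residues are changed.
        out.append(line if line.startswith(">") else line.upper())
    return "\n".join(out) + "\n"
```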

Bests, Sizhen

kad-ecoli commented 3 years ago

MAFFT-xinsi and MAFFT-qinsi do predict RNA secondary structures, even though the predicted secondary structures are not printed to the output by default. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2387179/ for a more detailed description.

kad-ecoli commented 3 years ago

dataset.zip Here is the dataset I used for the benchmark, consisting of 264 PDB chains clustered into 31 structure clusters. Only the " C3'" atoms are retained in the PDB files because that is the only atom RNA-align needs.
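The C3'-only filtering described above can be sketched as follows. This is an illustration under assumptions, not the script used to build the dataset: `keep_c3_prime` is a hypothetical helper, it keeps only `ATOM` records (dropping `HETATM`, `TER`, and all other record types), and it relies on the fixed-column PDB format, where the atom name occupies columns 13-16 (string indices 12 to 15).

```python
def keep_c3_prime(pdb_text):
    """Keep only ATOM records whose atom-name field is exactly " C3'".

    Assumption: input follows the fixed-column PDB format, where the
    4-character atom name sits in columns 13-16 (Python slice 12:16).
    """
    kept = [
        line
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16] == " C3'"
    ]
    return "\n".join(kept) + "\n" if kept else ""
```

The leading space in `" C3'"` matters: it is part of the 4-character atom-name field, which is why the comment above quotes it that way.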

sizhen commented 3 years ago

MAFFT-xinsi and MAFFT-qinsi do predict RNA secondary structures, even though the predicted secondary structures are not printed to the output by default. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2387179/ for a more detailed description.

Thanks for pointing this out. We do need to compare with them.

sizhen commented 3 years ago

dataset.zip Here is the dataset I used for the benchmark, consisting of 264 PDB chains clustered into 31 structure clusters. Only the " C3'" atoms are retained in the PDB files because that is the only atom RNA-align needs.

Thanks!

kad-ecoli commented 3 years ago

The latest LinearTurboFold addresses many of these issues: on my benchmark, it now performs better than MAFFT and clustalo and is comparable to MAFFT-xinsi.