LinearFold / LinearTurboFold

An end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs

bus error #2

Closed kad-ecoli closed 3 years ago

kad-ecoli commented 3 years ago

I have a set of 32 rRNA sequences ranging from 1231 to 2227 nucleotides that I want to align with LinearTurboFold. 4v8mAA.fasta.zip Unfortunately, LinearTurboFold failed with a bus error.

 25% [=============                                     ] -                     sh: line 1: 58389 Bus error               /gpfs/ysm/project/pyle/cz378/LinearTurboFold/bin/linearturbofold /gpfs/ysm/project/pyle/cz378/all_c1.0_s1.0/mTMalign/fasta/4v8mAA.fasta /gpfs/ysm/project/pyle/cz378/all_c1.0_s1.0/mTMalign/LinearTurboFold/4v8mAA 100 100 3 0 0 0 3 1 0.

Is this just an issue of my physical memory limit, or is it an inherent issue in LinearTurboFold? Thank you.

kad-ecoli commented 3 years ago

4v6wA5.fasta.zip Here I found another set of rRNA sequences that have a similar issue.

sizhen commented 3 years ago

Hi Chengxin. Thanks for sharing your data with me. I started running LinearTurboFold on 4v8mAA.fasta on both Mac and Linux right after you sent the data. Both runs are still going (about 2.3 hours so far) and haven't hit the same error. So I think it's highly likely that the bus error is due to a memory limit.

Both the runtime and the memory usage of LinearTurboFold scale quadratically with the number of sequences k, because it calculates marginal alignment probabilities for every pair of sequences in each iteration. The match score computation and the extrinsic information calculation (which help improve alignment and folding accuracy, respectively) also take O(k^2n) time.
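To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch. It is not LinearTurboFold code; the function name `pairwise_cost` is made up for illustration, and using the longest sequence (2227 nt) as n is an assumption. It only counts sequence pairs and evaluates the k^2·n term, ignoring all constant factors.

```python
from math import comb


def pairwise_cost(k: int, n: int) -> dict:
    """Count sequence pairs and evaluate the k^2 * n term (hypothetical helper).

    k: number of sequences; n: sequence length (assumed: the longest one).
    Constant factors and per-pair alignment costs are ignored.
    """
    return {
        "pairs": comb(k, 2),   # unordered pairs aligned in each iteration
        "k2n": k * k * n,      # the O(k^2 n) term, without constants
    }


if __name__ == "__main__":
    full = pairwise_cost(32, 2227)   # the 32-sequence rRNA set from this issue
    half = pairwise_cost(16, 2227)   # half as many sequences
    print(full["pairs"], half["pairs"])   # 496 vs 120 pairs
    print(full["k2n"] / half["k2n"])      # 4.0: doubling k quadruples the term
```

So halving the number of sequences should cut the pairwise work to roughly a quarter, which may be a practical workaround on a memory-limited machine.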

The 4v8mAA.fasta run is using about 5.4 GB of memory so far, and that should not grow much by the end. I will share the results with you after the program finishes.

Mac machine: macOS Big Sur 11.0.1, 2.6 GHz 6-Core Intel Core i7, 16 GB 2400 MHz DDR4; Linux machine: CentOS 390 7.7.1908, 2.30 GHz Intel Xeon E5-2695 v3 CPU, 755 GB.

I hope this information helps. Please let me know if you have more questions.

kad-ecoli commented 3 years ago

Yes, you are right. When I tried the program on another computer with more memory, it generated results for 4v8mAA.fasta after 1089m17.684s. I will check whether I can generate results for 4v6wA5.fasta on that hardware.

kad-ecoli commented 3 years ago

Looks like I closed the issue too early. I checked the 4v8mAA alignment again and found that the LinearTurboFold alignment is obviously problematic, with the majority of nucleotides left unaligned in the final alignment. 4v8mAA.aln.zip On average, the alignment coverage is only 0.0360, while MAFFT, MAFFT-xinsi and clustalo all achieve alignment coverage of 0.93-0.96 on the same set of sequences. Therefore, something is wrong with the underlying LinearTurboFold alignment algorithm.

sizhen commented 3 years ago

Thanks for reporting this issue!

I found that I made a silly mistake: the current code could only handle uppercase input nucleotides, which is pretty dumb... I fixed it by capitalizing all the nucleotide letters when LinearTurboFold reads the input sequences. (LinearTurboFold inherits some code modules from RNAstructure, which treats lowercase nucleotides as constraints during folding and keeps them unpaired in predicted structures. Currently, LinearTurboFold doesn't support constrained folding.)
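For anyone stuck on an older build, the same normalization can be done on the input before running the program. The following is a minimal sketch of the idea, not the actual fix in the LinearTurboFold source (which is C++); the function name `read_fasta_uppercase` is hypothetical.

```python
from typing import Iterable, List, Tuple


def read_fasta_uppercase(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (name, sequence) pairs with uppercase sequences.

    Header lines are kept as-is; only sequence letters are capitalized.
    Sequence lines that appear before the first header are dropped.
    """
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())  # acgu -> ACGU
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = [">seq1", "acgu", "ACgu", ">seq2", "GGcc"]
    for rec_name, seq in read_fasta_uppercase(demo):
        print(f">{rec_name}\n{seq}")
```

Writing the records back out in this form gives an all-uppercase FASTA file that avoids the lowercase handling described above.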

Another key issue that may affect alignment accuracy is the beam size for alignment. We observed that a small beam size cannot handle long indels, especially when the range of sequence lengths is large. You can set a larger alignment beam size (for example 500: --b1 500, or infinite: --b1 -1) to capture long indels.

Bests, Sizhen

kad-ecoli commented 3 years ago

I suspect that upper vs. lower case is not the main reason for the poor performance in this specific case. I have tested LinearTurboFold on both lowercase and uppercase sequences, and it seems to generate identical results, so long as there is no mixture of upper and lower case. Nonetheless, I will pull and recompile your latest commit and see if the result improves.

sizhen commented 3 years ago

I tested on a subset (with only the first five sequences) and the alignments are different.

kad-ecoli commented 3 years ago

In that case, I probably did not test extensively enough. I will re-run the program.

kad-ecoli commented 3 years ago

You are right. After compiling your latest commit and applying the --b -1 option, the LinearTurboFold result is indeed significantly improved, and is similar to MAFFT-xinsi on my benchmark, excluding the two aforementioned cases with large rRNAs. I will run the new program on 4v8mAA.fasta and 4v6wA5.fasta and let you know the results.

In any case, I would suggest making the beam size option a little more intelligent, e.g., automatically increasing it based on sequence length differences.
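As a purely illustrative sketch of what such a heuristic might look like: everything here is an assumption, including the function name `suggest_alignment_beam`, the default beam of 100, and the 1.5 safety margin. None of it comes from LinearTurboFold itself. The idea is simply that the beam should be at least as wide as the largest length difference, since that difference is a lower bound on the total gap length some pairwise alignment must contain.

```python
from typing import Sequence


def suggest_alignment_beam(
    lengths: Sequence[int],
    default_beam: int = 100,   # assumed default, not taken from the tool
    margin: float = 1.5,       # assumed safety factor
) -> int:
    """Suggest an alignment beam size from the spread of sequence lengths.

    Hypothetical heuristic: never go below the default beam, and widen it
    when the longest and shortest sequences differ by a lot.
    """
    spread = max(lengths) - min(lengths)
    return max(default_beam, int(spread * margin))


if __name__ == "__main__":
    # Length range reported for the 32 rRNA sequences in this issue.
    print(suggest_alignment_beam([1231, 2227]))   # 1494
    # Nearly equal lengths keep the default beam.
    print(suggest_alignment_beam([100, 105]))     # 100
```

Whether a beam this wide is affordable depends on the memory and runtime trade-offs discussed earlier in the thread, so a real implementation would probably need an upper cap as well.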

kad-ecoli commented 3 years ago

The new LinearTurboFold indeed solved the issue for 4v8mAA.fasta. The final alignment has an average alignment coverage of 0.943 and a TM-scoreRNA of 0.859, both of which are good.