epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0
41 stars 4 forks source link

Base shift in copied alignment #94

Closed jgerken closed 4 years ago

jgerken commented 4 years ago

We have observed a base shift in an alignment that is copied from the reference alignment. It only affects three bases in the alignment positions 22045 to 22050 but it is still strange as SINA says the alignment is copied from template sequence. While creating a minimal working example, I was able to pin down the problem a bit more. It only happens when the reference ARB file contains a HELIX filter.

Test case 1

Result

alignments are 100% identical

Test case 2

Result

the base shift in alignment positions 22045 to 22050 happens:

UGU--- (original)
-UG--U (re-aligned)

Thus, SINA seems to apply secondary structure corrections even when copying the alignment of a reference sequence. Is this expected behaviour or a bug? If it is expected behaviour it would be nice if SINA states that it not only copied the alignment but also adopted it to match secondary structure - at the moment it is a bit odd that it says copied but actually changed the alignment.

I will also raise the issue in the next team meeting, so that our biological experts have a look. For me, it feels odd that SINA and our manually curated alignment do not agree on the correct position if the secondary structure is applied.

Details

I used SINA 1.7.1 for the tests. I have attached the files:

realignment-problem.tar.gz

Commands used

To create the reference ARB file from the FASTA file (used in test case 1):

sina -i AY196674.fasta --prealigned -o AY196674.arb
Test case 1
sina --db AY196674.arb --intype fasta --in AY196674.fasta --overhang attach --turn all --lowercase unaligned --search-filter-lowercase --insertion forbid --fs-min 40 --fs-max 40 --fs-msc 0.7 --fs-weight 1 --fs-req 1 --fs-req-full 1 --fs-full-len 1400 --gene-start 0 --gene-end 0 --fs-cover-gene 0 --match-score 2 --mismatch-score -1 --pen-gap 5 --pen-gapext 2 --fs-kmer-len 10 --fs-kmer-mm 0 --fs-kmer-no-fast  --out realigned-no-helix.fasta -vvv
Test case 1 with short sequence
sina --db AY196674.arb --intype fasta --in 47.fasta --overhang attach --turn all --lowercase unaligned --search-filter-lowercase --insertion forbid --fs-min 40 --fs-max 40 --fs-msc 0.7 --fs-weight 1 --fs-req 1 --fs-req-full 1 --fs-full-len 1400 --gene-start 0 --gene-end 0 --fs-cover-gene 0 --match-score 2 --mismatch-score -1 --pen-gap 5 --pen-gapext 2 --fs-kmer-len 10 --fs-kmer-mm 0 --fs-kmer-no-fast  --out 47-no-helix.fasta -vvv
Test case 2
sina --db ssu-seed-AY196674.arb --intype fasta --in AY196674.fasta --overhang attach --turn all --lowercase unaligned --search-filter-lowercase --insertion forbid --fs-min 40 --fs-max 40 --fs-msc 0.7 --fs-weight 1 --fs-req 1 --fs-req-full 1 --fs-full-len 1400 --gene-start 0 --gene-end 0 --fs-cover-gene 0 --match-score 2 --mismatch-score -1 --pen-gap 5 --pen-gapext 2 --fs-kmer-len 10 --fs-kmer-mm 0 --fs-kmer-no-fast  --out realigned-helix.fasta -vvv
Test case 2 with short sequence
sina --db ssu-seed-AY196674.arb --intype fasta --in 47.fasta --overhang attach --turn all --lowercase unaligned --search-filter-lowercase --insertion forbid --fs-min 40 --fs-max 40 --fs-msc 0.7 --fs-weight 1 --fs-req 1 --fs-req-full 1 --fs-full-len 1400 --gene-start 0 --gene-end 0 --fs-cover-gene 0 --match-score 2 --mismatch-score -1 --pen-gap 5 --pen-gapext 2 --fs-kmer-len 10 --fs-kmer-mm 0 --fs-kmer-no-fast  --out 47-helix.fasta -vvv
epruesse commented 4 years ago

This is really odd. I will have to look into it. The result changes when you just add/remove the HELIX filter? That would mean there is memory corruption going on. The HELIX filter has no impact on the alignment itself, whether copied or not.

jgerken commented 4 years ago

I found the issue. I exported the FASTA file accidentally from the reference dataset and not from the seed - that's why I should not have to ARB instances open on the same computer at the same time ;) In the reference dataset, the alignment is different from that in the seed due to an old SINA bug related to insertions (we discussed this bug some years ago via mail). This bug is solved with 1.6+ as far as I can see. Prior to that the alignments were some kind of erratic after the first sequence containing introns. But with the current SINA versions I get stable alignments of the complete reference dataset when repeating the alignments multiple time (wasn't the case in earlier versions). So, basically the bug I observed was not the bug that I described. The bug that I observed does not longer exist and the bug that I described never existed. I am sorry!

P.S.: We will use the most current version of SINA for our next release. Previously, the upgrade was not made because the argument that changes in the alignment would have to be evaluated. But having the proof that sequences from the seed have a different alignment in the reference dataset simply kills the argument. That simply can't be. And I am also very impressed by the performance improvements due to the replacement of the PT server with your internal search. The index file is also much smaller than the PT server file. So it is a win-win-win scenario for us :)

epruesse commented 4 years ago

:) Happy to hear that.

I have integration tests in tests/accuracy.test and tests/accuracy_kmer.test that ensure that the accuracy stays good with both search implementations. There is a little bit of wiggle, naturally, but I was very careful not to compromise alignment quality.

Not sure when the bug you mentioned was fixed. Since 1.5 I've added a ton of tests, both integration and unit, and fixed plenty of errors using a variety of tools to locate them. Very possible it was one of the memory corruption problems.