Closed jgerken closed 4 years ago
This is really odd. I will have to look into it. The result changes when you just add/remove the HELIX filter? That would mean there is memory corruption going on. The HELIX filter has no impact on the alignment itself, whether copied or not.
I found the issue. I exported the FASTA file accidentally from the reference dataset and not from the seed - that's why I should not have to ARB instances open on the same computer at the same time ;) In the reference dataset, the alignment is different from that in the seed due to an old SINA bug related to insertions (we discussed this bug some years ago via mail). This bug is solved with 1.6+ as far as I can see. Prior to that the alignments were some kind of erratic after the first sequence containing introns. But with the current SINA versions I get stable alignments of the complete reference dataset when repeating the alignments multiple time (wasn't the case in earlier versions). So, basically the bug I observed was not the bug that I described. The bug that I observed does not longer exist and the bug that I described never existed. I am sorry!
P.S.: We will use the most current version of SINA for our next release. Previously, the upgrade was not made because the argument that changes in the alignment would have to be evaluated. But having the proof that sequences from the seed have a different alignment in the reference dataset simply kills the argument. That simply can't be. And I am also very impressed by the performance improvements due to the replacement of the PT server with your internal search. The index file is also much smaller than the PT server file. So it is a win-win-win scenario for us :)
:) Happy to hear that.
I have integration tests in tests/accuracy.test
and tests/accuracy_kmer.test
that ensure that the accuracy stays good with both search implementations. There is a little bit of wiggle, naturally, but I was very careful not to compromise alignment quality.
Not sure when the bug you mentioned was fixed. Since 1.5 I've added a ton of tests, both integration and unit, and fixed plenty of errors using a variety of tools to locate them. Very possible it was one of the memory corruption problems.
We have observed a base shift in an alignment that is copied from the reference alignment. It only affects three bases in the alignment positions 22045 to 22050 but it is still strange as SINA says the alignment is copied from template sequence. While creating a minimal working example, I was able to pin down the problem a bit more. It only happens when the reference ARB file contains a HELIX filter.
Test case 1
Result
alignments are 100% identical
Test case 2
Result
the base shift in alignment positions 22045 to 22050 happens:
Thus, SINA seems to apply secondary structure corrections even when copying the alignment of a reference sequence. Is this expected behaviour or a bug? If it is expected behaviour it would be nice if SINA states that it not only copied the alignment but also adopted it to match secondary structure - at the moment it is a bit odd that it says copied but actually changed the alignment.
I will also raise the issue in the next team meeting, so that our biological experts have a look. For me, it feels odd that SINA and our manually curated alignment do not agree on the correct position if the secondary structure is applied.
Details
I used SINA 1.7.1 for the tests. I have attached the files:
AY196674.fasta
original alignment of the reference sequence as exported from the seedAY196674.arb
ARB file created in test case 1ssu-seed-AY196674.arb
ARB file created in test case 247.fasta
partial sequence that is 100% identical to a part of the reference sequencerealignment-problem.tar.gz
Commands used
To create the reference ARB file from the FASTA file (used in test case 1):
Test case 1
Test case 1 with short sequence
Test case 2
Test case 2 with short sequence