castcollab / tesserae2

Tesserae2: Fast recombination-aware global and local alignment.
Other
3 stars 0 forks source link

Strange alignments with when using multiple targets, other anomalies #31

Open jonn-smith opened 4 years ago

jonn-smith commented 4 years ago

@winni2k @karljohanw I've run into several strange cases with the CAR-T data I've been running and they're now blocking me from continuing my analysis. I think the problems are all related and the issue involves either how the graph is being constructed or how the alignments are being produced. I haven't dived into the graph code to debug it myself.

I also found that I needed to make a couple of changes to the (that I think are side-effects of what I've found here) in order to properly report this bug. I have put them (and other updates) in my branch jts_adding_tests (my hope was to add these as unit tests to prevent this from happening in the future once it's fixed, hence the name). All the test data referenced below can be found in that branch.

Here are my findings:

  1. I've noticed that for some input sequences the results contain adjacent alignments that are from the same target (alignment1.end+1 = alignment2.start). Logically these two adjacent alignments should never happen and it should be a single alignment that spans from alignment1.start to alignment2.end. In my tool's code I detect and merge them in this case, but I don't think it should actually happen.

For example:

TesseraeAlignmentResult(seq_name='constant_region_4_linker_3', alignment_string='                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          tagagtcagggagggtcgagccattgaagggcacaaaaggaaactca', target_start_index=0, target_end_index=46)
TesseraeAlignmentResult(seq_name='constant_region_4_linker_3', alignment_string='                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         CCCTAACTGTAAAGTAATTGTGTGTTTTGAGACTATAAGTATCCCTTGGAGAACCACCTTGTTGG------------------', target_start_index=47, target_end_index=111)

Produced by:

tesserae -v align -r tests/resources/cart_test_read.fasta -s tests/resources/constant_regions.fasta
  1. Some alignments that are produced have no actual bases in them. Specifically I am seeing things like this:
TesseraeAlignmentResult(seq_name='Degron16_c60_as_read', alignment_string='TTCAATGTCTTAATGGTTCATAAGCGAAGCCATACTGGTGAACGCCCACTGCAGTGCGAGATCTGCGGCTACCAGTGCAGAAGAAAGGATATAATGAGCTACCACGTGAGAAGCCACACAGGGGAAAAAACCTTTTAAGTGTCACCTCTGCAACTATGCATGCCAAAGAAGAGATGCGCTC', target_start_index=0, target_end_index=180)
2020-04-14 13:42:46,864 tesserae.command.align DEBUG        TesseraeAlignmentResult(seq_name='PAC_BIO_LEFT_ADAPTER', alignment_string='T', target_start_index=1, target_end_index=1)
2020-04-14 13:42:46,864 tesserae.command.align DEBUG        TesseraeAlignmentResult(seq_name='Degron17_c60', alignment_string=' ', target_start_index=1, target_end_index=1)
2020-04-14 13:42:46,864 tesserae.command.align DEBUG        TesseraeAlignmentResult(seq_name='Degron17_c60', alignment_string=' ---', target_start_index=1, target_end_index=1)
...

Produced by:

tesserae -v align -r tests/resources/cart_test_read.fasta -s tests/resources/constant_regions.fasta

Alignment 1 is the query sequence. Alignments 3 and 4 have no actual content and their start and end coordinates are equal (the equal coords indicates to me that the alignment should have exactly 1 base in it as seen in alignment 2). There seem to be other strange issues with the start and end coordinates that are similar to this (e.g. end position = 0 and start position = 1) which seem like they should never happen.

  1. I have a sub-sequence of a read that I'm using as a query and I'm aligning several targets to it. I know that one of the targets matches almost exactly to this region (and I've verified with the reference C++ implementation).

When I align this sub-sequence and the matching target as a pair (i.e. the query fasta file has only one sequence in it and the target fasta file has only one sequence in it), I get the correct answer out of Tesserae (aligned for ease of comparison):

TesseraeAlignmentResult(seq_name='Degron16_c60_as_read', alignment_string='TTCAATGTCTTAATGGTTCATAAGCGAAGCCATACTGGTGAACGCCCACTGCAGTGCGAGATCTGCGGCTACCAGTGCAGAAGAAAGGATATAATGAGCTACCACGTGAGAAGCCACACAGGGGAAAAAACCTTTTAAGTGTCACCTCTGCAACTATGCATGCCAAAGAAGAGATGCGCTC', target_start_index=0, target_end_index=180)
TesseraeAlignmentResult(seq_name='Degron16_c60',         alignment_string='TTCAATGTCTTAATGGTTCATAAGCGAAGCCATACTGGTGAACGCCCACTGCAGTGCGAGATCTGCGGCTACCAGTGCAGAAGAAAGGATAGAATGAGCTACCACGTGAGAAGCCACACAGGGG-AAAAACCTTTTAAGTGTCACCTCTGCAACTATGCATGCCAAAGAAGAGATGCGCTC', target_start_index=0, target_end_index=179)

Produced by:

tesserae -v align -r tests/resources/degron_test_read.fasta -s tests/resources/degron_test_target.fasta

However, when I do this alignment with a target fasta file that contains all possible sequences for this, I get a very strange result (aligned for ease of comparison):

TesseraeAlignmentResult(seq_name='Degron16_c60_as_read', alignment_string='TTCAATGTCTTAATGGTTCATAAGCGAAGCCATACTGGTGAACGCCCACTGCAGTGCGAGATCTGCGGCTACCAGTGCAGAAGAAAGGATATAATGAGCTACCACGTGAGAAGCCACACAGGGGAAAAAACCTTTTAAGTGTCACCTCTGCAACTATGCATGCCAAAGAAGAGATGCGCTC', target_start_index=0, target_end_index=180)
TesseraeAlignmentResult(seq_name='PAC_BIO_LEFT_ADAPTER', alignment_string='T', target_start_index=1, target_end_index=1)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string=' ', target_start_index=1, target_end_index=1)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string=' ---', target_start_index=1, target_end_index=1)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='    -----', target_start_index=1, target_end_index=1)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='         C', target_start_index=1, target_end_index=8)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='          taatggttcat---------', target_start_index=10, target_end_index=20)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                              G', target_start_index=28, target_end_index=28)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                               c', target_start_index=28, target_end_index=30)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                TACTGGTGAAC---------', target_start_index=32, target_end_index=42)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                    g', target_start_index=51, target_end_index=51)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                     GTGCAAGCTGT----------', target_start_index=51, target_end_index=63)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                          g', target_start_index=73, target_end_index=73)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                           TTCAGATGGAG---------', target_start_index=73, target_end_index=85)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                               a', target_start_index=94, target_end_index=94)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                CAGTACCACCT----------', target_start_index=94, target_end_index=106)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                     c', target_start_index=116, target_end_index=116)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                      CAGGGGAAAAA----------', target_start_index=116, target_end_index=128)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                           g', target_start_index=137, target_end_index=137)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                            GTCACCTCTGC----------', target_start_index=137, target_end_index=149)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                                                 t', target_start_index=159, target_end_index=159)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                                                  CCAAAGAAGAG------', target_start_index=159, target_end_index=171)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                                                                   c', target_start_index=177, target_end_index=177)
TesseraeAlignmentResult(seq_name='Degron17_c60',         alignment_string='                                                                                                                                                                                    C', target_start_index=177, target_end_index=179)

Produced by:

tesserae -v align -r ~tests/resources/degron_test_read.fasta -s tests/resources/minimum_target_list.fasta

It looks like Tesserae picks the correct sequence, but it breaks the alignment up into chunks that are almost contiguous. This starkly differs from the reference implementation, which produces the correct result above.

This strange breaking of alignments seems similar to findings 1 and 2 which is why I think they're related.

I went over this case with Kiran and it seems like there is probably a bug lurking in here somewhere - maybe in the graph code itself, maybe just in the code to render the graph.

winni2k commented 4 years ago

I've had a quick look at jts_adding_tests. Thanks for putting the test cases on a branch and making them available!

The test case sequences are complex in that they consist of many target sequences and long query sequences. I think the first step to debugging this issue will be to shrink these test cases down to minimal reproducible examples (MREs). @jonn-smith, have you already shrunk these examples to a minimum or is that work still left to do?

A separate issue that I think needs pointing out right here and that I hope will avoid a lot of extra work down the line is that unit tests are meant to run fast (that is, really fast). They are meant to be small synthetic examples that mostly test corner cases. And so, the corollary and strong convention in both ruby and python TDD is that unit tests don't open files.

Anything that needs to open a file would likely go into acceptance tests.

jonn-smith commented 4 years ago

The issue with trying to shrink the problem further is that I don't have a good intuition about which sequences to remove (since the alignment shouldn't depend on the number of sequences). I will try a couple of things but I don't have time to spend a while trying to optimize these test cases.

I'm happy to move them to acceptance tests.

jonn-smith commented 4 years ago

I spent some additional time trying to minimize the set that will produce this error. I have reduced the number of targets involved by half.

I actually spent a great deal of time yesterday preparing this bug report, so I don't have any more time to spare on it.

In general I try to reduce my tests to the minimum requirements - but when I don't have any intuition about the problem it's not useful to stab in the dark and try to minimize the test data involved (a SME will be able to do this very quickly when the issue is found).

karljohanw commented 4 years ago

I have done some experiments on this, and this issue is reproducible even with the tesserae-implementation as of commit 5e76402 in the original cortexpy-repo (the commit where tesserae was first introduced in python). So this is something that hasn't to do with the improvements of memory and computational speed.

However, when comparing this to the reference MosaicAligner mentioned above, it turns out that the following code, in tesserae.py at line 469-488 (function __render) has no corresponding code in MosaicAligner:

            if (
                i > cp
                and self.maxpath_copy[i] == self.maxpath_copy[i - 1]
                and np.abs(self.maxpath_pos[i] - self.maxpath_pos[i - 1]) > 1
                or self.maxpath_pos[i] == last_known_pos + 1
            ):
                self.path.append(
                    TesseraeAlignmentResult(
                        current_track, "".join(sb), pos_start, pos_end
                    )
                )
                uppercase = not uppercase
                last_known_pos = self.maxpath_pos[i - 1]

                if pos_start != pos_end:
                    pos_start = self.maxpath_pos[i] - 1
                    pos_end = self.maxpath_pos[i] - 1

                current_track = seqs[self.maxpath_copy[i] + 1][0]
                sb = [repeat(" ", i - cp)]

When commenting this out, you'll at least get rid of the adjacent alignment issue. So the question is why this code was added in the first place.

jonn-smith commented 4 years ago

That makes sense. I didn't necessarily think it was the speed improvements - it seems like something strange is going on somewhere deep in the code.

Good find! I can try pulling it out and seeing what difference it makes, though I still don't have any intuitions about that part of the codebase (yet).

winni2k commented 4 years ago

Thanks for looking into this @karljohanw!

I appreciate all the work you are putting into this @jonn-smith. Thank you for sharing more about your work process. It helps to know that we are on the same page.

winni2k commented 4 years ago

I think this probably goes back to the initial implementation by @kvg. Do you happen to remember what is going on here?

kvg commented 4 years ago

I agree. I'm not sure what's happening yet; I'll step through the code over the weekend and see where the bug arose.

winni2k commented 4 years ago

Maybe just delete it and if the tests still pass, call it a day 😛

karljohanw commented 4 years ago

Interestingly enough to mention, if one moves the sequence "Degron16_c60" to the top of the file master_target_list.fasta, then tesserae produces the same alignment as given by MosaicAligner, and that's without changing a single line of code. (Just like when that sequence was the only target.)

Furthermore, both the correct and wrong output of this run of tesserae yields the same Mllk value (-56.46582694079763), i.e. with and without "Degron16_c60" in the top. BUT when looking at the max_r-values in the end of __recurrence, it seems there is one and only one largest value, and not several, which would have made sense in this case...

jonn-smith commented 4 years ago

Huh. That's very strange. So maybe it's an issue in how the graph is seeded?

jonn-smith commented 4 years ago

I've been fiddling with the C code a little and I think this might have to do with the conditions on which the sequence is broken into a new output entry.

Specifically I found that I needed to have it break when the position in the target decreased from one alignment base to the next. Forcing the alignment to break at this point and produce another one created very similar results to the python code (though not exactly the same).

For reference, in mosaic_fb.c:1709 or so:

Changing:

if (i>cp && my_matrices->maxpath_copy[i] != my_matrices->maxpath_copy[i-1]) {

to

if ( (i>cp && my_matrices->maxpath_copy[i] != my_matrices->maxpath_copy[i-1]) || ( my_matrices->maxpath_pos[i] < my_matrices->maxpath_pos[i-1]) ) {

I updated my fork to have a command-line parameter to enable this functionality. I'm going to submit a PR for this later.