Clarification on the definition of spanning for reads that terminate in the middle of an exon for `exclude_counts`

skagawa2 commented 1 month ago

Hello! Thank you for creating this software!

We were wondering if you could clarify for us the definition of spanning, but not including a feature for the purposes of the exclude_counts column in the exon_counts.tsv file. We've noticed that reads that terminate in the middle of an exon get counted as spanning the exon for the purposes of this column. Our case can be simulated by the below test case (adapted from tests/test_long_read_profile.py)

class TestOverlappingFeatureProfile:
    @pytest.mark.parametrize("known_features, gene_region, read_exons, polya_pos, polyt_pos, delta, expected_gene, expected_read",
                             [([(20, 60), (90, 110), (180, 200)], (20, 200), [(20, 60), (90, 110), (180, 200)],
                               -1, -1, 3, [1, 1, 1], [1, 1, 1]),
                              ([(20, 60), (90, 110), (180, 200)], (20, 200), [(21, 60), (91, 111), (178, 200)],
                               -1, -1, 3, [1, 1, 1], [1, 1, 1]),
                              ([(20, 60), (90, 110), (180, 200)], (20, 200), [(20, 60), (99, 110), (180, 200)],
                               -1, -1, 3, [1, -1, 1], [1, -1, 1]),
                              ([(20, 60), (90, 110), (180, 200)], (20, 200), [(20, 60), (90, 110)],
                               -1, -1, 3, [1, 1, 0], [1, 1]),
                              ([(20, 60), (90, 110), (180, 200)], (20, 200), [(90, 110), (180, 200)],
                                -1, -1, 3, [0, 1, 1], [1, 1]),
+                               ([(20, 60), (90, 110), (180, 200)], (20, 200), [(20, 60), (90, 100)],
+                                -1, -1, 3, [1, 0, 0], [1, -1]),
                              ])
    def test_exon_profile(self, known_features, gene_region, read_exons, polya_pos, polyt_pos, delta, expected_gene, expected_read):
        exon_profile_constructor = \
            OverlappingFeaturesProfileConstructor(known_features, gene_region,
                                                  comparator=partial(equal_ranges, delta=delta),
                                                  delta=delta)
        profile = exon_profile_constructor.construct_exon_profile(read_exons, polya_pos, polyt_pos)
        assert profile.gene_profile == expected_gene
        assert profile.read_profile == expected_read

where a read ends in the middle of an exon, suggesting an APA site (as found by your software). As the File formats page indicates, include_counts - number of reads that include this feature; exclude_counts - number of reads that span, but do not include this feature;, but in this case, due to this assignment to -1:

https://github.com/ablab/IsoQuant/blob/fd6407d6f79e9351daf845dc3d575fc4292e94a6/src/long_read_profiles.py#L141-L143

this read excludes the exon (90, 110) (in this example) even though the read doesn't span this feature fully (just overlaps). We were wondering if this was the intended definition for spanning an exon, as we were attempting to calculate PSI for our samples.

andrewprzh commented 3 weeks ago

Dear @skagawa2

Thank you so much for the investigation. I think this the wrong behavior and such partially covered exons should not be treated as "spanned" in general.

If an alternative poly-A site follows the exon of question, it can be possibly treated as spanned since this entire exon is not part of the read. In general, if there is no polyA and a reads is, for example, truncated, it is not possible to state whether this exon is really part of the read or not, and thus 0 should be in the profile.

In fact, in OverlappingFeaturesProfileConstructor, -1 should be set only when absence_condition is true, which in case of exon_profile_constructor checks if a known exon is entirely inside the read. It feels like I forgot to remove intron_profile[gene_pos] = -1 on line 142 during refactoring at some point.

I'll fix that and will make a new release.

Best Andrey

andrewprzh commented 3 weeks ago

@skagawa2 fixed now in 3.5.2!

skagawa2 commented 3 weeks ago

Awesome! Thank you for the quick action!

ablab / IsoQuant

Clarification on the definition of spanning for reads that terminate in the middle of an exon for `exclude_counts` #229