biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

Possible issue with overhang criteria #32

Closed: lweasel closed 5 years ago

lweasel commented 8 years ago

Just entering this so I remember it (though it may not be an issue with the filtering as it currently stands on the master branch).

I saw a case, when filtering C57BL/6J versus Mus musculus castaneus (CAST) data, where a perfectly matching C57BL/6J read was incorrectly assigned to CAST. The correct C57BL/6J mapping had an overhang of 3 bases across an exon boundary. In CAST, one of these three bases is mutated, and it just so happens that the 3 intronic bases adjacent to where the rest of the read maps are identical to the read's 3 overhanging bases. Our filtering treats the C57BL/6J mapping as invalid because its overhang is < 5 bases, so the read is assigned to CAST, where it has a perfect (though actually incorrect) mapping. It might be better to call such a read ambiguous, but that would need evaluating in terms of the trade-off between changes in true positives (TPs) and false positives (FPs).
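
To make the coincidence concrete, here is a toy sketch with invented sequences. The bases, the 10-base exon segment and the helper name `mismatches` are all made up for illustration; none of them come from the actual read or from Sargasso's code.

```python
def mismatches(read: str, reference: str) -> int:
    """Count mismatching positions between a read and an equal-length reference."""
    assert len(read) == len(reference)
    return sum(a != b for a, b in zip(read, reference))


# Invented sequences, for illustration only.
exon1_end = "ACGTACGTAC"    # where the bulk of the read maps (assumed identical in both strains)
b6_exon2_start = "GCT"      # C57BL/6J: first 3 bases of the next exon
cast_exon2_start = "GAT"    # CAST: one of those 3 bases is mutated
cast_intron_start = "GCT"   # CAST: intron bases right after exon 1 happen to equal the overhang

# A genuine C57BL/6J read with a 3-base overhang across the exon boundary.
read = exon1_end + b6_exon2_start

b6_spliced = mismatches(read, exon1_end + b6_exon2_start)        # correct mapping
cast_spliced = mismatches(read, exon1_end + cast_exon2_start)    # hits the mutated base
cast_unspliced = mismatches(read, exon1_end + cast_intron_start) # runs on into the intron

print(b6_spliced, cast_spliced, cast_unspliced)  # prints: 0 1 0
```

The unspliced CAST mapping is perfect and has no short overhang to fail on, whereas the correct C57BL/6J mapping is perfect but has an overhang of only 3 bases.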

s-heron commented 8 years ago

This situation sounds like it relies on a fair bit of knowledge that we simply won't have(?) at the filtering stage, and that could only be gleaned from the kind of further investigation you did yourself. What criteria could we use in the general case to identify such a situation at the filtering stage?

lweasel commented 8 years ago

I guess what I'm proposing is potentially an option (or, in the first instance, an investigation into one) whereby a read that fails the overhang criteria for a species is not just rejected for that species, but is marked as ambiguous and not assigned to either species.

I suspect this would lead to a very small number of incorrectly assigned reads being thrown away instead, at the cost of a much larger number of previously correctly assigned reads also being thrown away. So I don't think it's of any great importance to look at this now, but it may be worth keeping the issue around in case we can investigate in the future.
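
A minimal sketch of what such an option might look like follows. This is not Sargasso's actual filtering code: the `Mapping` class, the `assign` function and the `overhang_failure_is_ambiguous` flag are hypothetical names, and the real filter considers more criteria than mismatches and overhang. Only the 5-base threshold is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MIN_OVERHANG = 5  # threshold quoted in this issue


@dataclass
class Mapping:
    """Hypothetical summary of a read's best mapping to one species."""
    mismatches: int
    min_overhang: Optional[int] = None  # None means the mapping spans no junction


def fails_overhang(mapping: Mapping) -> bool:
    return mapping.min_overhang is not None and mapping.min_overhang < MIN_OVERHANG


def assign(mappings: Dict[str, Mapping],
           overhang_failure_is_ambiguous: bool = False) -> str:
    """Return a species name, "ambiguous" or "rejected" for one read."""
    failed = [s for s, m in mappings.items() if fails_overhang(m)]
    # Proposed option: any overhang failure makes the whole read ambiguous.
    if failed and overhang_failure_is_ambiguous:
        return "ambiguous"
    # Behaviour as described in this issue: drop only the failing species.
    valid = {s: m for s, m in mappings.items() if s not in failed}
    if not valid:
        return "rejected"
    best = min(m.mismatches for m in valid.values())
    winners = [s for s, m in valid.items() if m.mismatches == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


# The read from this issue: perfect in both, 3-base overhang in C57BL/6J only.
read_mappings = {
    "C57BL/6J": Mapping(mismatches=0, min_overhang=3),
    "CAST": Mapping(mismatches=0),
}
print(assign(read_mappings))
print(assign(read_mappings, overhang_failure_is_ambiguous=True))
```

With the flag off, the read goes to CAST as described above; with it on, the read is marked ambiguous. Evaluating the trade-off would then mean running both settings on data with known origin and comparing the TP and FP counts.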