biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

Add option(s) for "conservative" filtering #33

Closed by lweasel 8 years ago

lweasel commented 8 years ago

I think it might make sense to split this into two options:

a) Option to check for soft or hard clipping, insertions, deletions etc.

b) Option to reject any read which multimaps in either species.
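
As a rough sketch of what the two options might check (illustrative only, not Sargasso's actual code): the function names are hypothetical, and the per-species hit counts are assumed to be available already, e.g. from the SAM `NH` tag. Tolerating `N` (spliced alignments) under option (a) is also an assumption.

```python
import re

# CIGAR operations rejected under option (a) in this sketch:
# S = soft clip, H = hard clip, I = insertion, D = deletion.
# N (skipped region, i.e. a splice junction) is deliberately tolerated.
REJECTED_OPS = frozenset("SHID")

# One CIGAR element: a length followed by a single operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_ops(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '5S95M' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def passes_cigar_check(cigar: str) -> bool:
    """Option (a): reject alignments with clipping, insertions or deletions."""
    return all(op not in REJECTED_OPS for _, op in cigar_ops(cigar))


def passes_multimap_check(hits_species_a: int, hits_species_b: int) -> bool:
    """Option (b): reject a read that multimaps in either species."""
    return hits_species_a <= 1 and hits_species_b <= 1
```

Keeping the two checks as separate predicates mirrors the proposal to expose them as two independent options.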

I know we got rid of (b) because it didn't make sense for, e.g., the mouse vs rat comparisons. But I'm finding that with draft genomes (e.g. the mouse strains), allowing these multimaps can lead to a lot of misassignments. For example, suppose a genuine, error-free Mus musculus castaneus read comes from a region that is missing from the castaneus genome assembly. The read still maps to castaneus, but only at lower quality, at several other places in that genome. Against the mouse reference genome it maps perfectly at one position, and at lower quality at several other locations. With the STAR settings we have, a single perfect mapping will be reported for the mouse reference, and multiple poorer mappings will be reported for castaneus. Thus, without the restriction on multimaps, the read will be assigned to the mouse reference genome, when it should really be rejected as ambiguous.
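
To make the castaneus example concrete, here is a toy model of the assignment decision. It is a simplification under stated assumptions, not how Sargasso actually scores reads: each species is reduced to a list of mismatch counts (one per reported alignment), the read goes to whichever species has the better best hit, and `SpeciesHits`, `assign` and the `reject_multimaps` flag are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SpeciesHits:
    species: str
    mismatches: list[int]  # one entry per alignment reported by the mapper


def assign(a: SpeciesHits, b: SpeciesHits, reject_multimaps: bool) -> str:
    """Assign a read to one species, or call it 'ambiguous'."""
    # Option (b): any multimapping in either species rejects the read.
    if reject_multimaps and (len(a.mismatches) > 1 or len(b.mismatches) > 1):
        return "ambiguous"
    # Otherwise compare the best (lowest-mismatch) hit in each species;
    # a species with no hits at all is treated as infinitely bad.
    best_a = min(a.mismatches, default=float("inf"))
    best_b = min(b.mismatches, default=float("inf"))
    if best_a < best_b:
        return a.species
    if best_b < best_a:
        return b.species
    return "ambiguous"


# The scenario above: one perfect hit reported for the mouse reference,
# several poorer hits reported for castaneus.
reference = SpeciesHits("mouse_reference", [0])
castaneus = SpeciesHits("castaneus", [3, 4, 4])

print(assign(reference, castaneus, reject_multimaps=False))  # mouse_reference
print(assign(reference, castaneus, reject_multimaps=True))   # ambiguous
```

Without the multimap restriction the read is misassigned to the reference; with it, the read is rejected as ambiguous.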