biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

Enhance filtering summary statistics #48

Closed by lweasel 5 years ago

lweasel commented 6 years ago

Currently in the filtering summary statistics CSV file we record, per-species, the numbers of hits and reads which:

1) are assigned to the species after filtering
2) are rejected from the species, either because (a) they map better to the other species, or (b) they do not map sufficiently well to this species according to some criteria
3) map equally well to both species

I think it would be useful to record (2a) and (2b) separately.
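As a rough illustration of what splitting (2a) and (2b) could look like in the summary CSV, here is a minimal sketch. The outcome labels, the column names and the `write_summary` helper are all made up for this example and are not taken from Sargasso's code; it also counts reads only, whereas the real file records both hits and reads.

```python
import csv
import io
from collections import Counter

# Hypothetical outcome labels; these are not Sargasso's actual names.
ASSIGNED = "assigned"
REJECTED_BETTER_ELSEWHERE = "rejected-better-elsewhere"  # case (2a)
REJECTED_POOR_MAPPING = "rejected-poor-mapping"          # case (2b)
AMBIGUOUS = "ambiguous"
OUTCOMES = [ASSIGNED, REJECTED_BETTER_ELSEWHERE, REJECTED_POOR_MAPPING, AMBIGUOUS]


def write_summary(counts_by_species, out):
    """Write one CSV row per species, with one column per outcome."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["species"] + OUTCOMES)
    for species in sorted(counts_by_species):
        counts = counts_by_species[species]
        # A Counter returns 0 for outcomes that were never seen.
        writer.writerow([species] + [counts[outcome] for outcome in OUTCOMES])


# Toy counts for two species, standing in for real filtering results.
counts = {"mouse": Counter(), "rat": Counter()}
counts["mouse"][ASSIGNED] += 2
counts["mouse"][REJECTED_POOR_MAPPING] += 1
counts["rat"][REJECTED_BETTER_ELSEWHERE] += 2
counts["rat"][AMBIGUOUS] += 1

buffer = io.StringIO()
write_summary(counts, buffer)
summary_csv = buffer.getvalue()
```

The only change from a single "rejected" column is that the two rejection causes get their own columns, so the existing total can still be recovered by adding them together.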

lweasel commented 6 years ago

Actually, it would be really useful to be able to get a much finer-grained picture of why reads are being rejected (e.g. is it due to mismatches, the length criterion, CIGAR strings, etc.). It just occurred to me that one way we could do this, and potentially also solve the problem of how to debug what is happening to individual reads, would be to have a "verbose" mode (or even several levels of "verbose" mode) in which we use custom versions of the SeparationStats class.

Every read that is accepted, rejected, or marked as ambiguous passes through this class. At the moment the class just records (per-species) the totals for "accept", "reject" and "ambiguous". But there is no reason (I think) why a different version of the class couldn't record how many reads were rejected due to failing mismatch, failing minmatch, the CIGAR string, etc. Another version could print out detailed debug information for each read. And because we'd only be using a different version of the class in "verbose" mode, there wouldn't be any slow-down in the normal operation of Sargasso.
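A minimal sketch of the idea, assuming nothing about the real SeparationStats interface: the class names, method names, reason strings and the `make_stats` helper below are all hypothetical. The point is only that the filtering code always calls the same methods, and the verbosity level decides which class receives those calls.

```python
from collections import Counter


class BasicStats:
    """Records only per-species totals (hypothetical stand-in for SeparationStats)."""

    def __init__(self):
        self.totals = Counter()

    def accept(self, species, read_id):
        self.totals[(species, "accept")] += 1

    def reject(self, species, read_id, reason):
        # The reason is ignored here, so normal mode does no extra bookkeeping.
        self.totals[(species, "reject")] += 1

    def ambiguous(self, species, read_id):
        self.totals[(species, "ambiguous")] += 1


class VerboseStats(BasicStats):
    """Additionally counts rejections per reason (e.g. mismatches, CIGAR)."""

    def __init__(self):
        super().__init__()
        self.reject_reasons = Counter()

    def reject(self, species, read_id, reason):
        super().reject(species, read_id, reason)
        self.reject_reasons[(species, reason)] += 1


class DebugStats(VerboseStats):
    """Additionally emits one message per rejected read."""

    def __init__(self, log=print):
        super().__init__()
        self.log = log

    def reject(self, species, read_id, reason):
        super().reject(species, read_id, reason)
        self.log(f"{read_id}: rejected from {species} ({reason})")


def make_stats(verbosity=0):
    """Pick the stats class once, up front, based on the verbosity level."""
    classes = (BasicStats, VerboseStats, DebugStats)
    return classes[min(verbosity, len(classes) - 1)]()


# The calling code is identical whichever class was chosen.
stats = make_stats(verbosity=1)
stats.accept("mouse", "read1")
stats.reject("mouse", "read2", "mismatches")
stats.reject("mouse", "read3", "cigar")
stats.reject("mouse", "read4", "mismatches")
```

With `verbosity=0` the extra counters and logging are never created, which is what keeps the default path as cheap as it is today.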

What do you think @hxin ? We can chat when you're back in.

hxin commented 6 years ago

I think this is a good approach to explore. I had not noticed this class before. I just had a look at it, and it seems like a good place to keep track of this information, which we also need for debugging purposes. @lweasel

lweasel commented 6 years ago

I had completely forgotten about this class too...

lweasel commented 5 years ago

Essentially covered by additions to debug mode.