Update Summarizer in Production

ababaian commented 4 years ago

Just a note to myself to get back to this ASAP. Update summarizer with this version into production Serratus.

Robert
See here:

https://github.com/ababaian/serratus/blob/master/src/serratus_summarizer_flom.py

I intended to put it under src/summarizer, but my aim was a bit off.

Usage is exactly as previous version with tinyhit.

Environment variable SUMZER_DIR must be directory containing acc_len_taxid.txt and taxid_desc.txt.

Positional arguments are:

(Required) Input filename for SAM data, use /dev/stdin for bowtie2 pipe.
(Required) Output tsv filename for summary.
(Required) Output filename for echoed SAM data, use /dev/stdout for samtools pipe or /dev/null to discard output if generating summaries locally.
(Optional) Output tsv filename for tinyhit formatted data.

You should check the first couple of runs to make sure I didn't make any major mistakes. R

The first line of the summary output now looks like this:

grep -i ^score * | head ERR2906838.tsv:score=87.5;pctid=96.6;pancov=...........................O;topacc=AX191449.1;topdesc=Rat coronavirus; SRR11454606.tsv:score=100.0;pctid=96.6;pancov=...............................O;topacc=MT121215.1;topdesc=Severe acute respiratory syndrome coronavirus 2; ...

The score in this version is the coverage of the pan-genome, i.e. the number of bins with at least one read divided by the number of bins (32).

Later lines in the file are results for each reference sequence individually, similar to previous versions.

The cov2m reference includes many very short genomes (200nt and up), for which this strategy may not work well. To increase the chances that this summarizer will do a good job of ranking Cov and other longer genomes, it sets a minimum length of 1,500nt for a reference to be included in the mega-genome. Alignments to genomes <1500nt count towards a new "short_hits" total but add nothing to the score. This is designed to ensure that Cov is ranked highly without noise from short genomes, e.g. if they hit host and should be blacklisted. If I have time, I will look at datasets with high short_hits counts separately to see if there is anything interesting.

R

Also, reference genomes are now ranked by decreasing coverage instead of by decreasing number of hits. This is to mitigate the problem that one black region can attract a large number of hits, we don't care about that genome unless there is coverage across some reasonable fraction of the bins. R

ababaian commented 4 years ago

For the next iteration of this. Can @rcedgar merge the acc_len_taxid.txt and taxid_desc.txt files (include tax description for each accession) and remove the name requirement here. I would like to have genome version be in these file-names and therefore we have one version of this file per reference i.e. cov2m.taxid or flom0.taxid.

rcedgar commented 4 years ago

The summarizer is due for a careful re-design. It was a very quick hack so that we could get a handle on the first runs, I want to look at the FLOM output and think about how to do a better job of the engineering. However, as a general rule I prefer to avoid duplicated data and keep things lightweight as possible, which argues for the current two files rather than one. I take the point that it is a bit awkward to deploy.

ababaian / serratus

Update Summarizer in Production #98