Summary
This PR adds functionality to optionally filter reads after mapping in the align_and_count task, so the counts of mapped reads are comparable to those following filtering during genome assembly. It also adds new numeric outputs relevant for general QC purposes.

New input parameters
The filtering has the following parameters:
filter_bam_to_proper_primary_mapped_reads: enable filtering
    default: false (no filtering is performed)
do_not_require_proper_mapped_pairs_when_filtering: do not exclude reads lacking the "proper pair" bit; this is helpful/necessary to set to true when using single-end reads as input if filtering is enabled
    default: false (reads are filtered to proper pairs if filtering is enabled)
keep_singletons_when_filtering: singleton reads from paired-end data are kept; this does not affect single-end reads
    default: false (singleton reads are excluded during filtering)
keep_duplicates_when_filtering: reads marked as duplicates are kept; this does not supersede exclusion for violations of other criteria
    default: false (duplicate reads are excluded during filtering)

New output metrics
This PR also adds new numeric output metrics to align_and_count:
pct_total_reads_mapped: the percent of input reads mapping to any of the input reference sequences
    this is helpful for assessing the fraction of reads in a sample originating from sources corresponding to the reference sequences
pct_lesser_hits_of_mapped: of the reads mapping to reference sequences input to align_and_count, the percent mapping to hits that are not the top hit
    this is helpful for assessing cross-talk between hits
The new outputs are exposed in several of the workflows that have singular outputs from align_and_count. A few other workflows call align_and_count, but output an aggregate report with info from multiple inputs.

Recommended usage
The following values are recommended for most use cases, to count high-quality read mappings with duplicates included.
filter_bam_to_proper_primary_mapped_reads=true
keep_duplicates_when_filtering=true
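
As a rough illustration of the parameter and metric descriptions above, here is a minimal Python sketch of how the filtering flags and the two percentages could be interpreted. It is not the task's implementation, which this description does not show. The function names keep_read and pct_metrics are hypothetical. The flag bit values come from the SAM specification. Several choices are assumptions of this sketch: a singleton is taken to be a paired read whose mate is unmapped, supplementary alignments are excluded along with secondary ones, the criteria are applied independently, and read counts are supplied per reference with each read counted once.

```python
# SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_read(flag, *,
              do_not_require_proper_mapped_pairs_when_filtering=False,
              keep_singletons_when_filtering=False,
              keep_duplicates_when_filtering=False):
    """Return True if a read with this FLAG survives filtering (assumed rules)."""
    # Assumption: "primary mapped" means mapped, not secondary, not supplementary.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # By default, reads lacking the "proper pair" bit are excluded.
    if not do_not_require_proper_mapped_pairs_when_filtering and not flag & PROPER_PAIR:
        return False
    # Assumption: a singleton is a paired read whose mate is unmapped.
    is_singleton = bool(flag & PAIRED) and bool(flag & MATE_UNMAPPED)
    if not keep_singletons_when_filtering and is_singleton:
        return False
    # Keeping duplicates does not rescue reads excluded by the checks above.
    if not keep_duplicates_when_filtering and flag & DUPLICATE:
        return False
    return True


def pct_metrics(total_reads, mapped_per_reference):
    """Compute (pct_total_reads_mapped, pct_lesser_hits_of_mapped).

    mapped_per_reference maps a reference name to its mapped-read count.
    Returning 0.0 for empty input is a choice made for this sketch.
    """
    mapped = sum(mapped_per_reference.values())
    top_hit = max(mapped_per_reference.values(), default=0)
    pct_total = 100.0 * mapped / total_reads if total_reads else 0.0
    pct_lesser = 100.0 * (mapped - top_hit) / mapped if mapped else 0.0
    return pct_total, pct_lesser


# Recommended settings: filtering enabled, duplicates kept.
RECOMMENDED = {"keep_duplicates_when_filtering": True}

if __name__ == "__main__":
    # FLAG 1123 = 99 | DUPLICATE: a properly paired, mapped read marked as a duplicate.
    print(keep_read(1123))                 # False under the defaults
    print(keep_read(1123, **RECOMMENDED))  # True with duplicates kept
    # 1000 input reads, 400 mapped to refA and 100 to refB.
    print(pct_metrics(1000, {"refA": 400, "refB": 100}))  # (50.0, 20.0)
```

Under these assumptions, a mapped single-end read (FLAG 0) is dropped unless do_not_require_proper_mapped_pairs_when_filtering is true, which matches the note on single-end input above.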