NAL-i5K / NAL_RNA_seq_annotation_pipeline


Update downsample bam #27

Closed by mpoelchau 4 years ago

mpoelchau commented 4 years ago

We'll need to downsample the merged BAM file before transferring it to our servers. Picard used to support downsampling to a maximum depth, but that function has since been deprecated.

For now, I'd suggest we use samtools view -bs [randomseed].[proportion] [merged_bamfile] > [downsampled_bamfile]

In the code, this would replace the function starting here: https://github.com/NAL-i5K/NAL_RNA_seq_annotation_pipeline/blob/update-rnannot/rnannot/RNAseq_annotate.py#L428

Samtools documentation for samtools view -s (http://www.htslib.org/doc/samtools-view.html):

-s FLOAT Output only a proportion of the input alignments. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate.

The integer and fractional parts of the -s INT.FRAC option are used separately: the part after the decimal point sets the fraction of templates/pairs to be kept, while the integer part is used as a seed that influences which subset of reads is kept.

When subsampling data that has previously been subsampled, be sure to use a different seed value from those used previously; otherwise more reads will be retained than expected.
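As a rough illustration of how the seed and proportion combine into the single -s value, here is a minimal Python sketch. The function names (subsample_arg, build_downsample_cmd, downsample_bam) and file names are hypothetical and are not taken from RNAseq_annotate.py; it only assumes that samtools is on the PATH and uses the -b, -s, and -o options of samtools view.

```python
import subprocess


def subsample_arg(seed: int, fraction: float) -> str:
    """Build the INT.FRAC value for `samtools view -s` (hypothetical helper).

    The integer part is the random seed; the digits after the decimal
    point are the fraction of templates/pairs to keep.
    """
    if seed < 0:
        raise ValueError("seed must be a non-negative integer")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Keep six decimal places, then drop trailing zeros: 0.25 -> "25".
    digits = f"{fraction:.6f}".split(".")[1].rstrip("0")
    if not digits:
        raise ValueError("fraction rounds to 0 or 1 at six decimal places")
    return f"{seed}.{digits}"


def build_downsample_cmd(in_bam: str, out_bam: str, seed: int, fraction: float) -> list:
    """Return the samtools command as an argument list (no shell involved)."""
    return [
        "samtools", "view",
        "-b",                               # write BAM output
        "-s", subsample_arg(seed, fraction),
        "-o", out_bam,                      # output file instead of stdout
        in_bam,
    ]


def downsample_bam(in_bam: str, out_bam: str, seed: int, fraction: float) -> None:
    """Run samtools; raises CalledProcessError if samtools exits non-zero."""
    subprocess.run(build_downsample_cmd(in_bam, out_bam, seed, fraction), check=True)
```

For example, build_downsample_cmd("merged.bam", "downsampled.bam", 42, 0.25) passes -s 42.25 to samtools, which keeps roughly a quarter of the read pairs using seed 42. If the file has already been subsampled, a different seed should be chosen, as the documentation above notes.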

mpoelchau commented 4 years ago

Another strategy: https://davemcg.github.io/post/easy-bam-downsampling/

mpoelchau commented 4 years ago

This is moot, since we've decided to normalize the reads instead.