Update downsample bam

mpoelchau commented 4 years ago

We'll need to downsample the merged bam file before transferring it to our servers. Picard used to be able to downsample to a maximum depth, but that function has since been deprecated.

For now, I'd suggest we use samtools view -bs [randomseed].[proportion] [merged_bamfile]

[randomseed] - a randomly generated integer, used as the seed
[proportion] - the fraction of reads to keep, written as the digits after the decimal point. I'd suggest that the proportion is 1/number of SRA files. For example, if the merged bamfile was made from 9 original SRA files, then the proportion would be 1/9, or 0.1111 (let's cut off after 4 decimal places), so with a seed of 42 the option would be -bs 42.1111. Or if there are only 2 files, then it would be 1/2, or 0.5.
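A rough sketch of how the seed and proportion could be combined in Python, in case it helps. The function names (subsample_arg, downsample_command), the seed range, and the file names are placeholders I made up, not anything that exists in RNAseq_annotate.py. It assumes that a single SRA file means no downsampling is needed, and it truncates 1/n to 4 decimal places as suggested above.

```python
import random


def subsample_arg(n_sra_files, seed=None):
    """Build the INT.FRAC value for `samtools view -s`.

    Returns None when there is only one SRA file, since a proportion of
    1.0 cannot be expressed as a fractional part (assumption: we simply
    skip downsampling in that case).
    """
    if n_sra_files < 1:
        raise ValueError("need at least one SRA file")
    if n_sra_files == 1:
        return None
    # 1/n truncated (not rounded) to 4 decimal places, as integer digits.
    frac_digits = 10000 // n_sra_files
    if frac_digits == 0:
        raise ValueError("proportion is zero at 4 decimal places")
    if seed is None:
        # Hypothetical seed range; any integer should work as a seed.
        seed = random.randint(1, 9999)
    # Zero-pad so that e.g. 1/11 becomes .0909 rather than .909
    return f"{seed}.{frac_digits:04d}"


def downsample_command(merged_bam, out_bam, n_sra_files, seed=None):
    """Return the samtools argument list, or None if nothing to do."""
    arg = subsample_arg(n_sra_files, seed)
    if arg is None:
        return None
    return ["samtools", "view", "-b", "-s", arg, "-o", out_bam, merged_bam]
```

The returned list could be passed to subprocess.run(cmd, check=True). For 9 SRA files and a seed of 42, the -s value comes out as 42.1111.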

In the code, this would replace the function starting here: https://github.com/NAL-i5K/NAL_RNA_seq_annotation_pipeline/blob/update-rnannot/rnannot/RNAseq_annotate.py#L428

Samtools documentation for samtools view -s (http://www.htslib.org/doc/samtools-view.html):

-s FLOAT Output only a proportion of the input alignments. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate.

The integer and fractional parts of the -s INT.FRAC option are used separately: the part after the decimal point sets the fraction of templates/pairs to be kept, while the integer part is used as a seed that influences which subset of reads is kept.

When subsampling data that has previously been subsampled, be sure to use a different seed value from those used previously; otherwise more reads will be retained than expected.
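If we ever subsample a file more than once, the seed warning above could be handled by remembering which seeds were already used. A minimal sketch, assuming we track used seeds ourselves (samtools does not record them for us); fresh_seed and its upper bound are made-up names and values for illustration.

```python
import random


def fresh_seed(used_seeds, rng=None, upper=9999):
    """Pick a random seed in 1..upper that is not in used_seeds."""
    if rng is None:
        rng = random.Random()
    candidates = [s for s in range(1, upper + 1) if s not in used_seeds]
    if not candidates:
        raise ValueError("no unused seeds left in range")
    return rng.choice(candidates)
```

The caller would add the returned seed to its own record of used seeds before the next round of subsampling.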

mpoelchau commented 4 years ago

Another strategy: https://davemcg.github.io/post/easy-bam-downsampling/

mpoelchau commented 4 years ago

This is moot, since we've decided to normalize the reads instead.

NAL-i5K / NAL_RNA_seq_annotation_pipeline

Update downsample bam #27