bguo068 / snp_call_nf

Nextflow pipeline for plasmodium SNP calling
MIT License
3 stars 3 forks source link

Saturation/Rarefaction Capability #5

Closed bguo068 closed 1 year ago

bguo068 commented 2 years ago

copied from https://github.com/umb-oconnorgroup/plasmodium_snp_call_pipeline_wdl/issues/3

Add support for subsampling BAM files to determine if sequencing was performed sufficiently. More specifically, the final goal is to determine metrics like %genome coverage or % 5x genome coverage as a function of the subsampled read number.

Bash code used in another pipeline that uses seqtk to subsample the given number of reads for generation of BAM files:

Subsetting for rarefaction analysis (500K, 1M, 3M, 5M, 10M, 20M, 30M)

for i in 500000 1000000 2000000 3000000 5000000 10000000 15000000 20000000 25000000 30000000

    do
    echo "Round reads $i"
    out1=pair1$i
    out2=pair2$i
    /usr/local/packages/seqtk/bin/seqtk sample -s100 $mate_1_f $i > $out1
    /usr/local/packages/seqtk/bin/seqtk sample -s100 $mate_2_f $i > $out2

Below is code used to create recalibrated coverage files from the resulting output using bedtools.

for f in ./*/;
        do
        for j in 500000 1000000 2000000 3000000 5000000 10000000 15000000 20000000 25000000  30000000
                do
                bedtools genomecov -d -ibam $f/alignments/${j}_recalibrated.bam -g /local/data/Malaria/Projects/Takala-Harrison/PfCRT_Cambodia/PfCRT_Cambodia_tmp/IVC_WR1576_WGS/MappingAndSNPCallingFiles/PlasmoDB_v44_Reference/PlasmoDB-44_Pfalciparum3D7_Genome.fasta > $f/alignments/${j}_recalibrated.coverage.perbase.new;
                done
        done
bguo068 commented 2 years ago

We are going to use bam files from the process GATK_APPLY_BQSR

image
bguo068 commented 1 year ago

The issue is more like a problem for making pseudo samples than one for extensively modifying the pipeline.

@Matthew-A-epi

bguo068 commented 1 year ago

close with the commit above