FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

High duplication percentage #630

Closed: albert-ying closed this issue 1 year ago

albert-ying commented 1 year ago

Hi! Thank you for the tool! I tried to deduplicate my Bismark alignments from WGBS data and found that the duplication rate is very high (70-80%).

Here is an example deduplication_report.txt:

Total number of alignments analysed in ./BISMARK_output/MGB9_CKDL230019771-1A_225VT3LT3_L1_1_bismark_bt2_pe.bam:   5235999
Total number duplicated alignments removed:     3970342 (75.83%)                          
Duplicated alignments were found at:    838656 different position(s)                      

Total count of deduplicated leftover sequences: 1265657 (24.17% of total) 
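
(As a quick sanity check, the percentages in the report are consistent with the raw counts; this is just shell arithmetic on the numbers above, nothing Bismark-specific:)

awk 'BEGIN { printf "removed:  %.2f%%\n", 3970342/5235999*100 }'                        # 75.83%
awk 'BEGIN { printf "leftover: %d (%.2f%%)\n", 5235999-3970342, 1265657/5235999*100 }'  # 1265657 (24.17%)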

I'm wondering whether this is normal. Could it indicate that I did something wrong in the previous steps?

For reference, I followed the steps in the tutorial:

$BISMARK_GENOME_PREPARATION --path_to_aligner /home/kying/mambaforge/bin/ --verbose $REF_GENOME

$BISMARK --bowtie2 \
             --path_to_bowtie /home/kying/mambaforge/bin/ \
             --genome_folder $REF_GENOME \
             --output_dir $OUTPUT_DIR \
             -1 $READ1 \
             -2 $READ2

$BISMARK_DUP --output_dir $OUTPUT_DIR $SAMPLE_file
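
(A side note on the last step: assuming $BISMARK_DUP points at Bismark's deduplicate_bismark script, paired-end mode and BAM output can also be requested explicitly; -p/--paired and --bam are standard deduplicate_bismark options, though it is worth checking deduplicate_bismark --help for your version:)

$BISMARK_DUP -p --bam \
             --output_dir $OUTPUT_DIR \
             $SAMPLE_file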

Thank you!

FelixKrueger commented 1 year ago

Hi @albert-ying

The duplication rate is very much dependent on two factors:

  1. The complexity of the library
  2. The sequencing depth

Regarding library complexity: a highly diverse library contains many different fragments, and you are very unlikely to sequence the same fragment (generated during PCR amplification) more than once. A number of factors can influence this: (i) by design: methods such as RRBS, which chops the genome into a defined number of fragments that you then sequence, or target enrichment, or amplicon sequencing, may result in very high duplication rates; (ii) by misfortune: if you start with a very low amount of material, or if you experience a high degree of degradation during the library prep (e.g. caused by the bisulfite treatment), you will reduce the number of unique fragments in the mix. Tools such as Preseq (https://smithlabresearch.org/software/preseq/) may help you assess the complexity.
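
(A minimal sketch of such a complexity check, assuming preseq and samtools are on your PATH and using lc_extrap to extrapolate the expected yield of unique reads at greater depth; the file names here are placeholders:)

samtools sort -o sample.sorted.bam sample.bam                 # preseq expects a coordinate-sorted BAM
preseq lc_extrap -B -P -o future_yield.txt sample.sorted.bam  # -B: BAM input, -P: paired-end reads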

How much the sequencing depth matters is then more or less a function of the library complexity. As a rule of thumb: the deeper you sequence a given library, the higher the duplication rate.
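
(To make that rule of thumb concrete, here is a toy calculation under a deliberately simple model: assume a hypothetical library of N unique fragments, each equally likely to be sampled. After R reads, the expected number of distinct fragments observed is about N*(1 - exp(-R/N)), so the duplication rate is 1 - unique/R. Real libraries are more skewed than this, so it tends to underestimate real duplication:)

awk 'BEGIN {
  N = 2000000                        # hypothetical number of unique fragments in the library
  for (R = 1000000; R <= 16000000; R *= 2) {
    u = N * (1 - exp(-R/N))          # expected distinct fragments seen after R reads
    printf "%8d reads -> %4.1f%% duplication\n", R, (1 - u/R) * 100
  }
}'
# deeper sequencing of the same library -> ever higher duplication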

A potential third factor in the duplication puzzle is the chance occurrence of independent fragments that align to the very same position in the genome, in the same orientation (and, for paired-end libraries, even with the same start and end points); these are indistinguishable from PCR duplicates and will be removed as well.

To judge better what you have at hand, one would need to know more about the sample type and the sequencing depth, but the points above should already give you some pointers. As a very rough guide, I would hope for WGBS samples to be at the lower end of duplicate alignments (10-30%?), while RRBS would be in the region of 95%+ (note that these types of libraries are generally not recommended to be deduplicated...).

albert-ying commented 1 year ago

Thank you so much for the detailed explanation!