dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Significant drop in BUSCO + weird histogram plot with Illumina reads #37

Open majogomezhughes opened 4 years ago

majogomezhughes commented 4 years ago

Hi dfguan,

I am trying to use purge_dups for a mammalian assembly, since the BUSCO results came out with a high duplication rate:

    C:84%[D:49%],F:7.5%,M:7.7%,n:3023

I ran purge_dups as recommended for Illumina reads, with the following commands:

    bwa index primdraft.fa
    bwa mem -t16 primdraft.fa ../../Data/reads/reads1.fq.gz ../../Data/reads/reads2.fq.gz | samtools view -@16 -b -o - > primdraft.bam
    ../../../../programs/purge_dups/src/ngscstat -q 30 primdraft.bam
    ../../../../programs/purge_dups/bin/split_fa primdraft.fa > primdraft-split.fa
    minimap2 -xasm5 -DP primdraft-split.fa primdraft-split.fa | gzip -c - > primdraft-split.fa.self.paf.gz
    ../../../../programs/purge_dups/bin/purge_dups -2 -T cutoffs -c TX.base.cov primdraft-split.fa.self.paf.gz > dups.bed 2> purge_dups.log
    ../../../../programs/purge_dups/bin/get_seqs dups.bed primdraft.fa

Afterwards the BUSCO results dropped to:

    C:15%[D:0.0%],F:2.3%,M:81%,n:3023

I plotted the histogram and got the following, which seems a bit odd:

[image: cutoffs-hist]

Maybe new cutoff values would help, but given that histogram I would not know which to pick.

Thank you so much for your help! Maria
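A drop from 84% to 15% complete BUSCOs suggests that a large share of the assembly was removed, so one quick check is to total the bases listed in dups.bed per label. The sketch below is only illustrative. It assumes dups.bed is tab-separated with the start in column 2, the end in column 3 and a label (such as JUNK or HAPLOTIG) in column 4. The function name `bed_bases_by_type` is made up for this example and is not part of purge_dups.

```shell
# Sum the bases flagged in a BED file, grouped by the label in column 4.
# Assumption: tab-separated columns are contig, start, end, label.
bed_bases_by_type() {
    awk -F'\t' '
        NF >= 4 { total[$4] += $3 - $2 }
        END { for (label in total) printf "%s\t%.0f\n", label, total[label] }
    ' "$1" | LC_ALL=C sort
}

# Example: only runs if dups.bed is present in the current directory.
if [ -f dups.bed ]; then
    bed_bases_by_type dups.bed
fi
```

If one label accounts for most of the assembly size, that points at the cutoff responsible, which may help when choosing new values by hand.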

dfguan commented 4 years ago

Hi Maria,

I guess your assembly is highly repetitive, which would explain why so many bases have very low read depth (<5). You could try running ngscstat with a lower -q, say 15; that may help to correct the plot. Sorry if this is not a useful suggestion, I have very little experience with short reads.

Best, Dengfeng.
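To see whether lowering -q changes anything, the share of bases below depth 5 can be compared before and after rerunning ngscstat. The following is a minimal sketch, assuming ngscstat writes a depth histogram named TX.stat with two whitespace-separated columns (depth, number of bases). Both the file layout and the helper name `low_depth_fraction` are assumptions for illustration.

```shell
# Print the fraction of bases whose read depth is below a threshold (default 5).
# Assumption: the stat file has two columns: depth, base count.
low_depth_fraction() {
    awk -v t="${2:-5}" '
        NF >= 2 { total += $2; if ($1 + 0 < t + 0) low += $2 }
        END { if (total > 0) printf "%.3f\n", low / total; else print "NA" }
    ' "$1"
}

# Example: only runs if TX.stat is present in the current directory.
if [ -f TX.stat ]; then
    low_depth_fraction TX.stat 5
fi
```

Running this once on the -q 30 output and once on the -q 15 output gives a single number to compare alongside the histogram.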