Nextomics / NextPolish

Fast and accurately polish the genome generated by long reads.
GNU General Public License v3.0

BUSCO and LAI decrease after polish #64

Open Weihankk opened 3 years ago

Weihankk commented 3 years ago

Question or Expected behavior: Hello, I used NextPolish with Illumina reads to polish my PacBio assembly contigs (assembled with Canu), but the BUSCO and LAI scores decreased after the polishing step. I ran several tests, and all of them seem to lower the BUSCO and LAI scores. I am curious why this happens. Do you have any insight into this phenomenon?

NextPolish version: 1.3.1. Below is my run script:

#!/bin/bash
#Set input and parameters
round=1
threads=30
read1=$2
read2=$3
input=$1

for ((i=1; i<=${round};i++)); do
    #step 1
    # index the genome file and do alignment
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 6 -F 0x4 -b -|samtools fixmate -m --threads 6  - -|samtools sort -m 2g --threads 6 -|samtools markdup --threads 6 -r - sgs.sort.bam
    #index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    #polish genome file
    python /store/whzhang/tools/NextPolish_1.3.1/NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa
    input=genome.polishtemp.fa
    #step2

    #index genome file and do alignment
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 6 -F 0x4 -b -|samtools fixmate -m --threads 6  - -|samtools sort -m 2g --threads 6 -|samtools markdup --threads 6 -r - sgs.sort.bam
    #index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    #polish genome file
    python /store/whzhang/tools/NextPolish_1.3.1/NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa
    input=genome.nextpolish.fa
done

Additional context (Optional): I have tried several combinations of polishing methods and measured BUSCO and LAI for each:

| Assembly | Contig N50 (Mb) | BUSCO (%) | LAI |
|---|---|---|---|
| Raw contig | 5.77 | 98.80 | 23.29 |
| Raw + Arrow (1 round) | 5.77 | 98.60 | 23.68 |
| Raw + Arrow + NextPolish (4 rounds) | 5.77 | 98.70 | 23.12 |
| Raw + NextPolish (1 round) | 5.77 | 98.70 | 23.25 |
| Raw + Arrow + NextPolish (1 round) | 5.77 | 98.70 | 23.15 |
My PacBio data is ~160x coverage, and my Illumina short reads are ~60x. As you can see, every polishing step decreases BUSCO and LAI. Polishing with Arrow and PacBio subreads seems to lower the scores even more.
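For scale, the BUSCO percentages in the table can be converted into approximate gene counts. The following is a minimal sketch, not part of the original run script: `busco_gene_delta` is a hypothetical helper, and the lineage dataset size of 1614 BUSCOs is an assumed placeholder, since the lineage used for these runs is not stated. Replace it with the total reported in the BUSCO summary.

```bash
#!/bin/bash
# Convert a BUSCO percentage difference into an approximate number of genes.
# $1 = BUSCO % before polishing
# $2 = BUSCO % after polishing
# $3 = total number of BUSCOs in the lineage dataset (ASSUMPTION: 1614 below)
busco_gene_delta() {
    awk -v before="$1" -v after="$2" -v total="$3" \
        'BEGIN { printf "%.1f\n", (before - after) / 100 * total }'
}

# Raw contig vs. Raw + NextPolish (1 round), values taken from the table
busco_gene_delta 98.80 98.70 1614
# Raw contig vs. Raw + Arrow (1 round)
busco_gene_delta 98.80 98.60 1614
```

Under that assumed dataset size, a 0.1 percentage point change corresponds to fewer than two genes.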

moold commented 3 years ago

Was your genome assembled from HiFi reads? If so, there is no need to polish it with subreads, but you can polish it with the HiFi reads. By the way, because these results are very similar, the difference is probably due to a few gene differences caused by random mapping, so you can ignore it. Of course, you can also call homozygous SNPs to evaluate global accuracy.
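One possible way to do the homozygous SNP check is sketched below. It is an illustration only, not a NextPolish feature: it assumes `bcftools` is installed, that the short reads have been aligned to the assembly being evaluated (for example the `sgs.sort.bam` produced by the script above), and that the assembly has a `.fai` index from `samtools faidx`. The function names `count_hom_snps` and `errors_per_mb` are hypothetical. Homozygous SNPs are sites where the reads consistently disagree with the assembly, so fewer of them per Mb suggests a more accurate consensus.

```bash
#!/bin/bash
# Count homozygous SNPs called from short reads aligned to an assembly.
# $1 = assembly FASTA, $2 = coordinate-sorted, indexed BAM of reads on that assembly
count_hom_snps() {
    local genome=$1 bam=$2
    bcftools mpileup -Ou -f "${genome}" "${bam}" \
        | bcftools call -mv -Ou \
        | bcftools view -H -g hom -v snps - \
        | wc -l
}

# Normalize a variant count by assembly length.
# $1 = number of homozygous SNPs, $2 = assembly length in bp
errors_per_mb() {
    awk -v count="$1" -v len="$2" 'BEGIN { printf "%.2f\n", count / len * 1e6 }'
}

# Run only when called as: script.sh genome.fa reads.bam
if [ "$#" -eq 2 ]; then
    snps=$(count_hom_snps "$1" "$2")
    # Sum the sequence lengths from column 2 of the faidx index
    length=$(awk '{ total += $2 } END { print total }' "$1.fai")
    echo "homozygous SNPs: ${snps}"
    echo "per Mb: $(errors_per_mb "${snps}" "${length}")"
fi
```

Running this on both the raw and the polished assembly, each with reads aligned to that same assembly, gives two numbers that can be compared directly.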