Nextomics / NextPolish

Fast and accurately polish the genome generated by long reads.
GNU General Public License v3.0

How to polish with long reads and short reads in Single-End mode #54

Open gitcruz opened 4 years ago

gitcruz commented 4 years ago

Hi,

I have tried polishing an assembly with short reads (Hi-C reads with the -unpaired option in the cfg) and long reads (NextDenovo error-corrected reads), but it is taking longer than expected (previously, 5 iterations with PacBio corrected reads took 1 day). So I am worried that it is not working well, although I don't see any error in the pid.log. Below is the output of tail -f:

[INFO] 2020-10-08 19:18:29,252 total jobs: 1
[INFO] 2020-10-08 19:18:29,254 Throw jobID:[25602] jobCmd: s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/02.map.ref.sh.work/map_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:06:26,494 align_genome done
[INFO] 2020-10-08 22:06:26,500 analysis tasks done
[INFO] 2020-10-08 22:06:26,505 total jobs: 1
[INFO] 2020-10-08 22:06:26,507 Throw jobID:[4844] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/03.merge.bam.sh.work/merge_bam0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:18:06,920 merge_bam done
[INFO] 2020-10-08 22:18:06,926 analysis tasks done
[INFO] 2020-10-08 22:18:06,930 total jobs: 1
[INFO] 2020-10-08 22:18:06,931 Throw jobID:[5628] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/04.polish.ref.sh.work/polish_genome0/Lynruf5-3long-2short.sh] in the local_cycle.

I am afraid I might be setting some of the options incorrectly. Perhaps the task codes for the short-read iterations are not "12" in single-end mode:

[General]
job_type = local
job_prefix = asm5-3long-2short
task = 555121212
rewrite = yes
rerun = 2
parallel_jobs = 1
multithread_jobs = 24
genome = ./input_assembly/LynRuf5.fa
genome_size = auto
workdir = ./out-asm5-3long-2short
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sample4-illumina_hic.fofn
sgs_options = -unpaired -max_depth 30

[lgs_option]
lgs_fofn = ./sample4-corrected_pacbio.fofn
lgs_options = -min_read_len 10k -max_read_len 135k -max_depth 40
lgs_minimap2_options = -x map-pb -t 6

The reads fofn look like this:

::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-corrected_pacbio.fofn
::::::::::::::
reads/corrected_pacbio/cns0.fasta
reads/corrected_pacbio/cns1.fasta
reads/corrected_pacbio/cns2.fasta
reads/corrected_pacbio/cns3.fasta
reads/corrected_pacbio/cns4.fasta
reads/corrected_pacbio/cns5.fasta
reads/corrected_pacbio/cns6.fasta
::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-illumina_hic.fofn
::::::::::::::
reads/illumina_raw_hic/Lru-4_hic_R1.fastq.gz
reads/illumina_raw_hic/Lru-4_hic_R2.fastq.gz

NOTE: I removed sensitive info about the genome from the absolute paths. I want to use the Hi-C reads in single-end mode because the two mates of a Hi-C pair can come from distant locations in the genome.

Please let me know if I am doing anything wrong.

Thanks, F

moold commented 4 years ago

Hi, in general, I do not recommend polishing with error-corrected reads; just use raw reads. The error-corrected reads may contain some biased errors (introduced by the error-correction step). BTW, ~30-40x reads are not enough if you want to get a high-accuracy assembly. I also do not recommend polishing with single-end reads, because single-end reads map randomly in highly repetitive regions.
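A depth figure such as the ~30-40x mentioned here can be checked before a run by dividing the total number of read bases by the genome size. The following is a minimal sketch, not part of NextPolish: the helpers `count_bases` and `estimate_depth` are hypothetical names, the parser assumes plain FASTA or four-line FASTQ files (optionally gzipped), and paths in the fofn are used exactly as written.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_bases(path):
    """Count sequence bases in a FASTA or four-line FASTQ file."""
    total = 0
    with open_text(path) as handle:
        first = handle.readline()
        if not first:
            return 0
        if first.startswith(">"):
            # FASTA: every non-header line is sequence.
            for line in handle:
                if not line.startswith(">"):
                    total += len(line.strip())
        elif first.startswith("@"):
            # FASTQ (assumed four lines per record): after the first header,
            # the sequence is every fourth remaining line, starting at index 0.
            for i, line in enumerate(handle):
                if i % 4 == 0:
                    total += len(line.strip())
        else:
            raise ValueError(f"unrecognised format: {path}")
    return total


def estimate_depth(fofn, genome_size):
    """Return total read bases divided by genome size (fold coverage)."""
    total = 0
    for line in Path(fofn).read_text().splitlines():
        name = line.strip()
        if name:  # skip blank lines in the fofn
            total += count_bases(name)
    return total / genome_size
```

Calling `estimate_depth("reads.fofn", genome_size)` with the assembly length as `genome_size` gives a rough fold coverage; it ignores any length or depth filters that the polishing run itself applies.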

gitcruz commented 4 years ago

OK, I understand. Then I will set things up to polish with raw PacBio reads (~70x). The corrected reads have ~40x coverage.

In a previous test with PacBio corrected reads I ran 5 iterations, but the 2nd one actually seems to be the best (better BUSCO and fewer misassemblies when compared to a close reference). I think you recommend 3 for long-read polishing, right?

Thank you very much, Fernando

moold commented 4 years ago

Hi, 2-3 iterations are OK, but the final accuracy of an assembly depends on short-read polishing.
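The `task` value in the config posted earlier, 555121212, encodes the iteration counts directly. The sketch below assumes the digit meanings described in the NextPolish documentation (each `5` is one long-read polishing pass, and `1` and `2` are short-read algorithm modules, so each `12` pair is one short-read round); that mapping should be verified against the docs for the version in use. The helper `describe_task` is a hypothetical name, not part of NextPolish.

```python
def describe_task(task):
    """Summarise a numeric NextPolish-style task string.

    Assumed mapping (check the NextPolish docs for your version):
    '5' = one long-read pass; '1' and '2' = short-read modules,
    with each consecutive '12' counted as one short-read round.
    Any other digits are ignored by this sketch.
    """
    task = str(task)
    if not task.isdigit():
        raise ValueError("only numeric task strings are handled here")
    long_passes = task.count("5")
    short_steps = "".join(c for c in task if c in "12")
    return {
        "long_read_passes": long_passes,
        "short_read_steps": len(short_steps),
        "short_read_rounds": short_steps.count("12"),
    }
```

Under these assumptions, `describe_task("555121212")` reports three long-read passes and three short-read rounds, which is within the 2-3 iterations suggested here.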

Rob-murphys commented 4 years ago

@moold Hi, do you mean that we should not do any form of cleaning on the raw reads, such as with BBTools, before using them in NextPolish?

moold commented 4 years ago

Not exactly. I mean that if you want a high-accuracy genome, whether or not you polished it with long reads, you should polish the genome with short reads in the last step. It is difficult to produce a high-accuracy genome using noisy long reads only. Of course, it is better to do some cleaning on the raw short reads to remove low-QV reads.
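As a rough illustration of removing low-QV short reads, the sketch below keeps only FASTQ records whose mean Phred score reaches a threshold. It assumes Phred+33 encoding and four-line records; the function names and the cutoff of 20 are illustrative assumptions, and a dedicated trimming tool would normally be used instead.

```python
import gzip


def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_fastq(in_path, out_path, min_mean_q=20.0):
    """Copy records with mean quality >= min_mean_q; return (kept, dropped)."""
    opener_in = gzip.open if str(in_path).endswith(".gz") else open
    opener_out = gzip.open if str(out_path).endswith(".gz") else open
    kept = dropped = 0
    with opener_in(in_path, "rt") as src, opener_out(out_path, "wt") as dst:
        while True:
            # Read one record: header, sequence, separator, quality.
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            if mean_phred(record[3].rstrip("\n")) >= min_mean_q:
                dst.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Filtering R1 and R2 files independently with this function would desynchronise the pairs, so paired-end data would need both mates kept or dropped together.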