How to polish with long reads and short reads in Single-End mode

gitcruz commented 4 years ago

Hi,

I have tried polishing an assembly with short reads (HiC reads, with the -unpaired option in the cfg) and long reads (NextDenovo error-corrected reads), but it is taking longer than expected (previously, 5 iterations with PacBio corrected reads took 1 day). I am worried that it is not working well, but I don't see any error in the pid.log. Below is the output of tail -f:

[INFO] 2020-10-08 19:18:29,252 total jobs: 1
[INFO] 2020-10-08 19:18:29,254 Throw jobID:[25602] jobCmd: s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/02.map.ref.sh.work/map_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:06:26,494 align_genome done
[INFO] 2020-10-08 22:06:26,500 analysis tasks done
[INFO] 2020-10-08 22:06:26,505 total jobs: 1
[INFO] 2020-10-08 22:06:26,507 Throw jobID:[4844] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/03.merge.bam.sh.work/merge_bam0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:18:06,920 merge_bam done
[INFO] 2020-10-08 22:18:06,926 analysis tasks done
[INFO] 2020-10-08 22:18:06,930 total jobs: 1
[INFO] 2020-10-08 22:18:06,931 Throw jobID:[5628] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/04.polish.ref.sh.work/polish_genome0/Lynruf5-3long-2short.sh] in the local_cycle.

I am afraid I might be passing some of the options incorrectly. Perhaps the iterations should not be "12" for single-end mode:

[General]
job_type = local
job_prefix = asm5-3long-2short
task = 555121212
rewrite = yes
rerun = 2
parallel_jobs = 1
multithread_jobs = 24
genome = ./input_assembly/LynRuf5.fa
genome_size = auto
workdir = ./out-asm5-3long-2short
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sample4-illumina_hic.fofn
sgs_options = -unpaired -max_depth 30

[lgs_option]
lgs_fofn = ./sample4-corrected_pacbio.fofn
lgs_options = -min_read_len 10k -max_read_len 135k -max_depth 40
lgs_minimap2_options = -x map-pb -t 6

The read fofn files look like this:

::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-corrected_pacbio.fofn
::::::::::::::
reads/corrected_pacbio/cns0.fasta
reads/corrected_pacbio/cns1.fasta
reads/corrected_pacbio/cns2.fasta
reads/corrected_pacbio/cns3.fasta
reads/corrected_pacbio/cns4.fasta
reads/corrected_pacbio/cns5.fasta
reads/corrected_pacbio/cns6.fasta
::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-illumina_hic.fofn
::::::::::::::
reads/illumina_raw_hic/Lru-4_hic_R1.fastq.gz
reads/illumina_raw_hic/Lru-4_hic_R2.fastq.gz

NOTE: I removed sensitive info about the genome from the absolute paths. I want to use the HiC reads in single-end mode because the two mates of a pair could come from distant locations in the genome.

Please let me know if I am doing anything wrong.

Thanks, F

moold commented 4 years ago

Hi, in general, I do not recommend polishing with error-corrected reads; just use raw reads. Error-corrected reads may contain biased errors (introduced by the error-correction step). BTW, ~30-40x of reads is not enough if you want a high-accuracy assembly. I also do not recommend polishing with single-end reads, because single-end reads map randomly in highly repetitive regions.
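
To check the depth figures mentioned here, a minimal sketch of the usual estimate (total read bases divided by genome size) is given below. The function name `estimate_depth` and the toy numbers are hypothetical and are not part of NextPolish; real read lengths would have to be collected from the FASTA/FASTQ files first.

```python
def estimate_depth(read_lengths, genome_size):
    """Mean sequencing depth: total read bases divided by genome size (in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of bases")
    return sum(read_lengths) / genome_size


# Hypothetical toy data: 700 reads of 10 kb against a 100 kb genome.
toy_read_lengths = [10_000] * 700
toy_genome_size = 100_000

depth = estimate_depth(toy_read_lengths, toy_genome_size)
print(f"estimated depth: {depth:.1f}x")
```

A value in the ~30-40x range would fall under the "not enough" case described above.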

gitcruz commented 4 years ago

OK, I understand. Then I will set things up to polish with raw PacBio reads (~70x). The corrected reads have ~40x coverage.

In a previous test with PacBio corrected reads I ran 5 iterations, but the 2nd one actually seems to be the best (better BUSCO and fewer misassemblies when compared to a close reference). I think you recommend 3 for long-read polishing, right?

Thank you very much, Fernando

moold commented 4 years ago

Hi, 2-3 iterations is OK, but the final accuracy of an assembly depends on short-read polishing.
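
As a rough way to inspect a `task` string against this advice, a minimal sketch follows. It assumes the task-code meaning given in the NextPolish documentation (1 and 2 are short-read algorithm modules, 5 is the long-read module); other codes are rejected here for simplicity. The helper names `decode_task` and `ends_with_short_reads` are hypothetical and are not part of NextPolish.

```python
# Assumed mapping of task codes to read type (see the lead-in above).
STEP_KIND = {"1": "short", "2": "short", "5": "long"}


def decode_task(task):
    """Translate a task string such as '555121212' into a list of step kinds."""
    steps = []
    for code in str(task):
        if code not in STEP_KIND:
            raise ValueError(f"task code not handled by this sketch: {code!r}")
        steps.append(STEP_KIND[code])
    return steps


def ends_with_short_reads(task):
    """True if the last polishing step in the task string uses short reads."""
    steps = decode_task(task)
    return bool(steps) and steps[-1] == "short"


steps = decode_task("555121212")  # the task value from the cfg in this thread
print("long-read steps:", steps.count("long"))
print("short-read steps:", steps.count("short"))
print("finishes with short reads:", ends_with_short_reads("555121212"))
```

For the cfg posted in this thread, the long-read steps all run before the short-read steps, so the last step is a short-read one.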

Rob-murphys commented 4 years ago

@moold Hi, do you mean that we should not do any form of cleaning on the raw reads, such as with BBTools, before using them in NextPolish?

moold commented 4 years ago

Not exactly. I mean that if you want a high-accuracy genome, whether or not you polished it with long reads, you should polish the genome with short reads as the last step. It is difficult to produce a high-accuracy genome using only noisy long reads. Of course, it is better to do some cleaning on the raw short reads to remove low-QV reads.

Nextomics / NextPolish

How to polish with long reads and short reads in Single-End mode #54