chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

segmentation fault error and core dumped #93

Open daniazi opened 3 years ago

daniazi commented 3 years ago

Hello,

This is similar to issue #69, but for me it is still unsolved. I am running hifiasm v0.14.2-r315 to assemble a ~3.5 Gb genome. My HiFi fastq files contain 1M reads each (~5.5 million reads in total) with an average read length of ~20 kb. As before, hifiasm keeps crashing after the first step with a segmentation fault. There is nothing in the output directory except a 50 GB core file, core.30130. I am running it with 48 cores and 480 GB of RAM.

My command: hifiasm -o assembly.asm -t 48 1.fastq 2.fastq 3.fastq 4.fastq 5.fastq

The log is attached. hifiasm_assembly1.txt

Any idea what's wrong there?
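For orientation, the read count, read length and genome size quoted above give a rough sequencing depth. The sketch below is only back-of-the-envelope arithmetic on those approximate figures; the function name `estimated_coverage` is made up for this illustration and is not part of hifiasm.

```python
def estimated_coverage(n_reads: float, mean_read_len_bp: float, genome_size_bp: float) -> float:
    """Approximate sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Approximate figures from the post: ~5.5 million reads, ~20 kb mean length, ~3.5 Gb genome.
cov = estimated_coverage(5.5e6, 20_000, 3.5e9)
print(f"approximate coverage: {cov:.1f}x")
```

This assumes the reads are spread evenly over a single haploid genome of the stated size, which may not hold for a polyploid or contaminated sample.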

chhylp123 commented 3 years ago

The histogram peak looks weird, so hifiasm cannot find the correct peaks for assembly. I'm confused about why your HiFi data has such peaks. Have you checked for contamination?

HenrivdGeest commented 3 years ago

I would make a k-mer plot of your HiFi data first. I think this should always be done: users should already see peaks at the expected coverage before running hifiasm. GenomeScope works perfectly on HiFi data, even with k-mers up to 256, but 31 is fine for the k-mer plot. With the distribution you have shown, it almost looks like CLR data. Are you sure it's HiFi? If it is HiFi, GenomeScope should be able to predict your genome size.
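To make "peaks at expected coverage" concrete, here is a toy sketch of what a k-mer spectrum is. It is a pure-Python illustration for small inputs only, not a replacement for Jellyfish or GenomeScope on real data; the helper names (`canonical`, `kmer_spectrum`, `main_peak`) and the `min_multiplicity` cutoff are assumptions made for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_spectrum(reads, k: int = 21) -> Counter:
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if _VALID.issuperset(kmer):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def main_peak(spectrum: Counter, min_multiplicity: int = 2):
    """Multiplicity with the most distinct k-mers, ignoring the low-count error tail."""
    candidates = {m: n for m, n in spectrum.items() if m >= min_multiplicity}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For clean, accurate reads from a single genome, the multiplicity returned by `main_peak` would be expected to sit near the sequencing depth; a spectrum with no clear maximum is the kind of "weird" histogram discussed in this thread.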

On Thu, Apr 8, 2021 at 5:34 PM chhylp123 @.***> wrote:

The histogram peak looks weird, so hifiasm cannot find the correct peaks for assembly. I'm confused about why your HiFi data has such peaks. Have you checked for contamination?


daniazi commented 3 years ago

The histogram peak looks weird, so hifiasm cannot find the correct peaks for assembly. I'm confused about why your HiFi data has such peaks. Have you checked for contamination?

I am running the contamination check with blobtools at the moment, and I will also see whether it is possible with Kraken. But don't you think the assembler should go on to produce some assembly (even a fragmented one) regardless of contamination?

daniazi commented 3 years ago

I would make a k-mer plot of your HiFi data first. I think this should always be done: users should already see peaks at the expected coverage before running hifiasm. GenomeScope works perfectly on HiFi data, even with k-mers up to 256, but 31 is fine for the k-mer plot. With the distribution you have shown, it almost looks like CLR data. Are you sure it's HiFi? If it is HiFi, GenomeScope should be able to predict your genome size.

Thanks, Henri. I did count k-mers with Jellyfish to look at the genomic characteristics. Please see the GenomeScope plots at k=21 (read length = 20000; max k-mer coverage = 400 and 3000). This doesn't show any signs of contamination, right? What do you think of them in general?

[Images: GenomeScope plots at max k-mer coverage 400 and 3000]

HenrivdGeest commented 3 years ago

I am surprised that GenomeScope was able to fit the model at all. I think you can ignore the predicted genome sizes, since you do not see any clear peak. A few possibilities: a) you have many contaminants present at low coverage; b) your sample is not pure, it is a very heterozygous genome, and you pooled many individuals; c) your HiFi (Q20+) data is contaminated with CLR reads; d) a possibility I did not think of...

HiCanu will probably assemble this into something, since it is not so sensitive to the k-mer peaks. Maybe that will help you identify your problem.

daniazi commented 3 years ago

The results from the Canu assembly made me try other assemblers, and I found hifiasm. DNA was obtained from only one individual. I am not sure about CLR contamination; I got only fastq files from the facility. Is there a way to check for CLR reads?

But my question right now is why the assembler stops with a segmentation fault. Is it somehow related to contamination or the k-mer distribution? Maybe I should also try -f38 with aggressive purging.

lh3 commented 3 years ago

Is there a way to check for CLR reads?

The k-mer spectrum suggests your dataset is almost certainly CLR, not HiFi. Then hifiasm won't work. You can try Canu's CLR mode.
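On the question of checking for CLR reads when only fastq files are available, one rough indicator is the read accuracy implied by the quality strings, since HiFi reads are described earlier in this thread as Q20+. The sketch below assumes Phred+33 encoded, four-line FASTQ records; the function names are hypothetical. It is only indicative, because the quality values in a fastq may be placeholders rather than real estimates, and the k-mer spectrum remains the more direct evidence.

```python
import math


def read_phred(quality: str, offset: int = 33) -> float:
    """Phred score of a read, derived from its mean per-base error probability."""
    if not quality:
        raise ValueError("empty quality string")
    errors = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    mean_error = sum(errors) / len(errors)
    return -10 * math.log10(mean_error)


def fastq_qualities(lines):
    """Yield the quality line (4th line) of each four-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def fraction_below(qualities, threshold: float = 20.0) -> float:
    """Fraction of reads whose read-level Phred score is below the threshold."""
    scores = [read_phred(q) for q in qualities]
    return sum(s < threshold for s in scores) / len(scores)
```

A possible use is `with open("1.fastq") as fh: print(fraction_below(fastq_qualities(fh)))`; a large fraction of reads below Q20 would be a hint that the file is not purely HiFi.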

B10inform commented 3 years ago

Hi all,

I am getting a similar error. I have tried with -l0, with -I1, and without either.

/var/spool/slurmd/job33689861/slurm_script: line 82: 23511 Segmentation fault hifiasm -o PB53.hifi_reads.asm -l0 /home/hifi_reads.fastq.gz 2> PB53.hifi_reads.log

I have already tried four times and I get the same error each time. There is no error message mentioning a time limit or memory.

Any help would be great.

Thanks

PB53.hifi_reads.log.log

chhylp123 commented 3 years ago

Did hifiasm terminate at 'Writing reads to disk...' each time? Have you checked whether hifiasm successfully generated any bin files? Does your cluster have enough disk space?
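The two checks asked about above (free disk space and whether any bin files exist) can be scripted. This is a minimal sketch that assumes the bin files are written next to the `-o` prefix with names ending in `.bin`; the function name `check_output_location` is made up for the example.

```python
import glob
import os
import shutil


def check_output_location(prefix: str):
    """Return (free space in GiB on the output filesystem, list of existing *.bin files)."""
    out_dir = os.path.dirname(os.path.abspath(prefix))
    free_gib = shutil.disk_usage(out_dir).free / 2**30
    # glob.escape guards against special characters in the prefix itself.
    bin_files = sorted(glob.glob(glob.escape(prefix) + "*.bin"))
    return free_gib, bin_files


if __name__ == "__main__":
    free_gib, bin_files = check_output_location("PB53.hifi_reads.asm")
    print(f"free space: {free_gib:.1f} GiB")
    print("bin files:", bin_files or "none found")
```

Note that on a cluster, per-user quotas can be exhausted even when the filesystem itself reports free space, so this check is a first step only.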

B10inform commented 3 years ago

Yes, each time hifiasm terminated with "Writing reads to disk...". Each time the end of the log looks like this:

[M::ha_print_ovlp_stat] # overlaps: 158404749
[M::ha_print_ovlp_stat] # strong overlaps: 124952433
[M::ha_print_ovlp_stat] # weak overlaps: 33452316
[M::ha_print_ovlp_stat] # exact overlaps: 154520295
[M::ha_print_ovlp_stat] # inexact overlaps: 3884454
[M::ha_print_ovlp_stat] # overlaps without large indels: 158228473
[M::ha_print_ovlp_stat] # reverse overlaps: 94270991
Writing reads to disk...

It did not produce any files. The only file generated is the log file. I have enough space on my cluster.
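The overlap statistics in the log above can be pulled into a dictionary for a quick sanity check. The sketch below relies only on the line format visible in this post; the regular expression and the function name `parse_ovlp_stats` are assumptions for illustration, not a documented hifiasm log format.

```python
import re

# Matches entries such as "[M::ha_print_ovlp_stat] # strong overlaps: 124952433".
STAT_RE = re.compile(r"\[M::ha_print_ovlp_stat\] # ([^:\[]+): (\d+)")


def parse_ovlp_stats(log_text: str) -> dict:
    """Extract the overlap counters from a hifiasm log excerpt."""
    return {name.strip(): int(value) for name, value in STAT_RE.findall(log_text)}


log_tail = """
[M::ha_print_ovlp_stat] # overlaps: 158404749
[M::ha_print_ovlp_stat] # strong overlaps: 124952433
[M::ha_print_ovlp_stat] # weak overlaps: 33452316
[M::ha_print_ovlp_stat] # exact overlaps: 154520295
[M::ha_print_ovlp_stat] # inexact overlaps: 3884454
"""
stats = parse_ovlp_stats(log_tail)
# In this excerpt, strong + weak and exact + inexact both add up to the total.
print(stats["strong overlaps"] + stats["weak overlaps"] == stats["overlaps"])
print(stats["exact overlaps"] + stats["inexact overlaps"] == stats["overlaps"])
```

The counters being internally consistent is compatible with the overlap step having finished and the failure occurring while the results were written, but that reading is an inference from this one excerpt.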

chhylp123 commented 3 years ago

Would it be possible to share the fastq files with us? I cannot explain this issue, which occurred while writing the bin files... It looks like there is not enough disk space. Thank you in advance.

B10inform commented 3 years ago

Hi, sorry for the late response; I got busy with something else. The fastq files are big, so I thought of sending a portion of them. Before sending, I checked whether that portion of the fastq works, and I got no error. I believe the fastq file is OK.

However, I started again, converted the bam to fastq, and then ran hifiasm. Now I am getting memory issues: "slurmstepd: error: Detected 1 oom-kill event(s) in step 34676091.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler."

My genome size is ~3.2Mb. How much memory is enough for a genome of this size?

Best

chhylp123 commented 3 years ago

It depends on fastq. What's the size of your fastq file?

B10inform commented 3 years ago

It is ~143911174KB.
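For readers who find the size in KB hard to judge, a small conversion sketch follows. Whether "KB" here means 1024 or 1000 bytes is not stated in the thread, so both readings are shown; the function name `kb_to_sizes` is hypothetical.

```python
def kb_to_sizes(size_kb: int):
    """Return (GiB assuming 1 KB = 1024 bytes, GB assuming 1 KB = 1000 bytes)."""
    gib = size_kb * 1024 / 2**30
    gb = size_kb * 1000 / 1e9
    return gib, gb


gib, gb = kb_to_sizes(143_911_174)
print(f"{gib:.1f} GiB (binary KB) or {gb:.1f} GB (decimal KB)")
```

Either way the file is well over 100 GB, which gives a sense of scale for the memory discussion that follows.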

chhylp123 commented 3 years ago

Hard to say. I guess 100Gb RAM is safe.

iggyB commented 3 years ago

Hej,

Just wanted to clarify that the data set used by @daniazi is purely HiFi.

The genome in question is relatively complex and most likely triploid.

K-mer plots indicate insufficient coverage or presence of contaminants.

Cheers, Iggy

chhylp123 commented 3 years ago

@iggyB Just curious: have you figured out how to solve it?

iggyB commented 3 years ago

@chhylp123 I did a rather simple test and ran Redbean on this data set. Since Redbean produces haplotype-collapsed assemblies, it worked. The result is a very fragmented 3.4 Gbp assembly (which is close to the predicted 1N size). My recommendation is to significantly increase coverage and run hifiasm again :)
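One way to see why coverage could be limiting here is to divide the total depth by the ploidy. The sketch below reuses the approximate figures from this thread (~5.5 million reads of ~20 kb, a ~3.4 Gbp 1N size, a likely triploid genome) and assumes, as a simplification, that reads are shared evenly among the haplotype copies; `per_haplotype_coverage` is a name made up for this example.

```python
def per_haplotype_coverage(total_bases: float, haploid_size_bp: float, ploidy: int) -> float:
    """Depth per haplotype copy, assuming reads are split evenly across copies."""
    return total_bases / haploid_size_bp / ploidy


total_bases = 5.5e6 * 20_000          # ~110 Gb of HiFi data, from the original post
collapsed = total_bases / 3.4e9       # depth over the collapsed 1N assembly size
per_copy = per_haplotype_coverage(total_bases, 3.4e9, ploidy=3)
print(f"collapsed: {collapsed:.1f}x, per haplotype copy: {per_copy:.1f}x")
```

Under these assumptions the depth per haplotype copy is roughly a third of the collapsed figure, which would be consistent with a collapsing assembler succeeding where a haplotype-resolved one struggles. This is an illustrative calculation, not a diagnosis confirmed in the thread.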

chhylp123 commented 3 years ago

@iggyB Thank you so much! But how did this happen? In theory the coverage is enough. Is it caused by coverage bias?

mgarl-10 commented 2 years ago

Hello,

I'm having the same problem with my assemblies. Did you find an explanation? Is it related to low coverage?

chhylp123 commented 2 years ago

If you also have a weird k-mer plot, it is probably caused by insufficient coverage or the presence of contaminants. See https://hifiasm.readthedocs.io/en/latest/faq.html#why-does-hifiasm-stuck-or-crash

mgarl-10 commented 2 years ago

If you also have weird k-mer plot, it should be caused by insufficient coverage or presence of contaminants. See https://hifiasm.readthedocs.io/en/latest/faq.html#why-does-hifiasm-stuck-or-crash

Do you suggest checking for contaminants and then trying the assembly with hifiasm again?

chhylp123 commented 2 years ago

I guess so.

B10inform commented 2 years ago

If you also have weird k-mer plot, it should be caused by insufficient coverage or presence of contaminants. See https://hifiasm.readthedocs.io/en/latest/faq.html#why-does-hifiasm-stuck-or-crash

Hi chhylp123

My k-mer plot looks like this. image

It still fails with a segmentation fault when run as hifiasm -o PB53.hifi_reads.asm -l0 /home/hifi_reads.fastq.gz 2> PB53.hifi_reads.log. However, it works fine when I remove -l0.

chhylp123 commented 2 years ago

Interesting... Are you using v0.15.5?

B10inform commented 2 years ago

Yes, Version 2 also gives similar output.

chhylp123 commented 2 years ago

I haven't seen cases where -l3 works but -l0 doesn't... Would it be possible to share the bin files with us for debugging? I would really appreciate that.

Raysun61 commented 1 year ago

Hi!

I have run into the same problem: the k-mer distribution of my HiFi reads is also very weird. I am wondering whether you found the reason. Is it contamination, or something peculiar to the genome?

zt_new.log http://qb.cshl.edu/genomescope/analysis.php?code=KYiRVLeS8wqB3YUFDxOz

image

In fact, I also tested short reads using GenomeScope 1, but the result is strange as well. I have no idea what is going on! 😵‍💫 http://qb.cshl.edu/genomescope/analysis.php?code=cJ0m7Tq4PFe7FdOX7Ptv

image

chhylp123 commented 1 year ago

I feel that in most cases this is a contamination issue.

xiekunwhy commented 1 year ago

Hi all,

You may want to try kmerDedup (https://github.com/xiekunwhy/kmerDedup) if you are assembling a pooled sample (and getting a larger assembly size than expected) and/or want to select the longest contig set from different assemblers.

When hifiasm finds no peaks and you are sure that there is no contaminant, you can try other assemblers such as Flye, IPA and so on, then cat the results together and choose the longest non-redundant contig/scaffold set.

Best, Kun

ywddwed commented 1 year ago

Hi, I have run into the same trouble, and the k-mer distribution of my HiFi reads is very weird. The genome size of the target species is about 1.4 Gb, and my HiFi fastq files contain 46 Gbase in total with an average read length of ~16 kb. I think the coverage is enough. I was wondering whether you figured out the cause and, ultimately, what you did with the raw data to solve this problem.

Best, ydw

nohup.txtait

peiqi0807 commented 1 month ago

Hello, my k-mer distribution plot looks similar to yours. Have you found a good solution since then?