chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
534 stars 87 forks source link

error when generating fasta with gfa2fa #14

Closed kevfengler227 closed 3 years ago

kevfengler227 commented 4 years ago

Any idea what this error could be about? Or if it would impact the output fasta?

Thanks, KF

gfatools gfa2fa asm.p_ctg.gfa > p.contigs.fasta

[W::gfa_fix_utg] one or multiple A-lines on 'utg000549l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg001025l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg001466l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg003357l' are incorrect

lh3 commented 4 years ago

Could you checkout the latest master branch?

kevfengler227 commented 4 years ago

OK, thanks. I will give this a try. I have been troubleshooting a potential assembly issue and was wondering if this error was related. I have identified several >1 Mb contigs that are missing from the assembly (relative to a HiCanu assembly). However, it appears that the contigs are in the a_ctg.gfa rather than the p_ctg.gfa. How can this happen?

HenrivdGeest commented 4 years ago

Good point I was wondering the same thing. We also have few megabase sized contigs in the alternative. Also be aware that the namings in the a.ctg file are overlapping with p.ctg. i got into trouble later not being aware of it. Maybe worth a tiny fix? But that is not the question of the topic starter,sorry about that.

On Tue, 23 Jun 2020, 19:43 kevfengler227, notifications@github.com wrote:

OK, thanks. I will give this a try. I have been troubleshooting a potential assembly issue and was wondering if this error was related. I have identified several >1 Mb contigs that are missing from the assembly (relative to a HiCanu assembly). However, it appears that the contigs are in the a_ctg.gfa rather than the p_ctg.gfa. How can this happen?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/chhylp123/hifiasm/issues/14#issuecomment-648313186, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARZCFOUHFTN3DQJ66QJEYLRYDSS7ANCNFSM4OF23UHQ .

chhylp123 commented 4 years ago

Hi kevfengler227,

Hifiasm has a built-in purge_dup so that the p_ctg consists of purged contigs. In contrast, the contigs of hicanu seem not be purged. So I suggest to run Sanger purge_dups on hicanu's contigs to see if these >1 Mb contigs are still there. Please note that since Sanger purge_dups is not specifically designed for HiFi data, it might be over-purged or no enough purged. In my limited companions (hifiasm + built-in purge_dup) is better than (hicanu + Sanger purge_dups) with default options. Of course I might be wrong.

To check if the p_ctg is over-purged or no enough purged, there are two quick options: one is using BUSCO, another is to evaluate the haploid genome size by k-mer. Although sometimes BUSCO is also not such reliable.

chhylp123 commented 4 years ago

I agree the similar contig names are confused. Thanks for pointing that. I will fix it soon :)

Good point I was wondering the same thing. We also have few megabase sized contigs in the alternative. Also be aware that the namings in the a.ctg file are overlapping with p.ctg. i got into trouble later not being aware of it. Maybe worth a tiny fix? But that is not the question of the topic starter,sorry about that. On Tue, 23 Jun 2020, 19:43 kevfengler227, @.***> wrote: OK, thanks. I will give this a try. I have been troubleshooting a potential assembly issue and was wondering if this error was related. I have identified several >1 Mb contigs that are missing from the assembly (relative to a HiCanu assembly). However, it appears that the contigs are in the a_ctg.gfa rather than the p_ctg.gfa. How can this happen? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARZCFOUHFTN3DQJ66QJEYLRYDSS7ANCNFSM4OF23UHQ .

chhylp123 commented 4 years ago

BTW, in theory, at least for diploid genome, the p_ctg of hifiasm might be no enough purged, but shouldn't be over-purged

HenrivdGeest commented 4 years ago

I my case, a huge heterozygous part got purged, although major part does not align with primary part, just the flanks. Also purge dups purged it later again, only raising the minimum identify threshold to 90 got it not purged. Related to this, is it anywhere stored (in a gfa) how the a.ctg relates to the p.ctg? I am highly interested to see what the assembler thinks where these contigs belong. Ideally i would visualize this in bandage or so.

On Tue, 23 Jun 2020, 20:53 chhylp123, notifications@github.com wrote:

BTW, in theory, for diploid genome, the p_ctg of hifiasm might be no enough purged, but shouldn't be over-purged

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/chhylp123/hifiasm/issues/14#issuecomment-648351050, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARZCFNBGI6FFUGJMQJX6E3RYD24JANCNFSM4OF23UHQ .

chhylp123 commented 4 years ago

Yean, it's hard to say. We need to do more test on different datasets.

As for relationship, you can manually check by jointly looking p_ctg, a_ctg and r_utg. Both a_ctg and p_ctg are extracted from r_utg. Given a contig P1 in p_ctg, we can first get the reads that are consisted in P1. And then we grep those reads in r_utg to find related unitgs. Finally in a_ctg, we grep the reads consisted in those unitgs on r_utg.

kevfengler227 commented 4 years ago

Indeed. I have been getting record breaking assemblies for a variety of crop plant genomes with hifiasm. I have not noticed this issue before. In this case, I am troubleshooting missing contigs from an 11 Gb inbred hexaploid genome for which I already have a great reference built with HiCanu, BioNano and HiC. Therefore, it easy to to identify "missing" contigs. But HiCanu generated a contig N50 of 30 Mb while Hifiasm produced a contig N50 of 70 Mb! So I am definitely interested to see how the hifiasm assembly can be salvaged. And how it can be used for larger genomes.

However, I do not think the issue is with purge_dup. As I said, I find the missing contigs in the a_ctgs. There is not heterozygosity in this sample, but perhaps similar regions elsewhere in the genome are causing these to be classified as a_ctgs? They are distinct enough to assemble separately and align to the correct BioNano map, but somehow labeled as a different haplotype? Is there a parameter to increase the stringency of p_ctg versus a_ctg?

image

chhylp123 commented 4 years ago

Oh, sorry I misunderstood. If it is an inbred genome, you can disable the built-in purge_dup in hifiasm by '-l 0'. Just rerun hifiasm with bin files will quickly generate the assembly without purge_dup. In some cases, purge_dup is dangerous but we need it for diploid genome. For inbred genome, we shouldn't apply it.

kevfengler227 commented 4 years ago

I do like how purge_dup helps resolve overlapping contigs though. Is there a way to still have it do that? that is the best feature ever.

chhylp123 commented 4 years ago

There are two options you can increase:

-s FLOAT similarity threshold for duplicate haplotigs -O INT min number of overlapped reads for duplicate haplotigs

Increasing these two options will make purge_dups more careful. But as I said, introducing purge_dups for haploid genome might be dangerous.

kevfengler227 commented 4 years ago

awesome suggestions, thanks! I will give these ideas a try and let you know how it goes.

KF

chhylp123 commented 4 years ago

Thanks. Looking forward your results.

kevfengler227 commented 4 years ago

is 0.7-dirty-r256 the latest master branch? I am still seeing this error when redoing assembly from bin file

gfatools gfa2fa asm.p_ctg.gfa > p.contigs.fasta

[W::gfa_fix_utg] one or multiple A-lines on 'utg000549l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg001025l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg001466l' are incorrect [W::gfa_fix_utg] one or multiple A-lines on 'utg003357l' are incorrect

kevfengler227 commented 4 years ago

also, is the "l" necessary at the end of utg name? I keep search for utgs with a "1" in stead of a "l" at the end

kevfengler227 commented 4 years ago

OK, I tried turning off purge_dup but I still have a several large contigs from the same region ending up in the a_ctgs. However, I don't understand why because the contigs are unique in the genome (ie all the HiFi reads mapping to the contigs in the original assembly map uniquely). The simple solution is to just combine p_ctgs and a_ctgs but it would be nice to understand what is going on.

Here is the gfa for the a_ctgs. The big graph is composed of all the contigs from the missing region. What is going on here?

image

HenrivdGeest commented 4 years ago

Is is possible that the flanks or parts of the a_ctg are overlapping/similar to a contig in the p_ctg set?

On Fri, 26 Jun 2020, 07:04 kevfengler227, notifications@github.com wrote:

OK, I tried turning off purge_dup but I still have a several large contigs from the same region ending up in the a_ctgs. However, I don't understand why because the contigs are unique in the genome (ie all the HiFi reads mapping to the contigs in the original assembly map uniquely). The simple solution is to just combine p_ctgs and a_ctgs but it would be nice to understand what is going on.

Here is the gfa for the a_ctgs. The big graph is composed of all the contigs from the missing region. What is going on here? [image: image] https://user-images.githubusercontent.com/20604003/85822440-765fc800-b740-11ea-8625-c7dca8feafe2.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/chhylp123/hifiasm/issues/14#issuecomment-649968157, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARZCFOXVA4TVGRJAAT7SRTRYQT5LANCNFSM4OF23UHQ .

lh3 commented 4 years ago

The gfatools message is due to a bug in gfatools. Try the gfatools github HEAD (not hifiasm HEAD), or use awk

awk '/^S/{print ">"$2;print $3}' graph.gfa > seq.fa

One way to investigate the issue is to examine the r_utg graph. You can get read names on that a_ctg unitig, pull out which unitig(s) in r_utg corresponds to that a_ctg unitig and then inspect the local r_utg graph in bandage. Or you can give a_ctg.noseq.gfa and r_utg.noseq.gfa to us. We may have a look. noseq.gfa has no sequences.

Generally, we have observed that purge_dups may introduce misassemblies. We will try to fix that in the near future.

kevfengler227 commented 4 years ago

the "missing" contigs that end up in the a_ctg all of this graph structure, with a pair of utgs at the ends.

image

chhylp123 commented 4 years ago

Thanks. If you disable purge_dup, is this subgraph still at a_ctg?

kevfengler227 commented 4 years ago

yes

chhylp123 commented 4 years ago

Well, that is not purge_dup problem. Could you please share the corresponding subgraph in r_utg? Thank you so much.

kevfengler227 commented 4 years ago

Like this? I just did a grep for the utg from the noseq.gfa file

utg00456l.noseq.gfa.txt

chhylp123 commented 4 years ago

Sorry for the late reply. Could you please run gfatools like:

./gfatools view -l utg004562l -r 'subset radius' r_utg.noseq.gfa > sub.gfa

We also found some mis-joins introduced by '-l2'. We are thinking how to fix it, looks like we need totally new purge_dups strategy.

kevfengler227 commented 4 years ago

here you go. thanks!

I'd like to see a purge_dup setting optimized for resolving overlapping contigs in inbreds

sub.gfa.txt

chhylp123 commented 4 years ago

Thanks a lot. I'll let you know once it has been fixed.

chhylp123 commented 4 years ago

We have updated 0.8-dirty-r280. It should be able to resolve the purge_dups problems with '-l2' @kevfengler227 . If it still doesn't work, please adjust the following two parameters:

--pri-range INT,INT --purge-cov INT

This version also gives different contig names for p_ctg.gfa and a_ctg.gfa @HenrivdGeest .

kevfengler227 commented 4 years ago

OK, thanks. I'll check this out. In the meantime I have simply been combining the p and a contigs, which is fine for an inbred.

KF

kevfengler227 commented 4 years ago

I am still getting unitigs with this structure in the a_contigs that essentially represent missing contigs from the p_ctgs

image

chhylp123 commented 4 years ago

What's the coverage of this unitig?

kevfengler227 commented 4 years ago

ah yes, now I can tell you that easily :)

The big one is 32x, the smaller unitigs on the left flank are 21x and 11x, while the smaller unitigs on the right are 31x and 33x.

chhylp123 commented 4 years ago

Thanks a lot. Could you let me know the following number on the log file?

[M::adjust_utg_by_primary] primary contig coverage range:

And what's the coverage of the datasets?

kevfengler227 commented 4 years ago

[M::adjust_utg_by_primary] primary contig coverage range: [54, 81]

The actual coverage is ~35x, but here are stats from the first few p_ctgs

S ptg000001l LN:i:53035552 rd:i:31 S ptg000002l LN:i:1104202 rd:i:29 S ptg000003l LN:i:67009668 rd:i:33 S ptg000004l LN:i:71000221 rd:i:32 S ptg000005l LN:i:59450695 rd:i:31 S ptg000006l LN:i:86605882 rd:i:32 S ptg000007l LN:i:67561545 rd:i:32 S ptg000008l LN:i:15013412 rd:i:33 S ptg000009l LN:i:203366888 rd:i:33 S ptg000010l LN:i:122127776 rd:i:33 S ptg000011l LN:i:56031518 rd:i:32 S ptg000012l LN:i:3312017 rd:i:32 S ptg000013l LN:i:103782305 rd:i:32 S ptg000014l LN:i:27864965 rd:i:37 S ptg000015l LN:i:1582018 rd:i:45 S ptg000016l LN:i:131546076 rd:i:33 S ptg000017l LN:i:57486896 rd:i:32 S ptg000018l LN:i:9544135 rd:i:26 S ptg000019l LN:i:187848778 rd:i:32 S ptg000020l LN:i:78780813 rd:i:32 S ptg000021l LN:i:97862876 rd:i:33 S ptg000022l LN:i:26812537 rd:i:31 S ptg000023l LN:i:156256874 rd:i:33 S ptg000024l LN:i:106370367 rd:i:33 S ptg000025l LN:i:104728742 rd:i:32 S ptg000026l LN:i:35191994 rd:i:32 S ptg000027l LN:i:46465324 rd:i:37 S ptg000028l LN:i:83559813 rd:i:32 S ptg000029l LN:i:54971839 rd:i:33 S ptg000030l LN:i:60433326 rd:i:32 S ptg000031l LN:i:58204438 rd:i:32 S ptg000032l LN:i:12685677 rd:i:32

chhylp123 commented 4 years ago

I guess this is an inbred genome. The 'primary contig coverage range' is used to keep all contigs with such coverage at p_ctg, which should be around 35. So the coverage range automatically calculated by hifiasm is wrong : (. One solution is to adjust '--pri-range' to 30,40.

I really appreciate if you can share the full log file with us. I would like to check why the automatically calculated coverage is wrong... Thank you so much.

kevfengler227 commented 4 years ago

that still would not explain how that unitig ended up in a_ctg with 32x coverage? There are p_ctgs with less than that.

Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::311.1030.97] ==> loaded corrected reads and overlaps from disk Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.7-dirty-r256 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -l 0 -o OAT.asm -t 32 OAT.asm.ec.bin [M::main] Real time: 2658.482 sec; CPU: 2701.479 sec; Peak RSS: 184.506 GB Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::305.8171.00] ==> loaded corrected reads and overlaps from disk Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.7-dirty-r256 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -l 0 -o OAT.asm -t 24 -s .95 OAT.asm.ec.bin [M::main] Real time: 2624.729 sec; CPU: 2653.614 sec; Peak RSS: 184.506 GB Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::330.3191.00] ==> loaded corrected reads and overlaps from disk Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.7-dirty-r256 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -l 0 -o OAT.asm -t 24 -s .95 OAT.asm.ec.bin [M::main] Real time: 2772.682 sec; CPU: 2804.904 sec; Peak RSS: 184.506 GB Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::309.8400.99] ==> loaded corrected reads and overlaps from disk Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.7-dirty-r256 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -o OAT.asm -t 24 -s .99 OAT.asm.ec.bin [M::main] Real time: 2807.922 sec; CPU: 4507.243 sec; Peak RSS: 184.508 GB Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::320.3890.99] ==> loaded corrected reads and overlaps from disk Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.7-dirty-r256 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -l 0 -o OAT.asm -t 24 OAT.asm.ec.bin [M::main] Real time: 2714.697 sec; CPU: 2746.570 sec; Peak RSS: 184.507 GB Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::306.3780.99] ==> loaded corrected reads and overlaps from disk [M::ha_analyze_count] lowest: count[4095] = 0 [M::ha_ft_gen::322.1220.99@155.346GB] ==> filtered out 0 k-mers occurring -5 or more times [M::ha_opt_update_cov] updated max_n_chain to 100 Reads has been loaded. [M::ha_pt_gen::1453.16810.83] ==> counted 193828012 distinct minimizer k-mers [M::ha_pt_gen] count[4095] = 91844 (for sanity check) [M::ha_analyze_count] lowest: count[6] = 133515 [M::ha_analyze_count] highest: count[34] = 9957079 [M::ha_hist_line] 1: **** 4410425 [M::ha_hist_line] 2: * 489932 [M::ha_hist_line] 3: 210318 [M::ha_hist_line] 4: 157461 [M::ha_hist_line] 5: 134771 [M::ha_hist_line] 6: 133515 [M::ha_hist_line] 7: 136256 [M::ha_hist_line] 8: 147444 [M::ha_hist_line] 9: 155568 [M::ha_hist_line] 10: 175155 [M::ha_hist_line] 11: 193098 [M::ha_hist_line] 12: 214379 [M::ha_hist_line] 13: 244887 [M::ha_hist_line] 14: * 284389 [M::ha_hist_line] 15: * 334252 [M::ha_hist_line] 16: * 408539 [M::ha_hist_line] 17: 512611 [M::ha_hist_line] 18: * 662238 [M::ha_hist_line] 19: *** 870431 [M::ha_hist_line] 20: * 1144514 [M::ha_hist_line] 21: ***** 1505482 [M::ha_hist_line] 22: **** 1969460 [M::ha_hist_line] 23: ** 2542263 [M::ha_hist_line] 24: **** 3229518 [M::ha_hist_line] 25: **** 4028545 [M::ha_hist_line] 26: * 4922514 [M::ha_hist_line] 27: *** 5835199 [M::ha_hist_line] 28: **** 6798814 [M::ha_hist_line] 29: * 7700190 [M::ha_hist_line] 30: ** 8525183 [M::ha_hist_line] 31: **** 9206661 [M::ha_hist_line] 32: ***** 9698197 [M::ha_hist_line] 33: **** 9941764 [M::ha_hist_line] 34: **** 9957079 [M::ha_hist_line] 35: ** 9727482 [M::ha_hist_line] 36: *** 9288091 [M::ha_hist_line] 37: * 8670765 [M::ha_hist_line] 38: **** 7916731 [M::ha_hist_line] 39: *** 7079573 [M::ha_hist_line] 40: ** 6197537 [M::ha_hist_line] 41: * 5326100 [M::ha_hist_line] 42: ***** 4476665 [M::ha_hist_line] 43: * 3710962 [M::ha_hist_line] 44: ** 3015602 [M::ha_hist_line] 45: **** 2417844 [M::ha_hist_line] 46: **** 1906153 [M::ha_hist_line] 47: 1490371 [M::ha_hist_line] 48: **** 1159643 [M::ha_hist_line] 49: ** 898080 [M::ha_hist_line] 50: 697630 [M::ha_hist_line] 51: ** 549605 [M::ha_hist_line] 52: * 450949 [M::ha_hist_line] 53: ** 384637 [M::ha_hist_line] 54: *** 349127 [M::ha_hist_line] 55: 332445 [M::ha_hist_line] 56: 329526 [M::ha_hist_line] 57: 338896 [M::ha_hist_line] 58: 357823 [M::ha_hist_line] 59: 378477 [M::ha_hist_line] 60: 405362 [M::ha_hist_line] 61: 432800 [M::ha_hist_line] 62: 457017 [M::ha_hist_line] 63: 481262 [M::ha_hist_line] 64: 502740 [M::ha_hist_line] 65: 520144 [M::ha_hist_line] 66: 534159 [M::ha_hist_line] 67: 539785 [M::ha_hist_line] 68: 541200 [M::ha_hist_line] 69: 537127 [M::ha_hist_line] 70: 527242 [M::ha_hist_line] 71: 509534 [M::ha_hist_line] 72: 493151 [M::ha_hist_line] 73: 468805 [M::ha_hist_line] 74: 442945 [M::ha_hist_line] 75: 413826 [M::ha_hist_line] 76: 384334 [M::ha_hist_line] 77: 352303 [M::ha_hist_line] 78: 324584 [M::ha_hist_line] 79: 294710 [M::ha_hist_line] 80: * 265720 [M::ha_hist_line] 81: 239583 [M::ha_hist_line] 82: 216483 [M::ha_hist_line] 83: 196868 [M::ha_hist_line] 84: 178286 [M::ha_hist_line] 85: * 161561 [M::ha_hist_line] 86: 148289 [M::ha_hist_line] 87: 137335 [M::ha_hist_line] 88: 127779 [M::ha_hist_line] 89: 121338 [M::ha_hist_line] 90: 116645 [M::ha_hist_line] 91: 113181 [M::ha_hist_line] 92: 111284 [M::ha_hist_line] 93: 110118 [M::ha_hist_line] 94: 109314 [M::ha_hist_line] 95: 108635 [M::ha_hist_line] 96: 108971 [M::ha_hist_line] 97: 109151 [M::ha_hist_line] 98: 109712 [M::ha_hist_line] 99: 110382 [M::ha_hist_line] 100: 110637 [M::ha_hist_line] 101: 110767 [M::ha_hist_line] 102: 109607 [M::ha_hist_line] 103: 109585 [M::ha_hist_line] 104: 108049 [M::ha_hist_line] 105: 105884 [M::ha_hist_line] 106: 102221 [M::ha_hist_line] 107: 100288 [M::ha_hist_line] 108: 96847 [M::ha_hist_line] 109: 93608 [M::ha_hist_line] 110: 90481 [M::ha_hist_line] 111: 85985 [M::ha_hist_line] 112: 82794 [M::ha_hist_line] 113: 79393 [M::ha_hist_line] 114: 76281 [M::ha_hist_line] 115: 72417 [M::ha_hist_line] 116: 69836 [M::ha_hist_line] 117: 66932 [M::ha_hist_line] 118: 63883 [M::ha_hist_line] 119: 61870 [M::ha_hist_line] 120: 58830 [M::ha_hist_line] 121: 57383 [M::ha_hist_line] 122: 55983 [M::ha_hist_line] 123: 54438 [M::ha_hist_line] 124: 53433 [M::ha_hist_line] 125: 52312 [M::ha_hist_line] 126: 51410 [M::ha_hist_line] 127: 50812 [M::ha_hist_line] 128: 50387 [M::ha_hist_line] rest: ** 5150228 [M::ha_analyze_count] left: none [M::ha_analyze_count] right: count[68] = 541200 [M::ha_pt_gen] peak_hom: 68; peak_het: 34 [M::ha_pt_gen::2766.451*11.82] ==> indexed 8839830481 positions Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. M::hap_recalculate_peaks has done. Writing reads to disk... Reads has been written. Writing ma_hit_ts to disk... ma_hit_ts has been written. Writing ma_hit_ts to disk... ma_hit_ts has been written. bin files have been written. Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... [M::purge_dups] purge duplication coverage threshold: 85 [M::adjust_utg_by_primary] primary contig coverage range: [54, 81] Writing primary contig GFA to disk... Writing alternate contig GFA to disk... [M::main] Version: 0.8-dirty-r280 [M::main] CMD: /pangenomes/TOOLS/hifiasm/hifiasm -l 0 -o OAT.asm -t 24 OAT.asm.ec.bin [M::main] Real time: 6719.900 sec; CPU: 36614.264 sec; Peak RSS: 247.776 GB

chhylp123 commented 4 years ago

It is unavoidable that a few contigs have been mis-classified to a_ctg. The reason might be these regions are very repetitive. But '--pri-range' will recover them. For example, if you manually set:

--pri-range 28,40

All contigs with 28X-40X coverage will be kept on p_ctg. In this case, the longest contig at your example will be recovered to p_ctg. The issue here is hifiasm thinks 34 is the het peak (het coverage) and 68 is the hom peak (hom coverage), so it think this sample is a diploid sample. I need to consider how to make hifiasm ignore the very minor peak 68.

kevfengler227 commented 4 years ago

so how does this contig end up in p_ctg? S ptg000173l LN:i:58354 rd:i:2

chhylp123 commented 4 years ago

The workflow is like: hifiasm first makes some contigs to p_ctg, and make others to a_ctg. This step only relies on phasing information. After that, hifiasm will check the coverage of all contigs at a_ctg, and recover any contig with suitable coverage to p_ctg. So if ptg000173l is classified into p_ctg at first step, hifiasm won't do anything for it. But I guess it is easy to filter out such short contigs with very low coverage, which might be noise.

kevfengler227 commented 4 years ago

OK, thanks. That makes sense now. Sounds good.

KF

chhylp123 commented 4 years ago

Thanks for your helpful suggestion!

kevfengler227 commented 4 years ago

unfortunately, adding the pri-range does not recover the missing contigs from the a_ctgs

hifiasm -l 0 -o OAT.asm -t 24 OAT.asm.ec.bin --pri-range 28,40

example a_ctg atg000005l LN:i:9342622 rd:i:33

There is something strange about the structure below that makes it a_ctg still

image

chhylp123 commented 4 years ago

Let me check if it is a bug...

chhylp123 commented 4 years ago

@kevfengler227 v0.9 uses a stringent threshold so it may miss some contigs. I have released v0.10 which may help. Please check if it works. And for homozygous samples, I recommend you to recover all a_ctg contigs at 35x coverage (your sequencing depth) back to p_ctg.