chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

What is the strategy for Ultra-long ONT integration? #657

Open polchan opened 3 months ago

polchan commented 3 months ago

Hello, Cheng!

I would like to ask how hifiasm uses ultra-long (UL) ONT reads. Are the ONT reads used only to assist with scaffolding, or are their sequences merged into the final assembly? I have published ONT data for the same species. Can I use it to help assemble another sample of that species? The two samples differ only in cultivar.

Thank you!

zhaotao1987 commented 3 months ago

I have exactly the same question.

polchan commented 3 months ago

@lh3

chhylp123 commented 3 months ago

@zhaotao1987 @polchan Sorry for the late reply; I was quite busy this month. The short answer is no. UL reads are mainly used to resolve repeats and fill gaps rather than for scaffolding, so it is better to have all data from the same sample.

zhaotao1987 commented 2 months ago

Dear Haoyu, @chhylp123 @polchan Thanks very much for your response. I think 'scaffolding' may be confusing here, so I would like to restate my question. By 'fill gaps', which of the following do you mean?

Scenario 1: UL reads help resolve more repetitive regions of the string graph built from the HiFi reads, so that these regions are joined correctly and the number of gaps is reduced as a result (thus, 'fill gaps').

Scenario 2: At some stage, UL reads are used to fill gaps in the conventional sense, by finding overlaps with contig ends and filling the gap between them.

The difference is that in scenario 1 the actual UL sequences are not used; they serve only for alignment and for finding the correct path. In scenario 2 the UL read sequences themselves are integrated into the assembly.

I guess you mean scenario 1? In the gfa file that leads to the final assembly, it seems to me that only CCS reads are included and no UL reads. UL reads are also error-prone, and hifiasm does not seem to error-correct them before integrating them. If it is scenario 1, I think using ONT reads, or even a HiFi assembly, from a closely related species may be doable, just to resolve the repetitive regions of the string graph built from the sample's own HiFi reads.

I hope I made myself clear, thanks very much!

Best, Tao

chhylp123 commented 2 months ago

Hifiasm actually takes advantage of UL reads in both scenarios. But in practice, scenario 2 is rare.

polchan commented 1 month ago

Thanks very much for your response. We have a new question. In the gfa file we have identified sequences not only from CCS reads but also from ONT reads (SRR24941509). However, we are unsure where the sequences labeled 'scaf' originate. And why are all of these 'scaf' sequences 116 base pairs long?

A h2tg000002l 31694941 - scaf 0 116 id:i:2720614 HG:A:m
A h2tg000003l 19927517 + SRR24941509.98734 0 580 id:i:2720615 HG:A:m
A h2tg000009l 21858274 + scaf 0 116 id:i:2720629 HG:A:m
A h2tg000016l 20021719 + scaf 0 116 id:i:2720618 HG:A:a
A h2tg000017l 12012600 + scaf 0 116 id:i:2720638 HG:A:m
A h2tg000262c 48300 - SRR24941509.81037 0 14119 id:i:2720650 HG:A:m
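
For anyone who wants to check this across a whole gfa file, here is a minimal Python sketch that tallies the A lines by where the read name appears to come from. It is only an illustration and rests on several assumptions: that the columns are A, contig name, start position on the contig, strand, read name, read start, read end, followed by optional tags; that ONT reads can be recognised by the SRR24941509. prefix seen in the records above; and that the length of a piece is read end minus read start. The function names (parse_a_line, classify, summarize) are hypothetical and not part of hifiasm. The sketch does not explain what 'scaf' is; it only counts such records and their lengths.

```python
from collections import Counter

# The six A lines quoted in this thread (space-separated as pasted; real
# GFA files are tab-delimited, and str.split() handles both).
EXAMPLE_LINES = [
    "A h2tg000002l 31694941 - scaf 0 116 id:i:2720614 HG:A:m",
    "A h2tg000003l 19927517 + SRR24941509.98734 0 580 id:i:2720615 HG:A:m",
    "A h2tg000009l 21858274 + scaf 0 116 id:i:2720629 HG:A:m",
    "A h2tg000016l 20021719 + scaf 0 116 id:i:2720618 HG:A:a",
    "A h2tg000017l 12012600 + scaf 0 116 id:i:2720638 HG:A:m",
    "A h2tg000262c 48300 - SRR24941509.81037 0 14119 id:i:2720650 HG:A:m",
]


def parse_a_line(line):
    """Parse one 'A' line into a dict; return None for any other record."""
    fields = line.split()
    if len(fields) < 7 or fields[0] != "A":
        return None
    return {
        "contig": fields[1],
        "contig_start": int(fields[2]),
        "strand": fields[3],
        "read": fields[4],
        "read_start": int(fields[5]),
        "read_end": int(fields[6]),
        "tags": fields[7:],
    }


def classify(read_name, ont_prefix="SRR24941509."):
    """Label a read name as 'scaf', 'ont', or 'other' (assumed naming scheme)."""
    if read_name == "scaf":
        return "scaf"
    if read_name.startswith(ont_prefix):
        return "ont"
    return "other"


def summarize(lines, ont_prefix="SRR24941509."):
    """Count A lines per source and collect the lengths of 'scaf' pieces."""
    counts = Counter()
    scaf_lengths = Counter()
    for line in lines:
        rec = parse_a_line(line)
        if rec is None:
            continue  # skip S, L, and other non-A records
        kind = classify(rec["read"], ont_prefix)
        counts[kind] += 1
        if kind == "scaf":
            scaf_lengths[rec["read_end"] - rec["read_start"]] += 1
    return counts, scaf_lengths


if __name__ == "__main__":
    counts, scaf_lengths = summarize(EXAMPLE_LINES)
    print(dict(counts))
    print(dict(scaf_lengths))
```

To run it on a full file, pass an open file handle instead of EXAMPLE_LINES, for example summarize(open(path)) with path pointing at your own gfa file. Non-A records are skipped, so the whole file can be streamed line by line.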