aaranyue / quarTeT

A telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification
http://atcgn.com:8080/quarTeT/home.html

Flanking sequence contains gap #8

Closed: yangchouxianyu closed this issue 5 months ago

yangchouxianyu commented 1 year ago

Hi, thank you for the software. I used hifiasm assembly results to fill the gaps in my scaffolds, and the following error occurred: [Error] Flanking sequence contains gap. Recommend to lower -f parameter or check your file. What could be the reason for this?

Echoring commented 1 year ago

As the error says, the flanking 5000 bp (default) of a gap contains another gap. GapFiller cannot fill two gaps that are too close together. If there truly is a gap, lower the -f parameter (this will reduce accuracy). Alternatively, you may consider discarding the intervening sequence, merging the two gaps into one, and filling them as a whole.

There is a bug in which a small number of Ns representing unknown bases are identified as a gap. If this is the case, try the updated v1.1.2.
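To see which gaps would trigger this error, the condition described above can be sketched in a few lines of Python. This is only an illustration, not quarTeT's actual implementation: the function names `find_n_runs` and `gaps_with_gapped_flanks` are hypothetical, gaps are assumed to be runs of N, and `flank` stands in for the -f value. The `min_len` argument shows how very short N runs (unknown bases rather than real gaps) could be ignored.

```python
import re

def find_n_runs(seq, min_len=1):
    """Return 0-based, half-open (start, end) intervals of N runs
    that are at least min_len bases long."""
    return [m.span() for m in re.finditer(r"[Nn]+", seq)
            if m.end() - m.start() >= min_len]

def gaps_with_gapped_flanks(runs, seq_len, flank=5000):
    """Return the gaps whose left or right flank window overlaps another gap."""
    flagged = []
    for i, (start, end) in enumerate(runs):
        left = (max(0, start - flank), start)
        right = (end, min(seq_len, end + flank))
        for j, (s, e) in enumerate(runs):
            if i == j:
                continue
            # Standard overlap test for half-open intervals.
            hits_left = s < left[1] and e > left[0]
            hits_right = s < right[1] and e > right[0]
            if hits_left or hits_right:
                flagged.append((start, end))
                break
    return flagged

# Toy scaffold: two gaps 8 bp apart, and a third gap far from both.
seq = "A" * 20 + "N" * 5 + "A" * 8 + "N" * 5 + "A" * 30 + "N" * 5 + "A" * 20
runs = find_n_runs(seq)
flagged = gaps_with_gapped_flanks(runs, len(seq), flank=10)
print(runs)     # all N runs
print(flagged)  # gaps whose 10 bp flank contains another gap
```

On a real assembly the same functions would be applied per sequence with `flank=5000`; lowering `flank` shrinks the flagged list, which mirrors the advice to lower -f.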

JhinAir commented 1 year ago

Hi, I tried v1.1.2 but still got this error. Could you please check again? @Echoring Thank you!

yangchouxianyu commented 1 year ago

I checked my file and found the error. For your reference.

JhinAir commented 1 year ago

I'm wondering why GapFiller cannot fill two gaps that are too close together. I think such cases could be prevalent.

Echoring commented 1 year ago

GapFiller cuts each gap's flanking 5000 bp (default) as an anchor, so if there is a gap in the anchor, the alignment of the anchor will be affected. Also, if two gaps are that close, it means a very short contig was assembled in. quarTeT assumes the draft genome is assembled from highly continuous contigs (>50000 bp by default), so it hasn't been designed to handle this.
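One way to check whether a draft genome meets this assumption is to list the contigs that sit between two gaps and are shorter than the expected minimum. The sketch below is an illustration only: `short_interior_contigs` is a hypothetical helper, gaps are assumed to be runs of N, and `min_len` is a stand-in for the 50000 bp figure mentioned above.

```python
import re

def contigs_between_gaps(seq):
    """Return 0-based, half-open (start, end) intervals of non-N stretches."""
    return [m.span() for m in re.finditer(r"[^Nn]+", seq)]

def short_interior_contigs(seq, min_len=50000):
    """Return contigs bounded by gaps on both sides that are shorter than min_len."""
    return [(s, e) for s, e in contigs_between_gaps(seq)
            if s > 0 and e < len(seq) and e - s < min_len]

# Toy scaffold: an 8 bp contig is squeezed between the first two gaps.
scaffold = "A" * 20 + "N" * 5 + "A" * 8 + "N" * 5 + "A" * 30 + "N" * 5 + "A" * 20
short = short_interior_contigs(scaffold, min_len=10)
print(short)
```

Any contig reported here is a place where the anchors of two neighbouring gaps would run into each other.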

Isoris commented 7 months ago

Do you think that running asset/detgaps > gaps.bed, followed by bedtools merge -d 5000 gaps.bed > merged.gaps.bed, and then remasking the FASTA is a viable option?

Can you give a solution to remove or solve the -f error?

Or implement a function to specify the minimum overlap, like the inverse of -f? Or a filtering option that would allow the user to discard regions that are below the anchor range, e.g. mask the interval and work on the other intervals?

thanks
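The merge-and-remask idea asked about above can be tried out in plain Python before running it on a real assembly. This is a minimal sketch under stated assumptions, not a quarTeT feature: `merge_close` and `remask` are hypothetical helpers, gaps are assumed to be runs of N, and the merge rule (join intervals whose distance is at most `max_dist`) is meant to approximate what bedtools merge -d does. Note that remasking replaces the short sequence between the merged gaps with N, so that sequence is discarded.

```python
import re

def find_gaps(seq):
    """Return 0-based, half-open (start, end) intervals of N runs."""
    return [m.span() for m in re.finditer(r"[Nn]+", seq)]

def merge_close(intervals, max_dist):
    """Merge intervals that overlap or lie within max_dist bases of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_dist:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def remask(seq, intervals):
    """Overwrite every interval with N, keeping the sequence length unchanged."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)

# Toy scaffold: the first two gaps are 8 bp apart, the third is 30 bp further on.
draft = "A" * 20 + "N" * 5 + "A" * 8 + "N" * 5 + "A" * 30 + "N" * 5 + "A" * 20
merged_gaps = merge_close(find_gaps(draft), max_dist=10)
masked = remask(draft, merged_gaps)
print(merged_gaps)
print(masked)
```

With a real genome, `max_dist` would be set to the flank length (5000 by default) so that no remaining gap has another gap inside its anchor.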

Echoring commented 7 months ago

I tried a pre-release to solve this issue. In this new version, if a flanking sequence contains a gap, that gap is simply skipped instead of the entire program exiting.

Isoris commented 7 months ago

Because those small gaps can be fixed with Pilon anyway, right? Like Pilon first, then quarTeT?