c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

Scaffolding by yahs introduces 200 bp gaps in the assembly #87

Open janina-rinke opened 7 months ago

janina-rinke commented 7 months ago

Hi,

thank you very much for this nice tool. It does a great job of scaffolding our ONT-generated assembly using Hi-C reads.

However, after scaffolding with yahs, gaps of a fixed 200 bp length are introduced, as can be seen in the final.agp file (see below). I am wondering how this occurs and whether I could set a parameter to avoid such gaps in the final scaffolds. For example, I would like to avoid the gap introduced on scaffold_1 at positions 19784001-19784200.

Looking at the documentation, I could not find any parameter that controls the gaps introduced in the AGP file. Thanks!

scaffold_1      1       19784000        1       W       old_scaffold_1  1       19784000        +
scaffold_1      19784001        19784200        2       N       200     scaffold        yes     proximity_ligation
scaffold_1      19784201        31747224        3       W       old_scaffold_1  19784001        31747024        +
scaffold_2      1       22623878        1       W       old_scaffold_3  1       22623878        -
scaffold_2      22623879        22624078        2       N       200     scaffold        yes     proximity_ligation
scaffold_2      22624079        22655529        3       W       old_scaffold_14 1       31451   +
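
For anyone who wants to list these gap records programmatically, here is a minimal Python sketch. It assumes the standard AGP column layout (component type in column 5, gap length in column 6 for gap records) and splits on any whitespace, since the excerpt above may have lost its tabs. The function name `list_gaps` is made up for illustration and is not part of yahs.

```python
def list_gaps(agp_text):
    """Return (scaffold, start, end, gap_length) for every gap record."""
    gaps = []
    for line in agp_text.splitlines():
        line = line.strip()
        # Skip blank lines and AGP comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Column 5 is the component type: N (known size) or U (unknown size) marks a gap.
        if len(fields) >= 6 and fields[4] in ("N", "U"):
            gaps.append((fields[0], int(fields[1]), int(fields[2]), int(fields[5])))
    return gaps


# The first three records from the excerpt above (whitespace-separated).
example = """\
scaffold_1 1 19784000 1 W old_scaffold_1 1 19784000 +
scaffold_1 19784001 19784200 2 N 200 scaffold yes proximity_ligation
scaffold_1 19784201 31747224 3 W old_scaffold_1 19784001 31747024 +
"""

for scaffold, start, end, length in list_gaps(example):
    print(f"{scaffold}\t{start}\t{end}\t{length}")
```

On the three example records this reports a single 200 bp gap on scaffold_1 at 19784001-19784200.
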

Sven-Winter commented 6 months ago

Why would you not want the gaps? A scaffold is supposed to consist of contigs linked by gaps, and yahs uses a fixed run of 200 Ns for each gap. If you want to get rid of them, you need to run a gap-closing step.

MboiTui commented 5 months ago

To my understanding, the gaps are there for a reason (e.g., two contigs were found to be contiguous based on contact data, but no sequence overlapping the two contigs was found, so they were placed next to each other with an arbitrary gap of 200 bp). If you want to close the gaps, the best way is to increase coverage or add ultra-long reads on top.

c-zhou commented 5 months ago

Thanks @Sven-Winter and @MboiTui

@janina-rinke, yes, we put some N's between two contigs in scaffolds so people know that some sequence is missing there.

Best, Chenxi
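
To check where the N runs described in this thread ended up in the scaffold sequences themselves, something like the following could be used. It is only a sketch: `find_n_runs` and the toy sequence are hypothetical, and it works on a plain sequence string rather than reading the FASTA that yahs writes.

```python
import re


def find_n_runs(sequence, min_length=1):
    """Return 0-based, end-exclusive (start, end) spans of consecutive N/n characters."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_length
    ]


# Toy scaffold: two short "contigs" joined by a 200 bp run of Ns.
toy = "ACGT" + "N" * 200 + "TTGA"
print(find_n_runs(toy))  # [(4, 204)]
```

Passing `min_length=200` would keep only runs at least as long as the fixed gap size mentioned above and ignore shorter stretches of N that were already present in the input contigs.
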