c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License
122 stars 18 forks source link

YAHS merges two chromosomes in alternate haplotype #88

Open MboiTui opened 3 months ago

MboiTui commented 3 months ago

Hello,

I am writing here as I could not find a google group for yahs. This is not a software issue, more of an unexpected behaviour in my data I could use help interpreting.

I am working with Pacbio HiFi and Arima HiC data, and I followed the VGP pipeline to produce the assembly (briefly: merqury > hifiasm > purge duplicates > yahs).

The result is a very contiguous and good assembly overall (1n = 13; L90 = 12, N50 = 224 Mb. Expected genome size is 2.7 Gb)

Yet, for the alternate haplotype, I get two chromosome merged into one. I identified this based on comparisons to linkage maps for the same species, and comparison between the two haplotypes using chromeister. The misjoin is also very obvious in the pretext map for the joined chromosomes.

Linkage map versus haplotype 1 Screen Shot 2024-06-04 at 6 09 00 pm

Linkage map versus haplotype 2 Screen Shot 2024-06-04 at 6 08 48 pm

Haplotype 1 versus Haplotype 2 Screen Shot 2024-06-04 at 6 18 37 pm

Pretext map for misjoined scaffold in haplotype 2 Screen Shot 2024-06-04 at 6 24 42 pm

I was wondering what could be causing this, as the HiC data does not seem to point to many contacts between these two scaffolds? And if possible, how could I address this to avoid the misjoin?

Cheers, LVB

c-zhou commented 3 months ago

Hello @MboiTui,

Yes, it is clear from the HiC map they should not be joined.

Have you zoomed into that region to check the local HiC contacts? Probably at the scale of a few hundred Kb. My guess is that there is a small piece of redundant sequence. They could be haplotype duplications which are too divergent to be identified as homologous sequences by purge_dups but similar enough to be captured by HiC signal. If that is the case, you can remove that piece of sequence (and should probably put it into the other haplotype assembly if it is missed there).

I can comment further if you show me the HiC maps around that point.

Best, Chenxi

MboiTui commented 3 months ago

Hello Chenxi,

Thank you for your reply. Here's a link to a higher definition HiC map for linkage group 1. It's 123 Mb so had to provided as a link. Let me know if another option is preferred.

https://www.dropbox.com/scl/fi/bsnh252tovzs4bc5y72fs/Galaxy213-pretext_snapshotscaffold_1.png?rlkey=94ic7xyf3zvali0cvn6vrbwoy&st=k5omdf2k&dl=0

MboiTui commented 3 months ago

Here are two screenshot of a close up of the area of interest.

Screen Shot 2024-06-07 at 10 31 59 am

Screen Shot 2024-06-07 at 10 32 15 am

c-zhou commented 3 months ago

Hi @MboiTui,

Thanks very much for sharing the image. This is very informative.

This is not the case I mentioned above about the haplotype duplications. Could you help me check one more thing? You should be able to find the position of this misjoining point on this scaffold from the HiC plot. With this position, you can refer to the AGP file generated by YaHS to check if this misjoining point is within a GAP or a contig. If it is within a gap, there is something wrong with the contig joining process; otherwise, this is a misassembly introduced by the assembler and YaHS failed to error-correct it. In either case, I would like to fix it.

I do not think you want to wait for the fix from my end. For your project, you should probably simply break it at this misjoining position. It would be easy if it is within a GAP; otherwise, you may need to use the synteny information with the other haplotype to find the exact breakpoint.

Best, Chenxi

MboiTui commented 2 months ago

Hello @c-zhou,

Thank you for your prompt response. I am finishing up a different project at the moment with a strict timeline. I will get back to you with the info you requested in a couple of weeks.

Thanks for a great software and for the good support.

Best, LVB

c-zhou commented 2 months ago

Thanks @MboiTui! Chenxi

MboiTui commented 2 weeks ago

Hello @c-zhou,

Thank you for your insight.

I have checked the HiC contact map with PretextView, and the gap appears to be at 201,525,000 bp. Looking at the AGP file from YAHS, this is within a contig and not a gap (if i interpreted the AGP file correctly), so it should be an issue within the assembler?

scaffold_1  201157251   201157450 548   N   200 scaffold    yes proximity_ligation
scaffold_1  201157451   201543633 549   W   h2tg001723l_path_1  1   386183  -
scaffold_1  201543634   201543833 550   N   200 scaffold    yes proximity_ligation
scaffold_1  201543834   203019410 551   W   h2tg001684l_path_1  47001   1522577 +
scaffold_1  203019411   203019610 552   N   200 scaffold    yes proximity_ligation
scaffold_1  203019611   203062164 553   W   h2tg006944l_path_1  16001   58554   +

Screen Shot 2024-08-22 at 5 49 21 pm

I will use synteny information with the other haplotype to identify the breakpoint

Thank you for your help, I appreciate it