c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

YAHS output is identical to the input assembly #82

Open sarjopp opened 4 months ago

sarjopp commented 4 months ago

For three different assemblies, the output from YAHS is exactly the same as the input assembly. I used the Arima-HiC Mapping Pipeline to map my HiC reads to my assemblies. Assemblies are either HiFi-only (assembled with hifiasm) or HiFi+BioNano. The bam mapping statistics look great.

I tried starting with a higher resolution (-r 1000) but got the same result. The log file reports "assembly N50 (17086220) too small. Scaffolding anyway".

Are my assemblies simply too fragmented for YAHS to succeed? For my organism, this is an exceptionally good N50! The total genome size is 400M and there are 31 chromosomes.

richarddurbin commented 4 months ago

With 31 chromosomes in 400Mb the average chromosome size is 12.9Mb. Your N50 of 17,086,220 is considerably larger than that. I would guess that your assembly is close to having full chromosomes to start with, or at least that well over half the genome is in chromosome-sized pieces. How many contigs do you have? What are the sizes of the top 40 contigs, ranked by size?
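The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the genome size, chromosome count, and reported N50 are the figures quoted in this thread, and the `n50` helper and the toy length list are assumptions added here, not part of YAHS.

```python
def n50(lengths):
    """Return the length L such that pieces of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Figures quoted in this thread.
genome_size = 400_000_000
chromosomes = 31
reported_n50 = 17_086_220

mean_chromosome = genome_size / chromosomes
ratio = reported_n50 / mean_chromosome

print(f"mean chromosome size: {mean_chromosome / 1e6:.1f} Mb")
print(f"reported N50 / mean chromosome size: {ratio:.2f}")

# Toy input (hypothetical lengths) to show how the helper behaves.
toy_n50 = n50([5, 4, 3, 2, 1])
print(f"toy N50: {toy_n50}")
```

An N50 above the mean chromosome size is what you would expect when most of the sequence is already in chromosome-sized pieces.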

Another possibility is that although the HiC data maps well, there are almost no long range pairs. i.e. almost all read paired ends map very close to each other. Then there is no scaffolding signal.
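One rough way to check for that is to count how many read pairs are long range. The sketch below assumes the pairs have already been extracted from the BAM into `(contig1, pos1, contig2, pos2)` tuples by whatever tool you prefer; the function name, the 10 kb cutoff, and the example pairs are all hypothetical and not taken from YAHS or the Arima pipeline.

```python
def long_range_fraction(pairs, min_distance=10_000):
    """Fraction of read pairs that span contigs or map at least min_distance apart."""
    total = 0
    long_range = 0
    for contig1, pos1, contig2, pos2 in pairs:
        total += 1
        if contig1 != contig2 or abs(pos2 - pos1) >= min_distance:
            long_range += 1
    return long_range / total if total else 0.0


# Made-up pairs purely for illustration.
example_pairs = [
    ("scaffold_1", 1_000, "scaffold_1", 1_350),      # short range
    ("scaffold_1", 5_000, "scaffold_1", 5_200),      # short range
    ("scaffold_1", 10_000, "scaffold_1", 2_500_000), # long range, same scaffold
    ("scaffold_1", 40_000, "scaffold_2", 90_000),    # different scaffolds
]

fraction = long_range_fraction(example_pairs)
print(f"long-range fraction: {fraction:.2f}")
```

If the fraction on the real data is close to zero, there is little signal for any Hi-C scaffolder to work with, whatever the mapping rate looks like.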


sarjopp commented 4 months ago

Hi Richard, thanks for your reply! In the post-bionano assembly there are 39 scaffolds, ranging in size from 0.1M to 19.3M. The total length (403.1Mb) matches well with flow cytometry estimates of genome size (~400Mb), so some of the "extra" scaffolds (compared with the known chromosome number of 31) are presumably fragments of the same chromosome, which is one of the things I was hoping to resolve with HiC + bionano data.

You asked for sizes, so here you go!

Super-Scaffold_37 19.3M
Super-Scaffold_4 18.4M
Super-Scaffold_9 15.6M
Super-Scaffold_25 15.0M
Super-Scaffold_1 14.6M
Super-Scaffold_12 14.5M
Super-Scaffold_97 14.5M
Super-Scaffold_31 14.3M
Super-Scaffold_33 14.3M
Super-Scaffold_26 13.9M
Super-Scaffold_19 13.7M
Super-Scaffold_83 13.7M
Super-Scaffold_2 13.3M
Super-Scaffold_18 12.9M
Super-Scaffold_28 12.8M
Super-Scaffold_110 12.7M
Super-Scaffold_5 12.7M
Super-Scaffold_85 12.6M
Super-Scaffold_23 12.4M
Super-Scaffold_100016 12.3M
Super-Scaffold_14 11.6M
Super-Scaffold_78 11.5M
Super-Scaffold_32 10.9M
Super-Scaffold_16 10.6M
Super-Scaffold_7 10.4M
Super-Scaffold_100023 9.4M
Super-Scaffold_20 9.4M
Super-Scaffold_30 9.4M
Super-Scaffold_100026 8.1M
Super-Scaffold_13 7.7M
Super-Scaffold_24 7.0M
Super-Scaffold_100031 5.1M
Super-Scaffold_100037 3.8M
Super-Scaffold_34 3.8M
Super-Scaffold_77 0.3M
Super-Scaffold_100141 0.2M
Super-Scaffold_100277 0.2M
Super-Scaffold_100147 0.1M
Super-Scaffold_100148 0.1M

richarddurbin commented 4 months ago

You only have 8 joins to make. There are 5 scaffolds at 0.3Mb or smaller which need to be joined to something, and then it isn’t so clear above that, though I guess the remaining three joins are among the smaller of the remaining scaffolds. I would do this by hand in JuiceBox or something similar. You should be curating the final automated scaffolding anyway. If you look at the HiC map then you should see HiC connections between the smaller pieces and existing larger pieces. If you don’t then either the small pieces are so repetitive that nothing will map to them, or the HiC data are no good (you would see that if there are no off-diagonal data on the big scaffolds in the HiC map).

Richard
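For reference, the counts above can be reproduced from the scaffold sizes posted earlier in this thread. This is just a sketch of the arithmetic: the sizes are the rounded Mb values from that list, the 0.3 Mb cutoff is the one mentioned above, and the variable names are illustrative.

```python
# Scaffold sizes in Mb, as posted earlier in this thread (rounded values).
sizes_mb = [
    19.3, 18.4, 15.6, 15.0, 14.6, 14.5, 14.5, 14.3, 14.3, 13.9,
    13.7, 13.7, 13.3, 12.9, 12.8, 12.7, 12.7, 12.6, 12.4, 12.3,
    11.6, 11.5, 10.9, 10.6, 10.4, 9.4, 9.4, 9.4, 8.1, 7.7,
    7.0, 5.1, 3.8, 3.8, 0.3, 0.2, 0.2, 0.1, 0.1,
]
known_chromosomes = 31

# Each join merges two scaffolds into one, so the number of joins needed
# is the number of scaffolds beyond the chromosome count.
joins_needed = len(sizes_mb) - known_chromosomes

# Scaffolds at 0.3 Mb or smaller are almost certainly fragments.
small_fragments = sum(1 for size in sizes_mb if size <= 0.3)

total_mb = round(sum(sizes_mb), 1)

print(f"scaffolds: {len(sizes_mb)}")
print(f"joins needed: {joins_needed}")
print(f"fragments <= 0.3 Mb: {small_fragments}")
print(f"joins not explained by small fragments: {joins_needed - small_fragments}")
print(f"total length: {total_mb} Mb")
```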
