fanagislab / EndHiC

EndHiC is a fast and easy-to-use Hi-C scaffolding tool that uses the Hi-C links from contig end regions, rather than whole contigs, to assemble large contigs into chromosome-level scaffolds.

non-model species #3

Open andreaschavez opened 1 year ago

andreaschavez commented 1 year ago

Hi: I am studying a non-model diploid mammal species with a large genome (~6 Gb). I have 30X HiFi data, as well as Hi-C data from Dovetail Genomics using their Chicago libraries. I have made an assembly with hifiasm from the HiFi data. I would like to use EndHiC and am trying to use HiC-Pro to generate the input files for EndHiC. HiC-Pro wants a table file of chromosome sizes, but I don't have chromosome size information for my species. Can I create a table of scaffold lengths from my hifiasm assembly instead? Or do these methods only really work on species with already known genome information? This question might be more appropriate for the HiC-Pro folks, but I thought I would also check here. Thanks in advance. Best, Andreas

kingforest93 commented 1 year ago

Hi Andreas,

You are right. You can use a table of contig or scaffold lengths as the "chromosome sizes" ("GENOME_SIZE") input file for HiC-Pro; just make sure that the lengths exactly match the contig or scaffold sequences. HiC-Pro can handle any genomic sequences, whether a chromosome-level reference genome or a contig/scaffold-level draft. EndHiC is designed for the chromosome-level assembly of any species with HiFi and Hi-C data, and prior knowledge of genome size or chromosome number is not necessary. By the way, you should check the contig or scaffold N50/N90 size, since EndHiC works best with large contigs.

Sen Wang
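As an illustration of the length table described above, here is a minimal Python sketch that reads a FASTA file and writes one `name<TAB>length` line per sequence, the two-column layout commonly used for chromosome-size tables. The file names `assembly.fa` and `contigs.sizes` are hypothetical, and the sketch assumes the hifiasm output (which is written as GFA) has already been converted to FASTA. It is not taken from the EndHiC or HiC-Pro code.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for each record read from a FASTA file handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the first whitespace-delimited token of the header,
            # which is how most aligners name the reference sequences.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def write_sizes(fasta_path, sizes_path):
    """Write a tab-separated 'name<TAB>length' table for every sequence."""
    with open(fasta_path) as fasta, open(sizes_path, "w") as out:
        for name, length in fasta_lengths(fasta):
            out.write(f"{name}\t{length}\n")


if __name__ == "__main__":
    # Small in-memory demonstration; for a real run one would call
    # write_sizes("assembly.fa", "contigs.sizes") with hypothetical paths.
    demo = io.StringIO(">ctg1 some description\nACGT\nAC\n>ctg2\nGGG\n")
    for name, length in fasta_lengths(demo):
        print(f"{name}\t{length}")
```

Because the lengths are computed from the same FASTA that the reads are mapped to, the table matches the sequences exactly, which is the requirement stated above.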

andreaschavez commented 1 year ago

Hi Sen: Thank you for responding. Below are the bbstats genome-assembly metrics from our hifiasm assembly. You can see the contig lengths in the table. Also, would it be better to use the full contig assembly from hifiasm or one of the haplotype-phased assemblies? Thank you. Andreas

Main genome scaffold total:             1281
Main genome contig total:               1281
Main genome scaffold sequence total:    6131.895 MB
Main genome contig sequence total:      6131.895 MB     0.000% gap
Main genome scaffold N/L50:             58/28.024 MB
Main genome contig N/L50:               58/28.024 MB
Main genome scaffold N/L90:             31/43.668 MB
Main genome contig N/L90:               31/43.668 MB
Max scaffold length:                    121.696 MB
Max contig length:                      121.696 MB
Number of scaffolds > 50 KB:            867
% main genome in scaffolds > 50 KB:     99.79%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  1,281           1,281   6,131,894,624   6,131,894,624   100.00%
    500                  1,281           1,281   6,131,894,624   6,131,894,624   100.00%
   1 KB                  1,280           1,280   6,131,893,704   6,131,893,704   100.00%
 2.5 KB                  1,279           1,279   6,131,891,897   6,131,891,897   100.00%
   5 KB                  1,278           1,278   6,131,888,419   6,131,888,419   100.00%
  10 KB                  1,276           1,276   6,131,877,782   6,131,877,782   100.00%
  25 KB                  1,169           1,169   6,129,615,205   6,129,615,205   100.00%
  50 KB                    867             867   6,119,277,438   6,119,277,438   100.00%
 100 KB                    754             754   6,111,568,205   6,111,568,205   100.00%
 250 KB                    631             631   6,090,699,183   6,090,699,183   100.00%
 500 KB                    532             532   6,055,877,007   6,055,877,007   100.00%
   1 MB                    442             442   5,991,897,596   5,991,897,596   100.00%
 2.5 MB                    342             342   5,828,361,974   5,828,361,974   100.00%
   5 MB                    252             252   5,499,084,476   5,499,084,476   100.00%
  10 MB                    156             156   4,800,604,825   4,800,604,825   100.00%
  25 MB                     66              66   3,282,147,106   3,282,147,106   100.00%
  50 MB                     26              26   1,920,521,017   1,920,521,017   100.00%
 100 MB                      4               4     430,271,999     430,271,999   100.00%

kingforest93 commented 1 year ago

Hi Andreas: The assembly continuity is very good, and I think EndHiC should work well for this genome. Note that the sequences that come directly from hifiasm are "contigs", not "scaffolds": a contig cannot contain gaps, while a scaffold normally contains gaps that join several contigs in the correct order and orientation. Hope your work goes well! Sen
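For readers who want to check continuity themselves, here is a minimal Python sketch of the usual N50/N90 calculation from a list of contig lengths. The function name `nx` and the example lengths are made up for illustration; this is not part of EndHiC. It uses the common convention in which N50 is a length and the count is the number of contigs needed to reach that fraction of the assembly.

```python
def nx(lengths, fraction):
    """Return (length, count) for the Nx statistic.

    'length' is the size of the contig at which the running total of
    contigs, sorted from longest to shortest, first reaches 'fraction'
    of the total assembly size; 'count' is how many contigs that took.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Hypothetical contig lengths, not taken from the assembly above.
    example = [10, 8, 6, 4, 2]
    print("N50:", nx(example, 0.5))
    print("N90:", nx(example, 0.9))
```

The lengths can be read from the second column of a contig-length table, so the same file prepared for HiC-Pro can be reused for this check.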

andreaschavez commented 1 year ago

Thanks again, Sen. I had a word slip in the first message. I hope this works for us. I am curious, what is the largest genome you have or others have scaffolded with EndHiC that you know of? We haven't had very good success with other scaffolding programs. Thanks. Andreas

kingforest93 commented 1 year ago

I have used EndHiC to complete the chromosome-level assembly of a diploid plant genome of ~7 Gb. In my opinion, large genome size is not a challenge for a scaffolding program; what matters is the genome complexity, such as tandem repeats, chromatin conformation, etc. Sen