EndHic is a fast and easy-to-use Hi-C scaffolding tool, using the Hi-C links from contig end regions instead of whole contig regions to assemble large contigs into chromosomal-level scaffolds.
EndHiC takes the HiC-pro's bin matrix results as input data. After running HiC-pro, the recommended EndHiC usage for most users is to run endhic.pl or endhic_iterate.pl. When your contig assembly is quite good, then endhic.pl [one round of EndHiC] is able to finish the job; When your contig assembly is relatively fragmental, then endhic_iterate.pl [multiple rounds of EndHiC] should be used. How many rounds is needed depend on the contig assembly level, and more fragmental contigs need higher rounds of EndHiC.
Two example data are included in EndHiC package:
git clone git@github.com:fanagislab/EndHiC.git
cd EndHiC/z.testing_data/Arabidopsis_thalina sh work.sh
cd EndHiC/z.testing_data/Cichorium_intybus sh work.sh
Note: The Arabidopsis_thalina testing data shows the usage of endhic.pl on long-continuous contig assembly, while the Cichorium_intybus testing data shows the usage of endhic_iterate.pl with relatively shorter contig assembly.
An example data for detecting the assembly errors in the contigs is also included in EndHiC package:
cd EndHiC/z.testing_data/Detect_errors_in_contigs sh work.sh
Usage instructions:
(1) hifiasm.fa.len
Includes two column: contig_id contig_length
(2) humanHiC_100000_abs.bed
Generated by Hic-pro, 100-kb bins, bed format file
(3) humanHiC_100000.matrix
Generated by Hic-pro, 100-kb bins, raw matrix file
(4) humanHiC_100000_iced.matrix
Generated by Hic-pro, 100-kb bins, normalized matrix file
endhic_iterate.pl
endhic.pl
endhic_ctgEnd_pipeline.pl
ctgContact_from_ctgEndContacts.pl
turningpoint_by_lineartransform.pl
scaffold_by_trueCtgContact.pl
cluster_and_classify_GFA.pl
order_and_orient_GFA.pl
(1) Basic usage:
run endhic with specified contig end size and specified contact cutoff
ctgContact_from_ctgEndContacts.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix > humanHiC_100000.matrix.100000.10.CtgContact
turningpoint_by_lineartransform.pl humanHiC_100000.matrix.100000.10.CtgContact > humanHiC_100000.matrix.100000.10.CtgContact.adjustTransform 2> humanHiC_100000.matrix.100000.10.CtgContact.turningPoint
scaffold_by_trueCtgContact.pl --contacts 147.07 --reciprocalmax hifiasm.fa.len humanHiC_100000_iced.matrix.100000.10.CtgContact > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa
cluster_and_classify_GFA.pl humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology
order_and_orient_GFA.pl --size 2000000 humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology > humanHiC_100000_iced.matrix.100000.10.CtgContact.overCutoff.1.0.reciprocalMax.gfa.cluster
(2) Basic pipeline:
endhic_ctgEnd_pipeline.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix
endhic_ctgEnd_pipeline.pl --binsize 100000 --binnum 10 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000_iced.matrix
(3) Standard pipeline: [run only one round of endhic.pl, when contig assembly is quite good]
endhic.pl hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix
(4) Iterative pipeline: [run multiple rounds of endhic.pl, when contig assembly is not so good]
Using default parameters of endhic_iterate.pl endhic_iterate.pl --rounds 3 --binnumstep 5 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix
For more shorter contigs, try to run more rounds with smaller increasing of contig end sizes endhic_iterate.pl --rounds 15 --binnumstep 1 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix humanHiC_100000_iced.matrix
In 01.contig_end_contact_results/
(1) humanHiC_100000.matrix.*.CtgContact Contig contact file, with 7 columns (#CtgId1 CtgId2 EndContact Ctg1Pos Ctg2Pos UsedBinNum1 UsedBinNum2)
(2) humanHiC_100000.matrix.*.CtgContact.adjustTransform Contig contact, adjusted, and linear transformed, to find the turning point
(3) humanHiC_100000.matrix.*.CtgContact.turningPoint Automatically inferred turning point, which will be used as the basic value for the contig contact cutoff
In 02.GFA_contig_graph_results/
(1) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.gfa Contig graph in GFA format, contact value over cutoff, can be viewed in Bandage software
(2) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa Contig graph in GFA format, contact value over cutoff, and satisfy reciprocal best, can be viewed in Bandage software
In 03.cluster_order_orient_results/
(1) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa.topology Topology of the contig graph, identify linear or circular groups
(2) humanHiC_100000.matrix.*.CtgContact.overCutoff.1.0.reciprocalMax.gfa.cluster Scaffold results, including cluster, order, and orientation information
In 04.summary_and_merging_results/
(1) z.EndHiC.A.results.summary Summary and analysis results for the first loop, merging all the raw and iced results
(2) z.EndHiC.A.results.summary.cluster [Final EndHiC Result] Final scaffold results with high robustness, merging all the raw and iced results This is recommeded to be the final endhic result.
(3) z.EndHiC.A.results.summary.cluster.gfa GFA format of the final scaffold results, which can be graphically viewed in Bandage software
Instruction of *.summary file:
Part 1: Number of clusters under each condition
Part 2: Statistics of all Cluster units
Part 3: Statistics of merged Cluster units
Part 4: Statistics of stable (high frequency) cluster units
Part 5: Statistics of stable cluster units (redundant short contigs removed)
Part 6: Included contigs, total number, total length
Format of *.cluster file:
In total 5 columns
column 1: Cluster id, sorted by cluster length
column 2: Number of contigs included in this cluster
column 3: Cluster length, total length of contigs in this cluster
column 4: robustness, i.e. appearance times in the results from various contig end sizes and contact cutoffs
column 5: Included contigs with order and orientation, separated by ";", and "+-" means strands
e.g. ptg000046l-;ptg000079l+;ptg000058l-;ptg000047l+ (equivalent to ptg000047l-;ptg000058l+;ptg000079l-;ptg000046l+)
(1) draw HiC heatmap for contigs, helpful to find assembly errors in contigs, each point stands for window size 10*100000 = 1 M matrix2heatmap.py humanHiC_100000_abs.bed humanHiC_100000.matrix 10
(2) mapping the unitigs (p_utg) to contigs (p_ctg), only tested for Hifiasm contigs: perl ./map_utg_to_ctg.pl human.p_ctg.noseq.gfa human.p_utg.noseq.gfa > human.utg_to_ctg.map 2> human.utg_to_ctg.map.gfa
(3) Identify the assembly errors in contigs by Hi-C heatmaps and unitig breaks:
perl asm_error_check.pl human_100000_abs.bed human_100000.matrix human.utg_to_ctg.map > human.assmebly_errors.position
perl asm_error_check.pl human_100000_abs.bed human_100000.matrix > human.assmebly_errors.position
The result file contains 10 columns: Ctg_id contig id Error_loc assembly error position inferred from HiC data Inter_contact the count of HiC links between the two bins (500kb-apart) crossing the assembly error position Inter_cutoff the cutoff of Inter_contact, which is 1/10 of the median values from all the inter-bins (500kb-apart) Intra_contact1 the count of HiC links within the left bin crossing the assembly error position Intra_contact2 the count of HiC links within the right bin crossing the assembly error position Intra_cutoff the cutoff of Intra_contact1 and Intra_contact2, which is 1/10 of the median values of all the intra-bins Break_pos assembly error position inferred from unitig break point Utg1_id the left unitig crossing the assembly error position Utg2_id the right unitig crossing the assembly error position
(1) Convert to AGP and Fasta format files
cluster2agp.pl z.EndHiC.A.results.summary.cluster hifiasm.fa.len > scaffolds.agp
agp2fasta.pl scaffolds.agp hifiasm.fa > scaffolds.fa
(2) Draw Hi-C heatmaps for the EndHiC scaffolds
cluster2bed.pl humanHiC_100000_abs.bed z.EndHiC.B.results.summary.cluster > clusterB_100000_abs.bed 2> clusterB.id.len
matrix2heatmap.py clusterB_100000_abs.bed humanHiC_100000.matrix 20
(3) Mapping unclustered short contigs to each cluster
perl ../../../ctgContact_from_ctgEndContacts.pl --binsize 100000 --binnum -1 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix > humanHiC_100000.matrix.halfContig.ctgContact
ctgContact_normalize_distance.pl --normalize humanHiC_100000.matrix.halfContig.ctgContact > humanHiC_100000.matrix.halfContig.ctgContact.normalize
shortCtgs_to_cluster.pl --contact 1 --times 2 z.EndHiC.A.results.summary.cluster hifiasm.fa.len humanHiC_100000.matrix.halfContig.ctgContact.normalize > shortCtgs.mapped.to.clusters.list
(4) Convert EndHiC result to juicebox compatible file formats, which can be viewed in Juicebox
cluster_to_juciebox_assembly.pl contigs.fa.len z.EndHiC.A.results.summary.cluster > draft.assembly
bwa index draft.fa
juicer/misc/generate_site_positions.py MboI draft draft.fa
juicer/CPU/juicer.sh -S early -g draft -s MboI -z ./references/draft.fa -y ./draft_MboI.txt -p ./references/draft.fa.size -t 50 -D juicer/CPU
3d-dna/visualize/run-assembly-visualizer.sh draft.assembly merged_nodups.txt
For more instructions, please refer to the help pages of juicer and juicebox.
(1) Run endhic using the max contact values from bin pairs of two compared contigs endhic_maxBin_pipeline.pl --binsize 1000000 hifiasm.fa.len humanHiC_100000_abs.bed humanHiC_100000.matrix
(2) Apply Hierarchical clustering algorithm with contig distance converted from Hi-C contact values derived from contig end regions
ctgContact_normalize_distance.pl humanHiC_100000.matrix.100000.10.CtgContact > humanHiC_100000.matrix.100000.10.CtgContact.distance
hcluster_contigs.pl --verbose -type min humanHiC_100000.matrix.100000.10.CtgContact.distance hifiasm.fa.len > humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster.one 2> humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster
hcluster_suitable_stop.pl --chr_num 23 --chr_len 20000000 humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster > humanHiC_100000.matrix.100000.10.CtgContact.distance.hcluster.need
(3) Compare two clusters
cluster_compare.pl human.contigs.minimap2.cluster z.EndHiC.A.results.summary.cluster > z.EndHiC.A.results.summary.cluster.vs.ref
Sen Wang, Hengchao Wang, Fan Jiang, Anqi Wang, Hangwei Liu, Hanbo Zhao, Boyuan Yang, Dong Xu, Yan Zhang, Wei Fan. EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends. (2021) https://arxiv.org/abs/2111.15411v1
Wei Fan, 0000-0001-5036-8733 fanwei@cass.cn or fanweiagis@126.com
Sen Wang, 0000-0001-9793-4472 wangsen1993@163.com
Hengchao Wang, 0000-0002-8754-4195 wanghengchao000@qq.com
Fan Jiang, 0000-0003-1359-0970 greatjf@163.com
Yan Zhang, 0000-0003-2281-7807 milrazhang@163.com