GreenHill is a de novo chromosome-level scaffolding and phasing tool using Hi-C. GreenHill generates chromosome-level haplotypes by scaffolding and phasing the input contigs using a combination of information from Hi-C and other reads (PE, MP, LongRead).
If you use GreenHill in your work, please cite:
Ouchi, S., Kajitani, R. & Itoh, T. GreenHill: a de novo chromosome-level scaffolding and phasing tool using Hi-C. Genome Biol 24, 162 (2023). https://doi.org/10.1186/s13059-023-03006-8
Shun Ouchi and Rei Kajitani at Tokyo Institute of Technology wrote the key source code. Contact address for this tool: platanus@bio.titech.ac.jp
GCC
Minimap2
Install the dependencies above.
Compile (make), and copy greenhill to a directory listed in PATH.
git clone https://github.com/ShunOuchi/GreenHill.git
cd GreenHill/src
make
cp greenhill <installation_path>
In a conda environment, run:
conda install -c bioconda greenhill
Or, if you want a separate conda environment for GreenHill:
conda create -n greenhill -c bioconda greenhill
You will then need to activate the greenhill environment before using it:
conda activate greenhill
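After either installation route, it can be worth confirming that the executables are actually reachable from `PATH`. The following is a minimal sketch, not part of GreenHill; the helper name `check_commands` is hypothetical. Note that `minimap2` is only needed when long reads are passed with `-p`.

```sh
# Hypothetical helper (not shipped with GreenHill):
# report which of the given commands are missing from PATH.
# Prints one line per missing command; returns non-zero if any is missing.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_commands greenhill minimap2
```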
for Haplotype-aware style input
greenhill \
-c nonBubble.fa \
-b primaryBubble.fa secondaryBubble.fa \
-IP1 PE_1.fq PE_2.fq \
-OP2 MP_1.fq MP_2.fq \
-p longread.fq \
-HIC HIC_1.fq HIC_2.fq \
2>3D.log
for Pseudo-haplotype or Mixed-haplotype style input
greenhill \
-cph contigs.fa \
-IP1 PE_1.fq PE_2.fq \
-OP2 MP_1.fq MP_2.fq \
-p longread.fq \
-HIC HIC_1.fq HIC_2.fq \
2>3D.log
out_afterPhase.fa (phased diploid scaffolds)
The examples below show how to run GreenHill on the test dataset, a simulated diploid dataset of Caenorhabditis elegans chr1.
greenhill \
-c Platanus-allee_result/out_nonBubbleOther.fa \
-b Platanus-allee_result/out_primaryBubble.fa Platanus-allee_result/out_secondaryBubble.fa \
-IP1 reads/PE_*.fq.gz \
-OP2 reads/MP5k_*.fq.gz \
-OP3 reads/MP9k_*.fq.gz \
-p reads/longread.fq.gz \
-HIC reads/HIC_1.fq.gz reads/HIC_2.fq.gz
greenhill \
-cph FALCON-Unzip_result/cns_p_ctg.fa FALCON-Unzip_result/cns_h_ctg.fa \
-p reads/longread.fq.gz \
-HIC reads/HIC_1.fq.gz reads/HIC_2.fq.gz
greenhill \
-cph Canu_result/asm.contigs.fa \
-p reads/longread.fq.gz \
-HIC reads/HIC_1.fq.gz reads/HIC_2.fq.gz
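The Platanus-allee example above passes paired files through shell globs such as `reads/PE_*.fq.gz`. The shell expands a glob in sorted order, so this only works when each library expands to an even number of files ordered as FWD1 REV1 [FWD2 REV2 ...]. The sketch below, which assumes nothing about GreenHill itself and uses a hypothetical helper name `check_pairs`, prints the pairing a glob would produce so it can be inspected before a long run.

```sh
# Hypothetical helper (not shipped with GreenHill):
# succeed only if an even, non-zero number of existing files is given,
# i.e. the arguments can be read as FWD1 REV1 [FWD2 REV2 ...].
check_pairs() {
    [ "$#" -gt 0 ] || { echo "no files given" >&2; return 1; }
    [ $(( $# % 2 )) -eq 0 ] || { echo "odd number of files: $#" >&2; return 1; }
    for f in "$@"; do
        # An unmatched glob is passed through literally and fails here.
        [ -e "$f" ] || { echo "not found: $f" >&2; return 1; }
    done
    # Print one "FWD REV" pair per line, in the order the shell expanded them.
    while [ "$#" -gt 0 ]; do
        echo "$1 $2"
        shift 2
    done
}

# Example call:
#   check_pairs reads/PE_*.fq.gz
```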
greenhill [OPTIONS] 2>log
-o STR : prefix of output file and directory (do not use "/", default out, length <= 200)
-c FILE1 [FILE2 ...] : contig (or scaffold) file (fasta format; for Haplotype-aware style input)
-b FILE1 [FILE2 ...] : bubble seq file (fasta format; for Haplotype-aware style input)
-cph FILE1 [FILE2 ...] : contig (or scaffold) file (fasta format; for Pseudo-haplotype or Mixed-haplotype style input; only effective without -c, -b option)
-ip{INT} PAIR1 [PAIR2 ...] : lib_id inward_pair_file (interleaved file, fasta or fastq)
-IP{INT} FWD1 REV1 [FWD2 REV2 ...] : lib_id inward_pair_files (separate forward and reverse files, fasta or fastq)
-op{INT} PAIR1 [PAIR2 ...] : lib_id outward_pair_file (interleaved, fasta or fastq)
-OP{INT} FWD1 REV1 [FWD2 REV2 ...] : lib_id outward_pair_files (separate forward and reverse files, fasta or fastq)
-p PAIR1 [PAIR2 ...] : long-read file (PacBio, Nanopore) (fasta or fastq)
-hic PAIR1 [PAIR2 ...] : HiC_pair_files (reads in 1 file, fasta or fastq)
-HIC FWD1 REV1 [FWD2 REV2 ...] : HiC_pair_files (reads in 2 files, fasta or fastq)
-t INT : number of threads (default 1)
-tmp DIR : directory for temporary files (default .)
-l INT : minimum number of links to scaffold (default 3)
-k INT : minimum number of links to phase variants (default 1)
-s INT1 [INT2 ...] : mapping seed length for short reads (default 32 64 96)
-mapper FILE : path of mapper executable file (default minimap2, only effective with -p option)
-minimap2_sensitive : sensitive mode for minimap2 (default, off; only effective with -p option)
Uncompressed and compressed (gzip or bzip2) files are accepted for the -c, -ip, -IP, -op, -OP, -p, -hic and -HIC options.
PREFIX_afterPhase.fa
PREFIX_*
PREFIX is specified by -o
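A quick way to summarize `PREFIX_afterPhase.fa` is to compute its total length and N50. The sketch below is an illustration only: the function name `fasta_stats` is hypothetical, it reads uncompressed FASTA, and it uses the common N50 definition (the length of the scaffold at which the cumulative length of scaffolds, sorted from longest to shortest, first reaches half of the total). Whether this matches the exact definition used for the figures published for GreenHill is an assumption.

```sh
# Hypothetical helper (not shipped with GreenHill):
# print total length and N50 (both in bp) of the sequences in a FASTA file.
# Handles multi-line records; expects uncompressed input.
fasta_stats() {
    # Stage 1: emit one sequence length per record.
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }
             { len += length($0) }
        END  { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        # Stage 2: lengths arrive longest first; accumulate until half the total.
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum >= half) { print "total=" total, "N50=" lens[i]; exit }
            }
        }
    '
}

# Example call:
#   fasta_stats out_afterPhase.fa
```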
Resulting scaffolds can be reviewed and curated with Juicebox Assembly Tool (JBAT). This can be accomplished using the programs below:
path_juicer=/path/to/juicer
path_3d=/path/to/3d_dna_pipeline
path_greenhill=/path/to/greenhill
seqkit sort -lr out_afterPhase.fa >base.fa
bwa index base.fa >bwa_index.log 2>&1
seqkit fx2tab -nl base.fa >base.sizes
juicer.sh -D $path_juicer -d $PWD -g base -s none -z base.fa -p base.sizes >juicer.log.o 2>juicer.log.e
awk -f $path_3d/utils/generate-assembly-file-from-fasta.awk base.fa >base.assembly 2>generate.log.e
$path_3d/visualize/run-assembly-visualizer.sh base.assembly aligned/merged_nodups.txt >visualizer.log.o 2>visualizer.log.e
python $path_greenhill/utils/fasta_to_juicebox_assembly.py base.fa >base.ctg_info.assembly
Then, you can input `base.hic` and `base.ctg_info.assembly` into [Juicebox](https://github.com/aidenlab/Juicebox). See the [cookbook](https://aidenlab.org/assembly/manual_180322.pdf) for the details of the review process.
![JBAT screenshot](images/JBAT_screenshot.png)
Finally, the reviewed assembly file, `base.ctg_info.review.assembly` (output of "Export Assembly" in Juicebox), is converted into the final FASTA file.
```sh
$path_3d/run-asm-pipeline-post-review.sh -r base.ctg_info.review.assembly base.fa aligned/merged_nodups.txt >post_review.log.o 2>post_review.log.e
```
The following table shows the statistics of several assemblies produced with GreenHill v1.1.0.

| Species | Input reads | Input assembly | Total (Mb) | N50 (Mb) | Peak memory (GB) | Runtime (h) |
|---|---|---|---|---|---|---|
| C. elegans | PE + CLR + Hi-C | Platanus-allee | 208.8 | 17.0 | 23.54 | 0.53 |
| Zebra finch | CLR + Hi-C | FALCON-Unzip | 2025.9 | 70.6 | 92.06 | 19.41 |
| Black rhinoceros | HiFi + Hi-C | Hifiasm | 5325.7 | 52.3 | 26.80 | 206.37 |
Runtimes were measured on a computer with two Intel(R) Xeon(R) Gold 6342 CPUs (2.80 GHz, 24 cores each).
For more information, please see the paper.
Both uncompressed and compressed (gzip or bzip2) FASTA/FASTQ files are accepted, and formats are auto-detected. Internally, GreenHill relies on the "file -bL", "gzip -cd" and "bzip2 -cd" commands, which are available on most UNIX-like systems.
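To illustrate the idea, here is a minimal shell sketch of detection built on those same three commands. It is not GreenHill's actual implementation; the function names `cat_any` and `seq_format` are hypothetical, and the FASTA/FASTQ guess from the first character (`>` or `@`) is an assumption about how such detection could work.

```sh
# Hypothetical helpers (not GreenHill code).

# Stream a possibly compressed file to stdout, choosing the decompressor
# from the description printed by `file -bL`.
cat_any() {
    case "$(file -bL "$1")" in
        gzip*)  gzip -cd "$1" ;;
        bzip2*) bzip2 -cd "$1" ;;
        *)      cat "$1" ;;
    esac
}

# Guess FASTA vs FASTQ from the first character of the decompressed stream.
seq_format() {
    case "$(cat_any "$1" | head -n 1 | cut -c 1)" in
        ">") echo fasta ;;
        "@") echo fastq ;;
        *)   echo unknown ;;
    esac
}

# Example call:
#   seq_format reads/longread.fq.gz
```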
Minimap2 is used to align PacBio/Oxford Nanopore long reads and to perform self-alignment. When long reads are input through the -p option, please check that Minimap2 is installed as the "minimap2" command, or specify the path to Minimap2 using the -mapper option.
Paired libraries are classified into "inward-pair" and "outward-pair" according to the sequence direction. For file formats, separate and interleaved files can be input through -IP (-OP) and -ip (-op) options, respectively.
Inward-pair (usually called "paired-end", accepted in options "-IP" or "-ip"):
    FWD --->
        5' -------------------- 3'
        3' -------------------- 5'
                        <--- REV
Outward-pair (usually called "mate-pair", accepted in options "-OP" or "-op"):
                        ---> REV
        5' -------------------- 3'
        3' -------------------- 5'
    FWD <---
Example inputs:
Inward-pair (separate, insert=300) : PE300_1.fq PE300_2.fq
Inward-pair (interleaved, insert=500): PE500_pair.fq
Outward-pair (separate, insert=2k) : MP2k_1.fq MP2k_2.fq
Corresponding options:
-IP1 PE300_1.fq PE300_2.fq \
-ip2 PE500_pair.fq \
-OP3 MP2k_1.fq MP2k_2.fq
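The option names encode three things: read direction (i/I for inward, o/O for outward), file layout (lowercase for interleaved, uppercase for separate files), and a numeric library ID. The sketch below decodes an option name according to that scheme. It is only an illustration of the naming convention described above; the helper name `describe_lib_option` is hypothetical and GreenHill does not provide it.

```sh
# Hypothetical helper (not shipped with GreenHill):
# decode a paired-library option name such as -IP1 or -op3.
describe_lib_option() {
    opt=$1
    case "$opt" in
        -IP*) kind="inward-pair, separate files";    id=${opt#-IP} ;;
        -ip*) kind="inward-pair, interleaved file";  id=${opt#-ip} ;;
        -OP*) kind="outward-pair, separate files";   id=${opt#-OP} ;;
        -op*) kind="outward-pair, interleaved file"; id=${opt#-op} ;;
        *) echo "not a paired-library option: $opt" >&2; return 1 ;;
    esac
    # The library ID must be a non-empty string of digits.
    case "$id" in
        ''|*[!0-9]*) echo "missing or non-numeric library ID: $opt" >&2; return 1 ;;
    esac
    echo "library $id: $kind"
}

# Example calls:
#   describe_lib_option -IP1
#   describe_lib_option -ip2
#   describe_lib_option -OP3
```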