HybPhaser

Version 2.0 -- this version is a major update from the version 1.2. It contains changes in regards to file generation and output. The workflos should be faster and easier to use. In addition, a number of bugs have been removed.

HybPhaser was developed to deal with hybrids (and polyploids) in target capture datasets.

It detects hybrids by measuring heterozygosity in the dataset and phase hybrid accessions by separating reads according to similarity with selected taxa that represent parental clades.

HybPhaser is built as an extension to the assembly pipeline HybPiper. Check also the new and improved HybPiper 2!

A preprint of the submitted manuscript describing the application of HybPhaser is available at bioRxiv

Installation

HybPhaser Installation

HybPhaser scripts can be downloaded from GitHub

Software dependencies

R (v4.0)
R-packages:
    ape (v5.4)
    seqinR (v4.2)
    stringR (v1.4)
BWA (v0.0.17)
SAMtools (v1.9)
Bcftools (v1.9)
BBSplit (BBMap v38.87)
HybPiper (v1.3.1-2.08)

Data Preparation

Prior to run HybPhaser, sequence assembly has to be performed using HybPiper (Johnson et al. 2016). How to use HybPiper is well-explained in the HybPiper-Wiki.

HybPiper requires the sequence reads and a fasta file with the target sequences. In short, HybPiper pre-selects reads that match to a target gene using BWA or BLAST and then performs a de novo assembly using Spades of the pre-selected read files. Extronerate is then used to extract and concatenate exon regions to generate gene sequences. Optionally intronerate can be used to recover any available intron regions and concatenate these with the exons to generate intronerated contigs.

The output of HybPiper is written in one folder per sample, output regards to single genes is written in subfolders, one per gene. HybPhaser requires several of the output files of HybPiper for further analyses. For each sample, HybPhaser requires the compiled sequences for each gene and the selected reads that mapped to each gene. The 'cleanup.py' script of HybPiper removes only SPAdes assembly files and can be used before running HybPhaser.

How to use HybPhaser

HybPhaser consists of two bash scripts and several R scripts that can be executed from the main script in R-studio or in R directly. All variables necessary to run the scripts have to be set in the configuration file 'config.txt'. The path to the configuration file can be adjusted in the main script of set in R or in R directly using the line config_file \<- "/path/to/config.txt".

Data preparation

sequence assembly in HybPiper

SNPs assessment

generate consensus sequences (bash)
set variables in config.txt
run scripts for part 1 in main script (R)
perform alignments and phylogenetic analyses

Clade association

extract mapped reads (bash, optional)
set variables in config.txt
run scripts for part 2 (R)

Phasing

set variables in config.txt
run scripts for part 3 (R)

Merge datasets

set variables in config.txt
run scripts for part 4 (R)
perform alignments and phylogenetic analyses

1. SNP assessment

The detection of hybrids (and putative paralogous genes) relies on the assessment of SNPs, which is performed in several steps.

Firstly, consensus sequences have to be generated by re-mapping the sequence reads using the denovo assembled sequences from HybPiper as references (from here on called 'contigs', even though they might be stitched supercontigs). This is performed using the command line script "1_generate_consensus_sequences.sh". Consensus sequences contain ambiguity codes for heterozygous sites (SNPs) and thus contains the information on the heterozygosity of the dataset. This information of SNPs as well as recovered sequence length for each gene is collected using R to generate tables and graphs for further assessment. The information on recovered data per sample and gene can be used to remove samples or loci with poor data output to reduce missing data in the dataset. The information on SNPs per sample and gene can be used to remove putative paralogs and detect hybrid accessions. Summary tables and graphs allow assessment of the data and the config file allows setting of thresholds for dataset optimization. Various subsets can be named and resulting assessment tables and graphs are saved in different subfolder according to subset names. Finally, sequence lists are generated per locus or per sample and for contigs or for consensus sequences.

1.1. Consensus sequence generation

Consensus sequences summarize the information of assembled sequence reads and can contain ambiguity characters where reads differ. Differences in sequence reads can occur through divergent alleles, paralogous genes, sequencing errors, or contamination. The generation of consensus sequences allows to measure and assess the occurrence of heterozygous sites across the dataset and to identify hybrid taxa as well as paralogous genes. To generate consensus sequences, HybPhaser requires from the HybPiper output for each sample and each gene the reads mapped to the particular gene and the denovo assembled contig. These files are collected and saved under (01_data) in subfolders for each sample. When paired-end reads were used for the assembly, HybPiper separates mapped reads into two files, "gene_interleaved.fastq" for reads with pairs and "gene_unpaired.fastq" for reads without pairs. HybPhaser combines those files into one file "gene_combined.fasta". Finally, HybPhaser maps for each gene the mapped reads to the denovo contig and generates a consensus sequence that contains ambiguity codes where divergent reads occur. To prevent sequencing errors to be coded as ambiguity codes, variants are only called when there is a coverage of at least 10, a variant occurring at least 4 times and in at least 15% of the reads (parameters for variant calling can be adjusted). Since version 2.1, this script runs the process parallel on all available cores. This speeds up the process a lot, but might impact the use of the computer for other applications. There is an alternative script available that uses only a single core.

The bash script 1_generate_consensus_sequences.sh is executed in the terminal (command line).

Input: denovo contigs (HybPiper), reads mapped to each locus (HybPiper)

Output: Collected mapped reads and contigs from HybPiper for each sample and gene, generated consensus sequences and processing files for mapping and variant calling (.bam, .vcf.gz)

1_generate_consensus_sequences.sh

Options:

-s \<Name of sample> (only required when not providing a namelist)
-n \<Namelist.txt> (txt file with sample names, one per line)
-p \<Path to HybPiper results folder> Default is current folder.
-o \<Path to output folder> (will be created, if it doesn’t exist). Default is ../HybPhaser
-i 'intronerated': If set, intronerate_supercontigs are used in addition to normal contigs.
-c 'clean-up': If set, reads and mapping files are removed (.bam, .vcf.gz)
-d \<value> Minimum coverage on site to be regarded for assigning ambiguity code. If read depth on that site is lower than chosen value, the site is not used for ambiguity code but the most frequent base is returned. Default is 10.
-f \<value> Minimum allele frequency regarded for assigning ambiguity code. If the alternative allele is less frequent than the chosen value, it is not included in ambiguity coding. Default is 0.15.
-a \<value> Minimum count of alleles to be regarded for assigning ambiguity code. If the alternative allele occurs less often than the chosen value, it is not included in ambiguity coding. Default is 4.

Example

1_generate_consensus_sequences.sh -n namelist.txt -p hybpiper_output_folder -o hybphaser_output_folder -t 4

Note: The default setting for variance calling is set to prevent low quality assemblies to result in false positives (SNPs due to contamination or sequencing errors). If the coverage is low (e.g. due to poor sequencing recovery), SNPs will not be called and the proportion of heterozygous sites will be reported lower than it actually is. The setting of minimum allele frequency of 15% to be recorded as SNP is used to prevent contaminations resulting in wrong SNP count, but can underestimate real allele divergence in highly polyploid accessions.

1.2. Assessment of consensus sequences (SNPs count)

The R script 1a_count_snps.R is used to generate a table with the proportions of SNPs in each locus and sample as well as a table with recovered sequence length. Ambiguity characters that code for 3 different nucleotides are weighted twice of the ones coding for 2 different nucleotides.

The script requires several variables to be set in the configuration script config.txt:

path_to_output_folder: set path for HybPhaser output folder ("/path/to/hybphaser_output_folder/")
fasta_file_with_targets: direct to the file with target sequences (baits) ("/path/to/target_file.fasta")
target_file_format : set whether target file is in DNA ("DNA") or AA ("AA") format
path_to_namelist: direct to a file that contains all sample names ("/path/to/namelist.txt")
intronerated_contig: select whether HybPiper was run with intronerate to fetch the intronerated contigs ("yes"), or whether normal contigs should be used ("no")

1.3. Assessment of SNPs and dataset optimization

The R script 1b_assess_dataset.R generated tables and graphs to assess the dataset in regards of sequence length recovered and proportion of SNPs per sample and gene. Tables and graphs are saved in a subfolder (02_assessment) or when a subset name is chosen (02_assessment_subset-name). Different thresholds can be set in the configuration file and results saved under various subsets.

name_for_dataset_optimization_subset : set name for subset ('"_optimized"'), leave empty ("") for no subset

Missing data:

HybPhaser graphs and summary statistics for missing data in samples and loci, as well as recovered sequence length for samples as proportion of the target sequence (or the longest target sequence, if multiple exist for a gene). Thresholds can be set to determine which samples are to be excluded from the dataset.

These variables can be configured in the configuration script:

remove_samples_with_less_than_this_propotion_of_loci_recovered: set threshold for samples to be removed that have a low proportion of the number of loci recovered (e.g. 0.8 will only keep samples that have a sequence recovered for at least 80% of loci)
remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered: set threshold for samples to be removed that have short sequence length recovered (as proportion of the target sequence length). (e.g. 0.6 will only keep samples that have on average 60% of the target sequence length recovered)
remove_loci_with_less_than_this_propotion_of_samples_recovered: set threshold for loci to be removed that have a low proportion of the number of samples recovered (e.g. 0.8 will only keep loci that have a sequence recovered for at least 80% of samples)
remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered: set threshold for loci to be removed that have short sequence length recovered (as proportion of the target sequence length). (e.g. 0.6 will only keep loci that have on average 60% of the target sequence length recovered)

Putative paralogs:

Genes with unusually high proportions of SNPs compared to other genes might have heterozygous sites due to the presence of paralogous genes or contamination. Recent paralogs might be restricted to a single species, while ancient paralogs can be shared across many taxa. Here loci can be removed from the dataset that have high proportion of SNPs across all samples as well as that are 'outlier' in a sample.

These variables can be configured in the configuration script:

remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs: set threshold to remove loci with high mean proportions of SNPs across all samples. (e.g. 0.02 will remove all loci with an average of 2% or SNPS across all samples, or "outliers" will remove all loci that have more than 1.5*IQR (interquartile range) above the 3rd quartile of mean SNPs)
file_with_putative_paralogs_to_remove_for_all_samples: alternatively all loci listed in a text file can be removed (e.g. when using a subset of the data and the same paralogs from another analysis should be removed)
remove_outlier_loci_for_each_sample: set whether outlier loci are to be removed per sample. (e.g. "yes" will remove for each samples loci that have a higher proportion of SNPs than 1.5*IQR (interquartile range) above the 3rd quartile.)

Assessment of heterozygosity and allele divergence

The generated summary table and graphs displaying the locus heterozygosity and allele divergence can be used to make assumptions in regards to hybrid samples. Hybrids are expected to have a high locus heterozygosity (>80%) as well as a considerable allele divergence (>1%). However, these values are relative and can vary depending on the dataset, sequencing quality, study group, and many other factors. They should be carefully evaluated. High allele divergence can indicate a large population with diverse allele variants or in hybrids it can be the divergence between the parental alleles and correlate to the divergence of the parental species. A low allele divergence can indicate low allele diversity, e.g. occurring in isolated populations.

1.4. Sequence lists generation

The R script 1c_generate_sequence_lists.R creates sequence lists for each locus and for each sample for contigs and consensus sequences with the selected thresholds for dataset optimization applied. If a subset is chosen (by setting a subset name), the sequence lists are written in a folder with the subset name.

In total four folders are generated that contain the sequence lists in fasta format:

loci_contig
loci_consensus
sample_contig
sample_consensus

These sequence lists for loci can be aligned in order to perform phylogenetic analyses. The sequence lists for samples can be used for references in the clade association and phasing parts.

2. Clade association

Hybrid accessions contain sequence reads from divergent parents, which can be phased, if the parents are sufficiently divergent. The clade association step of HybPhaser maps all reads onto multiple references representing divergent clades and thus identify hybrids with divergent parents. The software BBSplit (BBMap) is used to simultaneously map the reads to the multiple references and record the proportion of reads mapped unambiguously to each reference.

All required variables are set in the configuration file config.txt. The R scripts will generate a command line script that executes the BBSplit analysis (2a) and summarize the resulting log files into tables (2b).

2.1. Selection of clade references

Based on a framework phylogeny (using the generated sequence lists), the user has to select suitable clade references. Ideally, clades should be selected that have similar phylogenetic distance across the phylogeny. Too many clades will result in less reads mapping unambiguously towards a single reference, and too few clades will result in accessions not mapping to any reference. Further accessions chosen to represent clades should have low locus heterozygosity and low allele divergence but high sequence recovery (number of loci sequenced and proportion of target sequence). A table (CSV format) listing the clade references as well as an abbreviated name needs to be generated by the user.

2.2. Extraction of mapped reads

Read files vary in the proportion of reads that match the target sequences, which has impact on the proportion of reads mapping to the clade references. It is therefore recommended to only use reads for the clade association that mapped to the target sequences. The bash script extract_mapped_reads.sh generates read files for each sample that only contain reads that mapped to the target sequence file. For this, all files containing the mapped reads for each gene are concatenated and duplicated (reads with the same name and sequence) are removed. One can also select to remove all duplicated reads (reads with the same sequence, but can have different names). This might be useful when there is a high coverage, but might skew the results. The script generated new read files, which can be used for the clade association analysis.

2_extract_mapped_reads.sh [options]

Options:

-b \<path> Base folder of either HybPhaser (contains '01_data/'), HybPiper (contains sample directories), or HybPiper-RBGV (contains '04_processed_gene_directories/'). Default is './'
-o \<path> Output folder to write read files into. Default is './mapped_reads/'
-n \<path> Namelist. Optional, if not set, all folders in samples directory are used as samples.
-s Set to remove duplicated sequences (not only sequences with duplicated names).

2.3. BBSplit script preparation / execution

The R script 2a_prepare_bbsplit_results.R can be used to generate the executable command line files that run the clade association mapping. It can also execute the script in R.

Variables to be set in the configuration file config.txt:

path_to_clade_association_folder: path to the clade association output folder (e.g. "/path/to/hybphaser_output/04_clade_association/")
csv_file_with_clade_reference_names: set filename for the previously prepared text file with reference sample names (e.g. "/path/to/hybphaser_output/04_clade_association/clade_references.csv")
path_to_reference_sequences: set folder that contains the sequences of samples to select the clade reference sequences (e.g. "/path/to/hybphaser_output/03_sequence_lists/samples_consensus/")
path_to_read_files_cladeassociation: set folder that contains sequence read files (e.g. "/path/to/hybphaser_output/mapped_reads/")
read_type_cladeassociation: set whether reads are single or paired-end reads (e.g. "single-end", if you use the mapped-only reads)
ID_read_pair1: only required when reads are paired-end; set unique part of filename including the ending (e.g. "_R1.fastq")
ID_read_pair2: similar to “ID_read_pair1”
file_with_samples_included: only needed if not all read files in selected folder should be used. set path to text file that contains a list of all samples included (e.g., "", or "/path/to/list_with_samples.txt")
path_to_bbmap: set folder to bbmap binaries, if bbmap is not in your path variable
no_of_threads: set number of threads/cores to be used for BBSplit (default is 1)
run_clade_association_mapping_in_R: select whether the script is run directly in R (“yes”) or whether to run it manually from command line using the generated script ("no")
java_memory_usage_clade_association: only needed when needed when java -Xmx error comes up. Should be max. 85% of physical memory (e.g. "2G" for 2 GB or "512m" for 512 MB)

2.4. Collation of BBSplit results

The script 2b_collate_bbsplit_results.R generates a summary table based on the BBSplit output files. It displays the proportions of reads from each sample mapping unambiguously to each of the clade references. This table is essential to select accessions that can be phased.

Notes

The proportions of associated reads vary with the references. If references are closely related, less reads are unambiguous for either, if they are too far apart reads may not map unambiguous at all.

It can therefore be useful to run multiple clade association analyses with more or less clades, or with different references. It might also be advantageous to run first a wide scale association and then a fine-scale association.

3. Phasing

Hybrids with reads associated to multiple clades can be phased accordingly into phased accessions that approximate the phased haplotype accessions. BBSplit has the option to map reads to multiple references and distribute reads to new files. Here it is used so reads that map unambiguously to a single reference are assigned to only the regarding read file, while reads that map ambiguously to multiple references are assigned to all read files.

Configure_3_Phasing.R

HybPhaser scripts are used to facilitate the application of BBSplit by generating an executable command line script and generates a summary table of the BBSplit results. All variables for the phasing are set in the configuration file config.txt.

3.1. Selection of accessions for phasing

The selection of suitable accessions for the phasing step has to be done by the user based on the clade association. A table (CSV format) listing the accession to be phased as well as the relevant references needs to be generated by the user. The file needs to be a comma separated file with the following header (columns): samples,ref1,abb1,ref2,abb2,ref3,abb3,...

samples is the sample name that should be phased
ref1 is the sample name of the 1st reference
abb1 is the abbreviation for the 1st reference
…

3.2. BBSplit phasing script preparation and execution

The script 3a_prepare_phasing_script.R can be used to generate the command line script that executes the BBSplit phasing step, or it can be run inside the R script.

Variables to set in config.txt:

path_to_phasing_folder: set folder for phasing output (e.g. "path/to/hybphaser_output/05_phasing/")
csv_file_with_phasing_prep_info: set CSV file with accessions to phase and relevant references (e.g. "path/to/hybphaser_output/05_phasing/phasing_prep.csv")
path_to_read_files_phasing: set path to read files that should be phased (e.g. "/path/to/original_read_files")
read_type_4phasing: set whether reads are paired-end ("paired-end") or single-end ("single-end")
ID_read_pair1: if reads are paired-end, set unique part of filename including the ending (e.g. "_R1.fastq")
ID_read_pair2: if reads are paired-end, set unique part of filename including the ending (e.g. "_R2.fastq")
reference_sequence_folder: set folder for reference sequences (e.g. ".../03_sequence_lists/samples_consensus/")
path_to_bbmap_executables: set path to bbmap executables (if not in path)
no_of_threads_phasing: set number of maximum used threads/cores (default is 1)
run_bash_script_in_R: select whether the script is run directly in R (“yes”) or whether to run it manually from command line using the generated script ("no")
java_memory_usage_phasing: only needed when needed when java -Xmx error comes up. Should be max. 85% of physical memory (e.g. "2G" for 2 GB or "512m" for 512 MB)

3.3. Collation of BBSplit phasing results

The script 3b_collate_phasing_stats.R generated a summary table that displays the proportion of reads that matched unambiguously to each of the references the accession was phased to. The ratio of reads mapped to different references can give insights into the ploidy level, e.g. a F1 hybrid between two parents should roughly have a 1:1 ratio of parental haplotypes, but a tetraploid might have a 3:1 ratio under certain circumstances.

3.4. Assembly of phased accessions

The phased accessions should be assembled similar to the original accessions using HybPiper and HybPhaser (part1). In order to perform phylogenetic analyses on the combined dataset that contain the newly phased as well as the non-phased samples, both datasets have to be merged.

4. Merging phased with normal sequences

The R script 4a_merge_sequence_lists.R can be used to combine the phased with the normal sequence lists. By default all samples included in the sequence lists are used, but it is possible to generate sequence lists with subsets defined in text files listing accession or loci either to be included or excluded.