LarsNauheimer / HybPhaser

Workflow to detect and phase hybrids in target capture data
GNU General Public License v3.0
15 stars 5 forks source link

HybPhaser

Version 2.0 -- this version is a major update from the version 1.2. It contains changes in regards to file generation and output. The workflos should be faster and easier to use. In addition, a number of bugs have been removed.

HybPhaser was developed to deal with hybrids (and polyploids) in target capture datasets.

It detects hybrids by measuring heterozygosity in the dataset and phase hybrid accessions by separating reads according to similarity with selected taxa that represent parental clades.

HybPhaser is built as an extension to the assembly pipeline HybPiper. Check also the new and improved HybPiper 2!

A preprint of the submitted manuscript describing the application of HybPhaser is available at bioRxiv

Installation

HybPhaser Installation

HybPhaser scripts can be downloaded from GitHub

Software dependencies

R (v4.0)
R-packages:
    ape (v5.4)
    seqinR (v4.2)
    stringR (v1.4)
BWA (v0.0.17)
SAMtools (v1.9)
Bcftools (v1.9)
BBSplit (BBMap v38.87)
HybPiper (v1.3.1-2.08)

Data Preparation

Prior to run HybPhaser, sequence assembly has to be performed using HybPiper (Johnson et al. 2016). How to use HybPiper is well-explained in the HybPiper-Wiki.

HybPiper requires the sequence reads and a fasta file with the target sequences. In short, HybPiper pre-selects reads that match to a target gene using BWA or BLAST and then performs a de novo assembly using Spades of the pre-selected read files. Extronerate is then used to extract and concatenate exon regions to generate gene sequences. Optionally intronerate can be used to recover any available intron regions and concatenate these with the exons to generate intronerated contigs.

The output of HybPiper is written in one folder per sample, output regards to single genes is written in subfolders, one per gene. HybPhaser requires several of the output files of HybPiper for further analyses. For each sample, HybPhaser requires the compiled sequences for each gene and the selected reads that mapped to each gene. The 'cleanup.py' script of HybPiper removes only SPAdes assembly files and can be used before running HybPhaser.

How to use HybPhaser

HybPhaser consists of two bash scripts and several R scripts that can be executed from the main script in R-studio or in R directly. All variables necessary to run the scripts have to be set in the configuration file 'config.txt'. The path to the configuration file can be adjusted in the main script of set in R or in R directly using the line config_file \<- "/path/to/config.txt".

Data preparation

  1. sequence assembly in HybPiper

SNPs assessment

  1. generate consensus sequences (bash)
  2. set variables in config.txt
  3. run scripts for part 1 in main script (R)
  4. perform alignments and phylogenetic analyses

Clade association

  1. extract mapped reads (bash, optional)
  2. set variables in config.txt
  3. run scripts for part 2 (R)

Phasing

  1. set variables in config.txt
  2. run scripts for part 3 (R)

Merge datasets

  1. set variables in config.txt
  2. run scripts for part 4 (R)
  3. perform alignments and phylogenetic analyses

1. SNP assessment

The detection of hybrids (and putative paralogous genes) relies on the assessment of SNPs, which is performed in several steps.

Firstly, consensus sequences have to be generated by re-mapping the sequence reads using the denovo assembled sequences from HybPiper as references (from here on called 'contigs', even though they might be stitched supercontigs). This is performed using the command line script "1_generate_consensus_sequences.sh". Consensus sequences contain ambiguity codes for heterozygous sites (SNPs) and thus contains the information on the heterozygosity of the dataset. This information of SNPs as well as recovered sequence length for each gene is collected using R to generate tables and graphs for further assessment. The information on recovered data per sample and gene can be used to remove samples or loci with poor data output to reduce missing data in the dataset. The information on SNPs per sample and gene can be used to remove putative paralogs and detect hybrid accessions. Summary tables and graphs allow assessment of the data and the config file allows setting of thresholds for dataset optimization. Various subsets can be named and resulting assessment tables and graphs are saved in different subfolder according to subset names. Finally, sequence lists are generated per locus or per sample and for contigs or for consensus sequences.

1.1. Consensus sequence generation

Consensus sequences summarize the information of assembled sequence reads and can contain ambiguity characters where reads differ. Differences in sequence reads can occur through divergent alleles, paralogous genes, sequencing errors, or contamination. The generation of consensus sequences allows to measure and assess the occurrence of heterozygous sites across the dataset and to identify hybrid taxa as well as paralogous genes. To generate consensus sequences, HybPhaser requires from the HybPiper output for each sample and each gene the reads mapped to the particular gene and the denovo assembled contig. These files are collected and saved under (01_data) in subfolders for each sample. When paired-end reads were used for the assembly, HybPiper separates mapped reads into two files, "gene_interleaved.fastq" for reads with pairs and "gene_unpaired.fastq" for reads without pairs. HybPhaser combines those files into one file "gene_combined.fasta". Finally, HybPhaser maps for each gene the mapped reads to the denovo contig and generates a consensus sequence that contains ambiguity codes where divergent reads occur. To prevent sequencing errors to be coded as ambiguity codes, variants are only called when there is a coverage of at least 10, a variant occurring at least 4 times and in at least 15% of the reads (parameters for variant calling can be adjusted). Since version 2.1, this script runs the process parallel on all available cores. This speeds up the process a lot, but might impact the use of the computer for other applications. There is an alternative script available that uses only a single core.

The bash script 1_generate_consensus_sequences.sh is executed in the terminal (command line).

Input: denovo contigs (HybPiper), reads mapped to each locus (HybPiper)

Output: Collected mapped reads and contigs from HybPiper for each sample and gene, generated consensus sequences and processing files for mapping and variant calling (.bam, .vcf.gz)

1_generate_consensus_sequences.sh

Options:

Example

1_generate_consensus_sequences.sh -n namelist.txt -p hybpiper_output_folder -o hybphaser_output_folder -t 4

Note: The default setting for variance calling is set to prevent low quality assemblies to result in false positives (SNPs due to contamination or sequencing errors). If the coverage is low (e.g. due to poor sequencing recovery), SNPs will not be called and the proportion of heterozygous sites will be reported lower than it actually is. The setting of minimum allele frequency of 15% to be recorded as SNP is used to prevent contaminations resulting in wrong SNP count, but can underestimate real allele divergence in highly polyploid accessions.

1.2. Assessment of consensus sequences (SNPs count)

The R script 1a_count_snps.R is used to generate a table with the proportions of SNPs in each locus and sample as well as a table with recovered sequence length. Ambiguity characters that code for 3 different nucleotides are weighted twice of the ones coding for 2 different nucleotides.

The script requires several variables to be set in the configuration script config.txt:

1.3. Assessment of SNPs and dataset optimization

The R script 1b_assess_dataset.R generated tables and graphs to assess the dataset in regards of sequence length recovered and proportion of SNPs per sample and gene. Tables and graphs are saved in a subfolder (02_assessment) or when a subset name is chosen (02_assessment_subset-name). Different thresholds can be set in the configuration file and results saved under various subsets.

Missing data:

HybPhaser graphs and summary statistics for missing data in samples and loci, as well as recovered sequence length for samples as proportion of the target sequence (or the longest target sequence, if multiple exist for a gene). Thresholds can be set to determine which samples are to be excluded from the dataset.

These variables can be configured in the configuration script:

Putative paralogs:

Genes with unusually high proportions of SNPs compared to other genes might have heterozygous sites due to the presence of paralogous genes or contamination. Recent paralogs might be restricted to a single species, while ancient paralogs can be shared across many taxa. Here loci can be removed from the dataset that have high proportion of SNPs across all samples as well as that are 'outlier' in a sample.

These variables can be configured in the configuration script:

Assessment of heterozygosity and allele divergence

The generated summary table and graphs displaying the locus heterozygosity and allele divergence can be used to make assumptions in regards to hybrid samples. Hybrids are expected to have a high locus heterozygosity (>80%) as well as a considerable allele divergence (>1%). However, these values are relative and can vary depending on the dataset, sequencing quality, study group, and many other factors. They should be carefully evaluated. High allele divergence can indicate a large population with diverse allele variants or in hybrids it can be the divergence between the parental alleles and correlate to the divergence of the parental species. A low allele divergence can indicate low allele diversity, e.g. occurring in isolated populations.

1.4. Sequence lists generation

The R script 1c_generate_sequence_lists.R creates sequence lists for each locus and for each sample for contigs and consensus sequences with the selected thresholds for dataset optimization applied. If a subset is chosen (by setting a subset name), the sequence lists are written in a folder with the subset name.

In total four folders are generated that contain the sequence lists in fasta format:

These sequence lists for loci can be aligned in order to perform phylogenetic analyses. The sequence lists for samples can be used for references in the clade association and phasing parts.

2. Clade association

Hybrid accessions contain sequence reads from divergent parents, which can be phased, if the parents are sufficiently divergent. The clade association step of HybPhaser maps all reads onto multiple references representing divergent clades and thus identify hybrids with divergent parents. The software BBSplit (BBMap) is used to simultaneously map the reads to the multiple references and record the proportion of reads mapped unambiguously to each reference.

All required variables are set in the configuration file config.txt. The R scripts will generate a command line script that executes the BBSplit analysis (2a) and summarize the resulting log files into tables (2b).

2.1. Selection of clade references

Based on a framework phylogeny (using the generated sequence lists), the user has to select suitable clade references. Ideally, clades should be selected that have similar phylogenetic distance across the phylogeny. Too many clades will result in less reads mapping unambiguously towards a single reference, and too few clades will result in accessions not mapping to any reference. Further accessions chosen to represent clades should have low locus heterozygosity and low allele divergence but high sequence recovery (number of loci sequenced and proportion of target sequence). A table (CSV format) listing the clade references as well as an abbreviated name needs to be generated by the user.

2.2. Extraction of mapped reads

Read files vary in the proportion of reads that match the target sequences, which has impact on the proportion of reads mapping to the clade references. It is therefore recommended to only use reads for the clade association that mapped to the target sequences. The bash script extract_mapped_reads.sh generates read files for each sample that only contain reads that mapped to the target sequence file. For this, all files containing the mapped reads for each gene are concatenated and duplicated (reads with the same name and sequence) are removed. One can also select to remove all duplicated reads (reads with the same sequence, but can have different names). This might be useful when there is a high coverage, but might skew the results. The script generated new read files, which can be used for the clade association analysis.

2_extract_mapped_reads.sh [options]

Options:

2.3. BBSplit script preparation / execution

The R script 2a_prepare_bbsplit_results.R can be used to generate the executable command line files that run the clade association mapping. It can also execute the script in R.

Variables to be set in the configuration file config.txt:

2.4. Collation of BBSplit results

The script 2b_collate_bbsplit_results.R generates a summary table based on the BBSplit output files. It displays the proportions of reads from each sample mapping unambiguously to each of the clade references. This table is essential to select accessions that can be phased.

Notes

The proportions of associated reads vary with the references. If references are closely related, less reads are unambiguous for either, if they are too far apart reads may not map unambiguous at all.

It can therefore be useful to run multiple clade association analyses with more or less clades, or with different references. It might also be advantageous to run first a wide scale association and then a fine-scale association.

3. Phasing

Hybrids with reads associated to multiple clades can be phased accordingly into phased accessions that approximate the phased haplotype accessions. BBSplit has the option to map reads to multiple references and distribute reads to new files. Here it is used so reads that map unambiguously to a single reference are assigned to only the regarding read file, while reads that map ambiguously to multiple references are assigned to all read files.

Configure_3_Phasing.R

HybPhaser scripts are used to facilitate the application of BBSplit by generating an executable command line script and generates a summary table of the BBSplit results. All variables for the phasing are set in the configuration file config.txt.

3.1. Selection of accessions for phasing

The selection of suitable accessions for the phasing step has to be done by the user based on the clade association. A table (CSV format) listing the accession to be phased as well as the relevant references needs to be generated by the user. The file needs to be a comma separated file with the following header (columns): samples,ref1,abb1,ref2,abb2,ref3,abb3,...

3.2. BBSplit phasing script preparation and execution

The script 3a_prepare_phasing_script.R can be used to generate the command line script that executes the BBSplit phasing step, or it can be run inside the R script.

Variables to set in config.txt:

3.3. Collation of BBSplit phasing results

The script 3b_collate_phasing_stats.R generated a summary table that displays the proportion of reads that matched unambiguously to each of the references the accession was phased to. The ratio of reads mapped to different references can give insights into the ploidy level, e.g. a F1 hybrid between two parents should roughly have a 1:1 ratio of parental haplotypes, but a tetraploid might have a 3:1 ratio under certain circumstances.

3.4. Assembly of phased accessions

The phased accessions should be assembled similar to the original accessions using HybPiper and HybPhaser (part1). In order to perform phylogenetic analyses on the combined dataset that contain the newly phased as well as the non-phased samples, both datasets have to be merged.

4. Merging phased with normal sequences

The R script 4a_merge_sequence_lists.R can be used to combine the phased with the normal sequence lists. By default all samples included in the sequence lists are used, but it is possible to generate sequence lists with subsets defined in text files listing accession or loci either to be included or excluded.

Variables to set in config.txt:

The combined sequence lists are ready for alignment and phylogenetic analyses.