chrisjackson-pellicle / hybpiper-nf

Nextflow and Singularity/Conda pipeline for running HybPiper (https://github.com/mossmatters/HybPiper)
GNU General Public License v3.0
6 stars 2 forks source link

hybpiper-nf

Original HybPiper github, wiki and tutorial:

For an explanation of the general purpose of HybPiper, and the approach it takes to generate target locus sequences from sequence capture data, please see the documentation, wiki and tutorial at the mossmatters repo:

hybpiper-nf: containerised and pipelined using Singularity and Nextflow

To simplify running HybPiper, I’ve provided a Singularity container based on the Linux distribution Ubuntu 22.04, with all the software required to run the HybPiper pipeline (including some additional functionality, see below for details), as well as all the dependencies (BioPython, BLAST, BWA, BBmap, Exonerate, SPAdes, Samtools). The container is called hybpiper-paragone.sif.

To run HybPiper using this container, I’ve provided a Nextflow pipeline that uses the software in the Singularity container. This pipeline runs all HybPiper steps with a single command. The pipeline script is called hybpiper.nf. It comes with an associated config file called hybpiper.config. The only input required is a folder of sequencing reads for your samples, and a target file in .fasta format. The Nextflow pipeline will automatically generate the namelist.txt file required by some of the HybPiper scripts, and will run all HybPiper scripts on each sample in parallel. It also includes an optional read-trimmming step to QC your reads prior to running HybPiper, using the software Trimmomatic. The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).

Input data

Target file

Your target file, which can contain either nucleotide OR protein sequences (see below for corresponding pipeline options), should follow the formatting described for HybPiper here. Briefly, your target genes should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:

>AJFN-4471
AATGTTATACAGGATGAAGAGAAACTGAATACTGCAAACTCCGATTGGATGCGGAAATACAAAGGCT...
>Ambtr-4471
AGTGTTATTCAAGATGAAGATGTATTGTCGACAGCCAATGTGGATTGGATGCGGAAATATAAGGGCA...
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGGGGGTAGAGAACATGCTCTATGCTATGCTTTGC...
>Arath-4691
GAGCTTGGATCTATCGCTTGCGCAGCTCTCTGTGCTTGCACTCTTACAATAGCTTCTCCTGTTATTG...
>BHYC-4691
GAAGTGAACTGTGTTGCTTGTGGGTTTCTTGCTGCTCTTGCTGTCACTGCTTCTCCCGTAATCGCTG...
etc...

Read files

You will need to provide the hybpiper.nf pipeline with either:

a) A directory of paired-end forwards and reverse reads (and, optionally, a file of single-end reads) for each sample; or

b) A directory of single-end reads.

For the read files to be recognised by the pipeline, they should be named according to the default convention:

*_R1.fastq 
*_R2.fastq
*_single.fastq (optional; will be used if running with the flag `--paired_and_single`)

OR

*_R1.fq 
*_R2.fq
*_single.fq (optional; will be used if running with the flag `--paired_and_single`)

NOTE:

Running on Linux

Please see the Wiki entry Running on Linux.

Running on a Mac (macOS)

Please see the Wiki entry Running on a Mac.

NOTE: Macs using the new Apple M1 chip are not yet supported.

Running on a PC (Windows)

Please see the Wiki entry Running on a PC.

Nextflow pipeline options and parameters

Example run command:

nextflow run hybpiper.nf -c hybpiper.config -entry assemble -profile standard_singularity --illumina_reads_directory reads_for_hybpiper --targetfile_dna Angiosperms353_targetSequences.fasta
Mandatory arguments:

  ############################################################################

  --illumina_reads_directory <directory>    
                              Path to folder containing illumina read file(s)

  AND

  --targetfile_dna <file>    File containing fasta sequences of target genes
                              (nucleotides)

  OR

  --targetfile_aa <file>     File containing fasta sequences of target genes
                              (amino-acids)

  #############################################################################

Optional arguments:

  -profile <profile>          Configuration profile to use. Can use multiple 
                              (comma separated). Available: standard (default), 
                              standard_singularity, slurm_singularity, conda, 
                              conda_slurm

  --namelist                  A text file containing sample names. Only these 
                              samples will be processed, By default, all samples 
                              in the provided <Illumina_reads_directory> 
                              directory are processed

  --combine_read_files        Group and concatenate read-files via a common prefix. 
                              Useful if samples have been run across multiple lanes. 
                              Default prefix is all text preceding the first 
                              underscore (_) in read filenames

  --combine_read_files_num_fields <int>     
                              Number of fields (delimited by an underscore) to use 
                              for combining read files when using the 
                              `--combine_read_files` flag. Default is 1

  --num_forks <int>           Specify the number of parallel processes (e.g. 
                              concurrent runs of 'hybpiper assemble') to run at any 
                              one time. Can be used to prevent Nextflow from using 
                              all the threads/cpus on your machine. Default is 
                              to use the maximum number possible  

  --outdir <directory_name>                 
                              Specify the name of the pipeline results directory. 
                              Default is 'results'        

  --paired_and_single         Use when providing both paired-end R1 and R2 read 
                              files as well as a file of single-end reads for each 
                              sample       

  --single_end                Use when providing providing only a folder of 
                              single-end reads                         

  --read_pairs_pattern <pattern>            
                              Provide a comma-separated read-pair pattern for 
                              matching fowards and reverse paired-end readfiles, 
                              e.g. '1P,2P'. Default is 'R1,R2'

  --single_pattern <pattern>                
                              Provide a pattern for matching single-end read 
                              files. Default is 'single'

  #######################################################################################
  ################################ Trimmomatic options: #################################
  #######################################################################################

  --use_trimmomatic           Trim forwards and reverse reads using Trimmomatic.
                              Default is off

  --trimmomatic_leading_quality <int>       
                              Cut bases off the start of a read, if below this 
                              threshold quality.Default is 3

  --trimmomatic_trailing_quality <int>      
                              Cut bases off the end of a read, if below this 
                              threshold quality. Default is 3

  --trimmomatic_min_length <int>            
                              Drop a read if it is below this specified length. 
                              Default is 36

  --trimmomatic_sliding_window_size <int>   
                              Size of the sliding window used by Trimmomatic; 
                              specifies the number of bases to average across. 
                              Default is 4

  --trimmomatic_sliding_window_quality <int>
                              Specifies the average quality required within the 
                              sliding window. Default is 20

  #######################################################################################
  ############################# hybpiper assemble options: ##############################
  #######################################################################################

  --bwa                       Use BWA to search reads for hits to target. Requires
                              BWA and a target file that is nucleotides!

  --diamond                   Use DIAMOND instead of BLASTx

  --diamond_sensitivity       Use the provided sensitivity for DIAMOND searches. 
                              Option are: 'mid-sensitive', 'sensitive', 
                              'more-sensitive', 'very-sensitive', 'ultra-sensitive'

  --distribute_low_mem        Distributing and writing reads to individual gene 
                              directories  will be 40-50 percent slower, but can use 
                              less memory/RAM with large input files

  --evalue                    e-value threshold for blastx/DIAMOND hits, default: 0.0001

  --max_target_seqs           Max target seqs to save in BLASTx search, default is 10

  --cov_cutoff <int>          Coverage cutoff to pass to the SPAdes assembler. 
                              Default is 8

  --single_cell_assembly      Run SPAdes assemblies in MDA (single-cell) mode. 
                              Default is False

  --kvals                     Values of k for SPAdes assemblies. SPAdes needs to be 
                              compiled to handle larger k-values! Default is 
                              auto-detection by SPAdes

  --thresh                    Percent identity threshold for retaining Exonerate
                              hits. Default is 55, but increase this if you are 
                              worried about contaminant sequences

  --paralog_min_length_percentage <decimal> 
                              Minimum length percentage of a SPAdes contig vs 
                              reference protein query for a paralog warning to be 
                              generated and a putative paralog contig to be 
                              recovered. Default is 0.75 

  --depth_multiplier          Assign a long paralog as the "main" sequence if it 
                              has a coverage depth <depth_multiplier> times all 
                              other long paralogs. Set to zero to not use depth. 
                              Default is 10

  --timeout_assemble          Kill long-running gene assemblies if they take longer 
                              than X percent of average

  --timeout_exonerate_contigs Kill long-running processes if they take longer than 
                              X seconds. Default is 120

  --target                    Use the target file sequence with this taxon name in 
                              Exonerate searches for each gene. Other targets for 
                              that gene will be used only for read sorting. Can be a 
                              tab-delimited file (one <gene>\t<taxon_name> per line) 
                              or a single taxon name

  --exclude                   Do not use any sequence with the specified taxon name 
                              string in Exonerate searches. Sequenced from this 
                              taxon will still be used for read sorting

  --no_stitched_contig        Do not create stitched contigs; use longest Exonerate 
                              hit only. Default is off

  --chimera_test_memory <int> Memory (RAM) amount in MB to use for bbmap.sh when
                              peforming stitched-contig chimera tests. Default is 
                              1000 MB

  --bbmap_subfilter <int>     Ban alignments with more than this many 
                              substitutions when performing read-pair mapping to 
                              supercontig reference (bbmap.sh). Default is 7

  --chimeric_stitched_contig_edit_distance <int>    
                              Minimum number of base differences between one read 
                              of a read pair vs the stitched-contig reference for a 
                              read pair to be flagged as discordant. Default is 5

  --chimeric_stitched_contig_discordant_reads_cutoff <int>           
                              Minimum number of discordant reads pairs required 
                              to flag a stitched-contig as a potential chimera of 
                              contigs from multiple paralogs. Default is 5

  --exonerate_hit_sliding_window_size <int>
                              Size of the sliding window (in amino-acids) when 
                              trimming termini of Exonerate hits. Default is 3.

  --exonerate_hit_sliding_window_thresh <int>
                              Percentage similarity threshold for the sliding window 
                              (in amino-acids) when trimming termini of Exonerate hits. 
                              Default is 55.

  --merged                    Merge forward and reverse reads, and run SPAdes 
                              assembly with merged and unmerged (the latter 
                              in interleaved format) data. Default is off

  --run_intronerate           Run the intronerate() function to recover intron 
                              and supercontig sequences. Default is off, and so 
                              fasta files in `subfolders 09_sequences_intron` and 
                              `10_sequences_supercontig` will be empty

  --keep_intermediate_files   Keep all intermediate files and logs, which can be 
                              useful for debugging. Default action is to delete 
                              them, which greatly reduces the total file number

  --no_padding_supercontigs   If Intronerate is run, and a supercontig is created 
                              by concatenating multiple SPAdes contigs, do not add 
                              10 "N" characters between contig joins. By default, 
                              Ns will be added

  --verbose_logging           If supplied, enable verbose login. NOTE: this can 
                              increase the size of the log files by an order of 
                              magnitude

  #######################################################################################
  ####################### hybpiper paralog_retriever options: ###########################
  #######################################################################################

  --paralogs_list_threshold_percentage PARALOGS_LIST_THRESHOLD_PERCENTAGE
                              Percent of total number of samples and genes that must
                              have paralog warnings to be reported in the
                              <genes_with_paralogs.txt> report file. The default is
                              0.0, meaning that all genes and samples with at least
                              one paralog warning will be reported
  --figure_length FIGURE_LENGTH
                              Length dimension (in inches) for the output heatmap
                              file. Default is automatically calculated based on the
                              number of genes
  --figure_height FIGURE_HEIGHT
                              Height dimension (in inches) for the output heatmap
                              file. Default is automatically calculated based on the
                              number of samples
  --sample_text_size SAMPLE_TEXT_SIZE
                              Size (in points) for the sample text labels in the
                              output heatmap file. Default is automatically
                              calculated based on the number of samples
  --gene_text_size GENE_TEXT_SIZE
                              Size (in points) for the gene text labels in the
                              output heatmap file. Default is automatically
                              calculated based on the number of genes
  --heatmap_filetype {png,pdf,eps,tiff,svg}
                              File type to save the output heatmap image as. Default
                              is png
  --heatmap_dpi HEATMAP_DPI
                              Dots per inch (DPI) for the output heatmap image.
                              Default is 300

  #######################################################################################
  ######################## hybpiper recovery_heatmap options: ###########################
  #######################################################################################

  --figure_length FIGURE_LENGTH
                              Length dimension (in inches) for the output heatmap
                              file. Default is automatically calculated based on the
                              number of genes
  --figure_height FIGURE_HEIGHT
                              Height dimension (in inches) for the output heatmap
                              file. Default is automatically calculated based on the
                              number of samples
  --sample_text_size SAMPLE_TEXT_SIZE
                              Size (in points) for the sample text labels in the
                              output heatmap file. Default is automatically
                              calculated based on the number of samples
  --gene_text_size GENE_TEXT_SIZE
                              Size (in points) for the gene text labels in the
                              output heatmap file. Default is automatically
                              calculated based on the number of genes
  --heatmap_filetype {png,pdf,eps,tiff,svg}
                              File type to save the output heatmap image as. Default
                              is *.png
  --heatmap_dpi HEATMAP_DPI
                              Dot per inch (DPI) for the output heatmap image.
                              Default is 150

Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.

Output folders and files

If you're just after the unaligned .fasta files for each of your target genes (not including putative paralogs), the two main output folders of interest are probably:

If you need the per-sample output folders produced by a standard HybPiper run, these can be found in folder:

For a full explanation of output folders and files, please see the Wiki entry Output folders and files.

General information

For details on adapting the pipeline to run on local and HPC computing resources, see here.

Issues still to deal with (WIP)

Please see the Wiki entry Issues.

Changelog

31 May 2024

03 July 2023

15 May 2023

14 April 2023

01 February 2023

28 November 2022

09 November 2021