NGSEP - Next Generation Sequencing Experience Platform Version 5.0.0 (26-01-2024)
===========================================================================
NGSEP provides an object model to enable different kinds of analysis of DNA high throughput sequencing (HTS) data. The classic use of NGSEP is a reference guided construction and downstream analysis of large datasets of genomic variation. NGSEP performs accurate detection and genotyping of Single Nucleotide Variants (SNVs), small and large indels, short tandem repeats (STRs), inversions, and Copy Number Variants (CNVs). NGSEP also provides utilities for downstream analysis of variation in VCF files, including functional annotation of variants, filtering, format conversion, comparison, clustering, imputation, introgression analysis and different kinds of statistics. Version 5 includes new modules for read alignment and de-novo analysis of short and long reads including calculations of k-mers, error correction, de-novo analysis of Genotype-by-sequencing data and de-novo assembly of long read whole genome sequencing (WGS) data.
NGSEP has been compiled and run successfully on the standard jdk version 11.0.8. To build the distribution library NGSEPcore.jar on a unix based command line environment run the following commands in the directory where NGSEPcore_5.0.0.tar.gz is located:
tar -xzvf NGSEPcore_5.0.0.tar.gz cd NGSEPcore_5.0.0 make all
Note: Usage fields below do not include the version number. To remove the version number, users can either copy the executable jar file:
cp NGSEPcore_5.0.0.jar NGSEPcore.jar
or just make a symbolic link:
ln -s NGSEPcore_5.0.0.jar NGSEPcore.jar
It is possible to obtain usage information for each module by typing:
java -jar NGSEPcore.jar
General information and the list of modules can be obtained by typing:
java -jar NGSEPcore.jar [ --help | --version | --citing ]
Builds individual fastq files for different samples from fastq files of complete sequencing lanes in which several samples were barcoded and sequenced. Several lane files can be provided with the option -d or a single file can be provided instead with the option -f (and -f2 for paired-end sequencing). If neither the -d or the -f options are specified, the program tries to read single sequencing reads from the standard input.
USAGE:
java -jar NGSEPcore.jar Demultiplex
OPTIONS:
-i FILE : Tab-delimited file with at least four columns by
default: flowcell, lane, barcode and sampleID. If
the -a option for dual barcode is activated, five
columns are expected: flowcell, lane, barcode1,
barcode2 and sampleID. The file must have a header
line. The same index file can be used to demultiplex
several FASTQ files (see option -d).
-d FILE : Tab-delimited file listing the lane FASTQ files to be
demultiplexed. Columns are: Flowcell, lane and fastq
file (which can be gzip compressed). A second fastq
file can be specified for pair-end sequencing. If the
reads sequenced for one lane are split in multiple
files, each file (or each pair of files) should be
included in a separate row. If this option is used,
the options -f, -f2, -c and -l are ignored.
-o DIR : Directory where the output fastq files will be saved.
Files will be gzip compressed by default.
-f FILE : File with raw reads in fastq format. It can be gzip
compressed.
-f2 FILE : File with raw reads in fastq format corresponding to
the second file for paired end reads. It can be gzip
compressed.
-c STRING : Id of the flowcell corresponding to the input fastq
file(s). Ignored if the -d option is specified but
required if -d option is not specified.
-l STRING : Id of the lane corresponding to the input fastq
file(s). Ignored if the -d option is specified but
required if the -d option is not specified.
-t STRING : Sequences to trim separated by comma. If any of the
given sequences is found within a read, the read will
be trimmed up to the start of the sequence.
-u : Output uncompressed files.
-r INT : Minimum read length to keep a read after trimming
adapter sequences. Default: 40.
-a : Activate demultiplexing with dual barcoding.
Extracts k-mers and generates a distribution of k-mer abundances from a file of DNA sequences either in fastq or in fasta format (see -f option). Writes two files, one with the k-mer distribution and a second file with the actual k-mers and their counts.
USAGE:
java -jar NGSEPcore.jar KmersExtractor
OPTIONS:
-o FILE : Prefix of the output files.
-k INT : K-mer length. Default: 15
-m INT : Minimum count to report a k-mer in the output file.
Default: 5
-text : Indicates that the sequences should be treated as
free text. By default it is assumed that the given
sequences are DNA and then only DNA k-mers are
counted. If this option is set, the -s option is also
activated to process the text only in the forward
direction.
-s : If set, only the forward strand would be used to
extract kmers. Mandatory for non-DNA sequences.
-f INT : Format of the input file(s). It can be 0 for fastq or
1 for fasta. Default: 0
-c : Ignore low complexity k-mers for counting and reporting.
-t INT : Number of threads. Default: 1
Builds a k-mer abundance profile and use this profile to identify and correct sequencing errors. For each predicted single nucleotide error, it looks for the single change that would create k-mers within the normal distribution of abundances. Using the option -e, this function can also receive a precalculated table of k-mers, which could come from a larger number of reads or reads sequenced using a different technology. For example, a k-mers profile based on Illumina reads could be built using the KmersExtractor command, and then this profile could be used to perform error correction on long reads.
USAGE:
java -jar NGSEPcore.jar ReadsFileErrorsCorrector
OPTIONS:
-i FILE : Input file with raw reads in fastq or fasta format. See
option -f for options on the file format. It can be gzip
compressed.
-o FILE : Output file with the corrected reads in fastq format
(gzip compressed).
-e FILE : Two column tab delimited file with k-mers and their
abundances.
-k INT : K-mer length. Default: 15
-m INT : Minimum k-mer count to consider a k-mer real. Default: 5
-s : If set, only the forward strand would be used to extract
kmers. Mandatory for non-DNA sequences.
-f INT : Format of the input file. It can be 0 for fastq or 1 for
fasta. Default: 0
Performs de novo variants discovery from a genotype-by-sequencing (GBS) or a double digestion RAD sequencing (ddRAD) experiment. Runs a clustering algorithm based on quasi-exact matches to representative k-mers within the first base pairs of each sequence. Then, it performs variants detection and sample genotyping within each cluster using the same Bayesian model implemented for the reference-guided analysis. By now it can only discover and genotype Single Nucleotide Variants (SNVs).
USAGE:
java -jar NGSEPcore.jar DeNovoGBS
OPTIONS:
-i FILE : Directory with fastq files to be analyzed. Unless the
-d option is used, it processes as single reads all
fastq files within the given directory.
-o FILE : Prefix for the output VCF file with the discovered
variants and genotype calls as well as other output
files describing the behavior of this process.
-d FILE : Tab delimited text file listing the FASTQ files to be
processed for paired-end sequencing. It should have
three columns. sample id, first fastq file and second
fastq file. All files should be located within the
directory provided with the option -i.
-k INT : K-mer length. Default: 31
-c INT : Maximum number of read clusters to process. This
parameter controls the amount of memory spent by the
process. Default: 2000000
-t INT : Number of threads to process read clusters. Default: 1
-maxBaseQS INT : Maximum value allowed for a base quality score.
Larger values will be equalized to this value.
Default: 30
-ignore5 INT : Ignore this many base pairs from the 5' end of the
reads. Default: 0
-ignore3 INT : Ignore this many base pairs from the 3' end of the
reads. Default: 0
-h DOUBLE : Prior heterozygosity rate. Default: 0.001
-minQuality INT : Minimum variant quality. In this command, this filter
applies to the QUAL column of the VCF, which is
calculated for each variant as the maximum of the
genotype qualities of samples with non-homozygous
reference genotype calls. See the command VCFFilter
to apply filters of quality and read depth on
individual genotype calls. Default: 40
-ploidy INT : Default ploidy of the samples. Default: 2
Builds a de-novo assembly from a set of long reads following an overlap-layout-consensus (OLC) approach. It receives a fasta or fastq file with raw PacBio HiFi or Nanopore reads and generates an assembly for the sample in fasta format. It also generates a grap.gz file with the information of the overlap graph. This graph can be provided as input in a second run skip the graph construction step. Note: Although this functionality is already producing competitive assemblies compared to other tools, the following versions will probably have improvements on contiguity and phasing of diploid assemblies.
USAGE:
java -jar NGSEPcore.jar Assembler
OPTIONS:
-i FILE : Input file. See option -f for options on the file
format. It can be gzip compressed.
-o FILE : Prefix of the output files.
-g FILE : File with a saved graph to perform layout and
consensus.
-k INT : K-mer length to identify overlaps Default: 15
-f INT : Format of the input file. It can be 0 for fastq or
1 for fasta. Default: 0
-w INT : Window length to calculate minimizers. Default: 30
-m INT : Minimum read length. Default: 5000
-mspe DOUBLE : Minimum proportion from the maximum score of the
edges of a sequence to keep an edge. Default: 0.3
-ploidy INT : Ploidy of the sample. Keep ploidy of 1 if the sample
is inbred, even if it is diploid or polyploid. This
option is still in progress and it has been tested
only in haploid and diploid samples. Default: 1
-cml INT : Maximum length of circular molecules Default: 0
-cmof FILE : Fasta file with known start sequences of circular
molecules
-ac STRING : Algorithm used to build the consensus. It can be
Simple or Polishing. Default: Polishing
-ecr INT : Number of rounds of alignment based error correction
to perform. Default: 0
-wid DOUBLE : Weight given to small indel differences in the
calculation of edge costs. Real number between 0 and
1. Increase if the reads have very low error rate
(for example HiFi reads already corrected).
Default: 0
-t INT : Number of threads. Default: 1
Sorts contigs of a de-novo assembly by mapping to a reference assembly of an individual of either the same or a close species. It does not join contigs. It only sorts and orient contigs to make the input genome colinear with the reference genome. WARN: Although we already see good sorting and orientation in many test cases, this functionality is still under testing. Further improvements are expected for the coming versions.
USAGE:
java -jar NGSEPcore.jar AssemblyReferenceSorter
OPTIONS:
-i FILE : Input genome in FASTA format. It can be gzip compressed.
-o FILE : Output file
-r GENOME : Reference genome to map contigs in FASTA format.
Required parameter. It can be gzip compressed.
-k INT : K-mer length Default: 25
-w INT : Window length to calculate minimizers. Default: 40
-rcp INT : Policy to rename contigs. 0: keep input names. 1: Use
reference chromosome and relative consecutive numbers.
2: use absolute consecutive numbers. Default: 1
-t INT : Number of threads Default: 1
Standardize the start and orientation of circular sequences in genome assemblies. It receives a fasta file with the assembly, identifies sequences with repeated ends, and remove redundancies. A set of start sequences can be provided to map and reorient contigs according to these start sequences.
USAGE:
java -jar NGSEPcore.jar CircularSequencesProcessor
OPTIONS:
-i FILE : Input genome in FASTA format. It can be gzip compressed.
-o FILE : Output file
-s FILE : Fasta file with sequences to be used as start sequences
-ml INT : Maximum length of a contig to try circularization. Longer
contigs will not be modified Default: 6000000
Takes a VCF file with genotype information from one sample and the reference genome used to build the VCF and generates a new genome in fasta format with a ploidy consistent with the ploidy of the individual. For diploid or polyploid assemblies, the VCF file must be properly phased. For non haploid individuals, if the ploidy parameter is set to 1, this function performs polishing of a haploid genome assembly.
USAGE:
java -jar NGSEPcore.jar IndividualGenomeBuilder
OPTIONS:
-i FILE : Fasta file with the original genome.
-v FILE : File in VCF format with the variants that will be
applied to the input genome.
-ploidy INT : Ploidy of the sample. To make polishing of a haploid
assembly for a non haploid individual, set this
parameter to 1. Default: 2
-o FILE : Output file in fasta format with the modified genome.
Creates a binary file containing an FM index for large sequences in fasta format (usually a reference genome). This structure facilitates performing massive text searches over the indexed sequence. This is a usual preparation step for alignment of short reads.
USAGE:
java -jar NGSEPcore.jar GenomeIndexer
OPTIONS:
-i FILE : Input genome to index in fasta format.
It can be gzip compressed.
-o FILE : Output binary file with the FM index associated with the
input genome.
Calculates a list of genomic regions for sites where the reads can be found in a reference genome. It receives up to two files with raw reads in fastq format and the reference genome. To map short reads to long genomes, a precalculated FM index can also be provided with the option -d. See command GenomeIndexer for construction of the FM index. It provides as output a file with alignments to the reference genome in BAM format.
USAGE:
java -jar NGSEPcore.jar ReadsAligner
OPTIONS:
-i FILE : Input file with raw reads in fastq format. It can be
gzip compressed. Required if the second fastq file is
provided using the option -i2.
-i2 FILE : Input file with raw reads in fastq format
corresponding to the second file for paired end reads.
It can be gzip compressed.
-o FILE : Output file with the aligned reads in BAM format.
-r GENOME : Reference genome to align reads in FASTA format.
Required parameter. It can be gzip compressed.
-d FILE : FM-index of the reference genome to align short reads.
See GenomeIndexer for instructions to generate this
file. For large genomes it is more efficient to index
the reference once and provide the index with this
option.
-s STRING : Id of the sample. Default: Sample
-p STRING : Sequencing platform used to produce the reads.
Supported platforms include ILLUMINA, IONTORRENT,
PACBIO and ONT. Default: ILLUMINA
-knownSTRs FILE : Text file with location of known short tandem repeats
(STRs). It is a tab-delimited file with at least
three columns: Sequence name (chromosome), region
first base pair coordinate (1-based, inclusive) and
region last base pair coordinate (1-based, inclusive).
-f INT : Format of the input file. It can be 0 for fastq or 1
for fasta. Default: 0
-k INT : K-mer length. Default: 25
-m INT : Maximum alignments per read. Default: 3
-minIL INT : Minimum predicted insert length to consider an
alignment proper. Default: 0
-maxIL INT : Maximum predicted insert length to consider an
alignment proper. Default: 1000
-w INT : Window length to compute minimizers. Default: 20
-t INT : Number of threads used to align reads. Default: 1
Takes one or more sets of alignments and a reference genome and counts the number of mismatches with the reference for each read position from 5' to 3' end. This report is useful to detect sequencing error biases. Requires one or more alignment files in SAM, BAM or CRAM format, and the reference genome that was used to produce the alignments. Writes to standard output unless the -o option is used to specify an output file.
USAGE:
java -jar NGSEPcore.jar BasePairQualStats
OPTIONS:
-o FILE : Output file with the base pair quality statistics.
-r FILE : Fasta file with the reference genome.
-minMQ INT : Minimum mapping quality to call an alignment unique.
Default: 20
The file(s) with alignments must be given in SAM, BAM or CRAM format and the reference file in fasta format. The output is a text file with five columns:
Calculates the number of base pairs that are covered by reads at each read depth level from 1 to a maximum. Alignments must be in SAM, BAM or CRAM format. Writes to standard output unless the -o option is used to specify an output file.
USAGE:
java -jar NGSEPcore.jar CoverageStats
OPTIONS:
-i FILE : Input file with alignments to analyze.
-o FILE : Output file with the coverage distribution.
-r GENOME : Fasta file with the reference genome. Required for
CRAM files.
-minMQ INT : Minimum mapping quality to call an alignment unique.
Default: 20
The alignments file must be given in SAM or BAM format. The output is a text file with three columns:
This module allows to call variants over a group of samples separated by files or read group tags. This is now the recommended method to perform variants detection on genotype-by-sequencing (GBS), RAD sequencing, whole exome sequencing (WES), RNA-seq and low coverage (less than 10x) whole genome sequencing (WGS) data. Although it can also be used on high coverage WGS data, the classic sample-by-sample analysis (commands SingleSampleVariantsDetector, MergeVariants and VCFMerge) is still recommended to identify structural variants. This module requires one or more read alignment files in SAM, BAM or CRAM format and the reference genome that was used to produce the alignments. Alignments must be sorted by reference coordinates.
USAGE:
java -jar NGSEPcore.jar MultisampleVariantsDetector
OPTIONS:
-r GENOME : Fasta file with the reference genome.
-o FILE : Output VCF file with discovered variants and
genotype calls. Default: variants.vcf
-ploidy INT : Default ploidy of the samples. Default: 2
-psp : Print id and ploidy of the sample in the VCF
header. The header generated with this option
is not a standard VCF header. However, it
helps NGSEP to keep track of the ploidy of
the samples through downstream analyses.
-minMQ INT : Minimum mapping quality to call an alignment
unique. Default: 20
-knownVariants FILE : VCF file with variants to be genotyped. Only
these variants will appear in the output VCF.
-querySeq STRING : Call variants just for this sequence.
-first INT : Call variants just from this position in the
given query sequence.
-last INT : Call variants just until this position in the
given query sequence.
-ignoreLowerCaseRef : Ignore sites where the reference allele is
lower case.
-maxAlnsPerStartPos INT : Maximum number of alignments allowed to start
at the same reference site. This parameter
helps to control false positives produced by
PCR amplification artifacts. In this command,
this filter is executed independently for
each read group. For GBS or RAD sequencing
data use a large value such as 100. Default: 5
-p : Process non unique primary alignments in the
pileup process. The default behavior is to
process alignments that are unique
(see option -minMQ).
-s : Consider secondary alignments in the pileup
process. Non-unique primary alignments will
also be considered in this mode.
-ignore5 INT : Ignore this many base pairs from the 5' end of
the reads. Default: 0
-ignore3 INT : Ignore this many base pairs from the 3' end of
the reads. Default: 0
-h DOUBLE : Prior heterozygosity rate. Default: 0.001
-maxBaseQS INT : Maximum value allowed for a base quality
score. Larger values will be equalized to
this value. This parameter allows to reduce
the effect of sequencing errors with high
base quality scores. Default: 30
-knownSTRs FILE : File with known short tandem repeats (STRs).
This is a text file with at least three
columns: chromosome, first position and last
position. Positions should be 1-based and
inclusive.
-minQuality INT : Minimum variant quality. In this command,
this filter applies to the QUAL column of the
VCF, which is calculated for each variant as
the maximum of the genotype qualities of
samples with non-homozygous reference
genotype calls. See the command VCFFilter to
apply filters of quality and read depth on
individual genotype calls. Default: 40
-embeddedSNVs : Flag to call SNVs within STRs. By default,
STRs are treated as a single locus and hence
no SNV will be called within an STR.
Alignments should be provided in SAM, BAM or CRAM format (see http://samtools.github.io/hts-specs for details). The reference genome should be provided in fasta format. It is assumed that the sequence names in the alignments file correspond with the sequence names in this reference assembly. For this module, alignment files must include RG headers including the ID of each read group and the corresponding sample. A typical read group header looks as follows
@RG ID:
According to the specification, read groups must be unique across different samples. Each alignment must include an RG tag indicating the id of the read group of the aligned read. See the SAM format specification for details. Check the documentation of your read aligner to make sure that alignment files contain read group headers. This module uses read group headers to distribute the reads that belong to the different samples.
DETAILS OF OUTPUT FILES: The output of this module is a VCF file (see the standard format at http://samtools.github.io/hts-specs/VCFv4.2.pdf). NGSEP uses standard output file formats such as VCF and GFF to facilitate integration with other tools and visualization in web genome browsers such as jbrowse, gbrowse and the UCSC genome browser, or desktop browsers such as the Integrative Genomics Viewer (IGV). This allows integrated visuallzation of read alignments, variants and functional elements. Moreover, the NGSEP output files provide as optional fields custom information on the variants and genotype calls. For each variant, NGSEP VCF files include the following custom fields in the INFO column (described also in the VCF header):
TYPE (STRING) : Type of variant for variants different than biallelic SNVs. Possible types include MULTISNV, INDEL and STR (short tandem repeat). Also, SNVs called within INDELS or STRs are tagged with the EMBEDDED type. CNV (INT) : Number of samples with CNVs covering this variant. This will not be generated by this command but by the classical per sample analysis, and only if the read depth analysis is executed NS (INT) : Number of samples genotyped. MAF (DOUBLE) : Minor allele frequency. Calculated by the VCFMerge and the VCFFilter commands AN (INT) : Number of different alleles observed in called genotypes. AFS (INT*) : Allele counts over the population for all alleles, including the reference. One number per allele.
Additionally, the Annotate command adds the INFO fields TA, TID, TGN, TCO and TACH with the results of the functional annotation (See the Annotate command below for details). NGSEP VCFs also include custom format fields with the following information for each genotype call:
ADP (INT,INT,...) : Number of base calls (depth) for alleles, including the reference allele. The order of the counts is, first the depth of the reference allele and then the read depths of the alternative alleles in the order listed in the ALT field. BSDP (INT,INT,INT,INT) : For SNVs, number of base calls (depth) for the 4 possible nucleotides, sorted as A,C,G,T. ACN (INT, INT, ...) : Predicted copy number of each allele taking into account the prediction of number of copies of the region surrounding the variant. The order is the same that in the ADP field
WARNING 1: Since v3.2.0, for RAD Sequencing or genotype-by-sequencing (GBS) we now recommend to perform variants detection and genotyping using this module. However, using the default value of the parameter to control for PCR duplicates (maxAlnsPerStartPos) will yield very low sensitivity. We recommend to increase the value of the parameter to about 100 to retain high sensitivity while avoiding a severe penalty in memory usage. The default usage for RAD-Seq or GBS samples becomes:
java -jar NGSEPcore.jar MultisampleVariantsDetector -maxAlnsPerStartPos 100 -r
WARNING 2: Unlike the behavior of the classical individual analysis per sample, in this command the filter executed using the minQuality option applies to the QUAL field of the VCF format, which corresponds to the probability of existence of each variant (regardless of the individual genotype calls). In this module the QUAL probability is calculated for each variant as the maximum of the genotype qualities of samples with non-homozygous reference genotype calls. The rationale for this calculation is that one variant should be real if it is confidently called in at least one sample. Individual genotype calls are not filtered by default and hence they could include some false positives. Please see the VCFFilter command to perform custom filtering of genotype calls, either by genotype quality (GQ format field) or by read depth (BSDP and ADP format fields). Default values of other parameters are also set to maximize sensitivity. For conservative variant detection including control for errors in base quality scores and PCR amplification artifacts use:
java -jar NGSEPcore.jar MultisampleVariantsDetector -maxAlnsPerStartPos 2 -r
If the error rate towards the three prime end increases over 2% you can also use the option -ignore3 to ignore errors at those read positions. If the reference genome has lowercase characters for repetitive regions (usually called softmasked), these regions can be directly filtered using the option -ignoreLowerCaseRef. These regions can also be filtered at later stages of the analysis using the VCFFilter command.
This is the classic module of NGSEP to call SNVs, small indels and structural variants from sequencing data of single individuals. Basic usage requires an alignments file in SAM, BAM or CRAM format, the reference genome that was used to produce the alignments, and a prefix for the output files. Alignments must be sorted by reference coordinates.
USAGE:
java -jar NGSEPcore.jar SingleSampleVariantsDetector
OPTIONS:
-i FILE : Input file with read alignments.
-r FILE : Fasta file with the reference genome.
-o FILE : Prefix for the output files.
-sampleId STRING : Id of the sample for the VCF file. If not set
it looks in the BAM file header for an SM
header tag. If this tag is not present, it
uses the default value. Default: Sample
-ploidy INT : Ploidy of the sample to be analyzed.
Default 2
-psp : Flag to print a header in the VCF file with
the id and the ploidy of the sample. The
header generated with this option is not a
standard VCF header. However, it helps NGSEP
to keep track of the ploidy of each sample
through downstream analyses.
-minMQ INT : Minimum mapping quality to call an alignment
unique. Default: 20
-knownVariants FILE : VCF file with variants to be genotyped. Only
these variants will appear in the output vcf
file. With this option homozygous calls to
the reference allele will be reported
-querySeq STRING : Call variants just for this sequence.
-first INT : Call variants just from this position in the
given query sequence
-last INT : Call variants just until this position in the
given query sequence
-ignoreLowerCaseRef : Ignore sites where the reference allele is
lower case.
-maxAlnsPerStartPos INT: Maximum number of alignments allowed to start
at the same reference site. This parameter
helps to control false positives produced by
PCR amplification artifacts. For GBS or RAD
sequencing data use a large value such as 100.
Default 5.
-p : Process non unique primary alignments in the
pileup process. The default behavior is to
process alignments that are unique
(see option -minMQ)
-s : Consider secondary alignments while calling
SNVs. Non-unique primary alignments will also
be considered in this mode.
-ignore5 INT : Ignore this many base pairs from the 5' end
of the reads. Default: 0
-ignore3 INT : Ignore this many base pairs from the 3' end
of the reads. Default: 0
-h FLOAT : Prior heterozygosity rate. Default: 0.001
-maxBaseQS INT : Maximum value allowed for a base quality
score. Larger values will be equalized to
this value. This parameter allows to reduce
the effect of sequencing errors with high
base quality scores. Default: 30
-knownSTRs FILE : File with known short tandem repeats (STRs).
This is a text file with at least three
columns, chromosome, first position and last
position. Positions should be 1-based and
inclusive.
-minQuality INT : Minimum genotype quality to accept a SNV call
Genotype quality is calculated as 1 minus the
posterior probability of the genotype given
the reads (in phred scale). Default: 0
-embeddedSNVs : Flag to call SNVs within STRs. By default,
STRs are treated as a single locus and hence
no SNV will be called within an STR.
-csb : Calculate a exact fisher test p-value for
strand bias between the reference and the
alternative allele
-knownSVs FILE : File with coordinates of known structural
variants in GFF format.
-minSVQuality INT : Minimum quality score (in PHRED scale) for
structural variants. Default: 20
-runRep : Turns on the procedure to find repetitive
regions based on reads with multiple
alignments.
-runRD : Turns on read depth (RD) analysis to identify
CNVs
-genomeSize INT : Total size of the genome to use during
detection of CNVs. This should be used when
the reference file only includes a part of
the genome (e.g. a chromosome or a partial
assembly).
-binSize INT : Size of the bins to analyze read depth.
Default: 100
-algCNV STRING : Comma-separated list of read depth algorithms
to run (e.g. CNVnator,EWT). Default: CNVnator
-maxPCTOverlapCNVs INT : Maximum percentage of overlap of a new CNV
with an input CNV to include it in the output
Default: 100 (No filter)
-runRP : Turns on read pair plus split-read analysis
(RP+SR) to identify large indels and
inversions.
-maxLenDeletion INT : Maximum length of deletions that the
RP analysis can identify. Default: 1000000
-sizeSRSeed INT : Size of the seed to look for split-read
alignments. Default: 8
-ignoreProperPairFlag : With this option, the proper pair flag will
not be taken into accout to decide if the ends
of each fragment are properly aligned. By
default, the distribution of insert length is
estimated only taking into account reads with
the proper pair flag turned on.
-runOnlySVs : Turns off SNV detection. In this mode, only
structural variation will be called
-runLongReadSVs : Runs the DBScan algorithm to identify structural
variants from alignments of long reads
Alignments should be provided in SAM, BAM or CRAM format (see http://samtools.github.io/hts-specs for details). The reference genome should be provided in fasta format. The output for SNVs, small indels and STRs is a VCF file. These standard formats are used to facilitate integration with other tools. See more details above in the description of the MultisampleVariantsDetector command. Structural variants are reported in a gff file (see the standard format at http://www.sequenceontology.org/gff3.shtml). This file can be used as a parameter of the variants detector (option "-knownSVs") which is useful to avoid recalculation of structural variants while genotyping known variants. GFF files provided by NGSEP include the following INFO fields:
LENGTH (INT) : Predicted length of the event. For insertions and deletions identified with read pair analysis, this length is not the reference span but the average of the lengths predicted by each read pair having an alignment with a predicted insert length significantly larger or shorter than the average fragment length. SOURCE (STRING) : Algorithm that originated each variant call. Current values include MultiAlns for repeats, Readpairs for read pair analysis and CNVnator and EWT for read depth analysis. NSF (INT) : Number of fragments supporting the structural variation event. For read depth algorithms is the (raw) number of reads that can be aligned within the CNV. For read pair analysis is the number of fragments (read pairs) that support the indel or the inversion. For repeats is the number of reads with multiple alignments NC (DOUBLE) : For CNVs called with the read depth algorithms this is the estimated number of copies. It is kept as a real number to allow users to filter by proximity to an integer value if needed. HET (INT) : For CNVs called with the read depth algorithms this is the number of heterozygous genotype calls in the VCF file enclosed within the CNV. Always zero if the option -noSNVS is used. NTADF (INT) : For CNVs called with the read depth algorithms this is the number of paired-end fragments showing an alignment pattern consistent with a tandem duplication NTRDF (INT) : For CNVs called with the read depth algorithms this is the number of paired-end fragments in which one read aligns within the CNV and its pair aligns to another chromosome or with a very long insert length. These fragments are useful to classify the CNV as an interspersed (trans) duplication. TGEN (STRING) : For CNVs called with the read depth algorithms this is a qualitative evaluation of the genotype call based on the values of the fields NC, NTADF and NTRDF, and on the normal ploidy of the sample. Possible values are DEL, TANDEMDUP and TRANSDUP. NSR (INT) : Number of reads with split alignments supporting an insertion or deletion. Events supported only by split-read analysis have NSF=0 and NSR>0. NUF (INT) : For repeats identified from reads aligning to multiple locations, this is the number of fragments with unique alignments within the repeat.
The default minimum genotype quality of the variants detector (0) will maximize the number of called variants at the cost of generating some false positives in samples with small coverage or high sequencing error rates. For conservative variant calling from whole genome sequencing reads use:
java -jar NGSEPcore.jar SingleSampleVariantsDetector -maxAlnsPerStartPos 2 -minQuality 40 -r
If interested in structural variation, for Illumina reads you can add the options to run read depth (RD) and read pair plus split read (RP+SR) approaches to identify structural variation:
java -jar NGSEPcore.jar SingleSampleVariantsDetector -runRD -runRP -maxAlnsPerStartPos 2 -minQuality 40 -r
To detect structural variants from long reads, add the option -runLongReadSVs
java -jar NGSEPcore.jar SingleSampleVariantsDetector -runLongReadSVs -maxAlnsPerStartPos 2 -minQuality 40 -r
If the error rate towards the three prime end increases over 2% you can also use the option -ignore3 to ignore errors at those read positions. If the reference genome has lowercase characters for repetitive regions (usually called softmasked), these regions can be directly filtered using the option -ignoreLowerCaseRef. These regions can also be filtered at later stages of the analysis using the VCFFilter command.
Since v 3.2.0, for RAD Sequencing or genotype-by-sequencing (GBS) we now recommend the MultisampleVariantsDetector command described above. However, the classic per-sample analysis pipeline using this command is still functional with good quality. For both commands it is important to keep in mind that using the default value of the parameter to control for PCR duplicates (maxAlnsPerStartPos) will yield very low sensitivity. We recommend to increase the value of the parameter to about 100 to retain high sensitivity while avoiding a severe penalty in memory usage. Also, structural variants should not be called using these data. The usage for conservative variant calling in RAD-Seq or GBS samples becomes:
java -jar NGSEPcore.jar SingleSampleVariantsDetector -maxAlnsPerStartPos 100 -minQuality 40 -r
This module can also be used to discover variants at different allele frequencies in pooled samples. For this use case, the ploidy parameter should be adjusted to the number of haplotypes present within the pool. The prior heterozygosity rate should also be increased according to the expected proportion of heterozygous variants across the entire population. For targeted sequencing, the maxAlnsPerStartPos parameter should also be adjusted to make it larger than the expected maximum read depth per site. For example, if 50 diploid individuals were included in a pool and sequenced at 20x per individual, then this parameter should be larger than 1000. This is a usage example to identify low frequency variants from a pool of 48 individuals in a Tilling experiment:
java -jar NGSEPcore.jar SingleSampleVariantsDetector -maxAlnsPerStartPos 5000 -h 0.1 -ploidy 96 -r
Performs molecular haplotyping on a single individual given a VCF and a set of alignments in SAM, BAM or CRAM format. Although theoretically it can work with Illumina reads, it is designed to work fine with long (PacBio) reads. Alignments must be sorted by reference coordinates. Writes to standard output unless the -o option is used to specify an output file.
USAGE:
java -jar NGSEPcore.jar SIH
OPTIONS:
-i FILE : Input VCF file with variants to phase.
-b FILE : Input file with read alignments.
-o FILE : Output VCF file with phased variants.
-a STRING : Algorithm for single individual haplotyping. It can
be Refhap or DGS. Default: DGS
-minMQ INT : Minimum mapping quality to call an alignment unique.
Default: 20
-r GENOME : Fasta file with the reference genome. Required for
CRAM files.
The classical pipeline of NGSEP performs two steps to merge variants and genotype calls from different samples into an integrated VCF file. After independent variant discovery using the command SingleSampleVariantsDetector for each sample, the next step is to generate a file including the whole set of variants called in at least one of the samples. This can be done calling the MergeVariants command, which has the following usage:
USAGE:
java -jar NGSEPcore.jar MergeVariants
OPTIONS:
-s FILE : List of sequence names as they appear in the original
reference genome
-o FILE : Output VCF file with merged variants
The sequence names file is a text file which just has the ids of the sequences in the reference. It is used by the program to determine the order of the reference sequences. In unix systems this file can be obtained running the following command on the fasta file with the reference genome:
awk '{if(substr($1,1,1)==">") print substr($1,2) }'
If samtools is available. The fai index file provided by this tool can also be used as a sequence names file. The fai index is generated with this command:
samtools faidx
The output file of the merge program is a vcf with the union of variants reported by the input files but without any genotype information.
The next step is to genotype for each sample the variants produced by MergeVariants using the variants detector (See SingleSampleVariantsDetector). For each sample, the command to execute at this stage (in conservative mode) should look like this:
java -jar NGSEPcore.jar SingleSampleVariantsDetector -maxAlnsPerStartPos 2 -minQuality 40 -knownVariants
where VARS_FILE is the output file obtained in the first step of the merging process. At the end, this will produce a second set of vcf files which will differ from the first set in the sense that they will include calls to the reference allele. The last step is to join these new vcf files using the VCFMerge command:
USAGE:
java -jar NGSEPcore.jar VCFMerge
OPTIONS:
-s FILE : List of sequence names as they appear in the original
reference genome
-o FILE : Output VCF file with merged variants and genotype information
This command will write the final vcf file with the genotype calls for each variant on each sample.
For tilling experiments, this module takes variants from pools and a pools descriptor and calls individual variants. It receives a list of VCF files generated either by the SingleSampleVariantsDetector or the MultisampleVariantsDetector commands and a pools configuration file and generates a VCF file with individual genotyping based on the variants identified within the pools.
USAGE:
java -jar NGSEPcore.jar TillingPoolsIndividualGenotyper
OPTIONS:
-r GENOME : Fasta file with the reference genome.
-o FILE : Output file with called variants in VCF format
-d FILE : File with the information of individuals assigned to
each pool
-m INT : Maximum number of pools in which a variant can
appear. Default 0 (no filter).
-b : Report only biallelic variants.
A pools configuration file must be provided with the option -d. It should be a text file separated by semicolon and having one row for each individual. The first entry on each row should be the individual id. The remaining entries should be the ids of the different pools where the individual was included. For example, if individual 20 was included in pools with ids 2, 10 and 14, the line should look like this:
20;2;10;14
The sample ids within input pool VCF files must coincide with the pool ids present in the pools configuration file. The VCF files can include information for one or more poools.
Calculates a distribution of relative allele counts for sites showing base calls for more than one nucleotide from read alignment files in SAM, BAM or CRAM format. This analysis is useful to predict the ploidy of a sequenced sample. Alignments must be sorted by reference coordinates.
USAGE:
java -jar NGSEPcore.jar RelativeAlleleCountsCalculator
OPTIONS:
-i FILE : Input file with read alignments.
-o FILE : Output file with statistics.
-r GENOME : Fasta file with the reference genome. Required for
CRAM files.
-m INT : Minimum read depth Default: 10
-M INT : Maximum read depth Default: 1000
-q INT : Minimum base quality score (Phred scale) Default: 20
-frs FILE : File with repeats (or any kind of genomic regions)
that should not be taken into account in the
analysis. The format of this file should contain
three columns: Sequence name (chromosome), first
position in the sequence, and last position in the
sequence. Both positions are assumed to be 1-based.
-srs FILE : File with genomic regions that should be taken into
account in the analysis. The format of this file
should contain three columns: Sequence name
(chromosome), first position in the sequence, and
last position in the sequence. Both positions are
assumed to be 1-based.
-of FILE : Separate output file with the complete counts
information for sites with more than one allele. The
file has the following data separated by tab:
sequence name, position, read depth, number of
different alleles, max depth, and second max depth
-s : Consider secondary alignments. By default, only
primary alignments are processed.
This function compares the read depth of two samples to predict regions with relative copy number variation (CNV) between a sample and a control. It takes two alignment files and a reference genome, splits the genome into windows, and for each window compares the read depth between the two samples. It outputs a text file containing the list of windows of the genome in which the normalized read depth ratio between the two samples is significantly different from 1. Alignments can be provided in SAM, BAM or CRAM format and must be sorted by reference coordinates.
USAGE:
java -jar NGSEPcore.jar CompareRD
OPTIONS:
-i FILE : Input file with alignments to a reference genome.
-c FILE : Input alignments file corresponding to the control
(wild type) sample.
-o FILE : File with genomic regions in which the two samples
have different read depth.
-r FILE : Fasta file with the reference genome.
-w INT : Window length to be used during the read depth
comparison. Default: 100
-p DOUBLE : Maximum p-value. Only the windows with a p-value
lower than that specified will be reported.
Default: 0.001
-a : Output an entry for every window in the genome.
-gc : Perform GC-correction of the read depth.
-b : Perform the Bonferroni correction for multiple
testing.
The output text file contains the following columns:
Loads a transcriptome annotation in GFF3 format, logs format errors, provides statistics on the assembled transcriptome, generates cDNA, CDS and protein sequences.
USAGE:
java -jar NGSEPcore.jar TranscriptomeAnalyzer
OPTIONS:
-i FILE : Input GFF3 file with gene annotations. It can be gzip
compressed.
-o FILE : Prefix of the output files. It can be an absolute path
finished by the prefix
-r FILE : Fasta file with the reference genome. It can be gzip
compressed.
Loads a transcriptome annotation in GFF3 format and generates a filtered file by CDS length, presence of start and stop codons and intersection with other regions. Writes to standard output unless the -o option is used to specify an output file.
USAGE:
java -jar NGSEPcore.jar TranscriptomeFilter
OPTIONS:
-i FILE : Input GFF3 file with gene annotations. It can be gzip
compressed.
-o FILE : Output file with filtered genes. See option -f for
output format options.
-r FILE : Fasta file with the reference genome.
-f INT : Output format. 0: GFF3, 1: gene list, 2: gene
regions, 3: transcript list, 4: transcript regions.
Default: 0
-c : Output only complete transcripts (with start and stop
codons) in the output file.
-l INT : Minimum protein length for coding transcripts in the
output file. Default: 0
-frs FILE : File with genomic regions in which transcripts should
be filtered out. The format of this file should
contain three columns: Sequence name (chromosome),
first position in the sequence, and last position in
the sequence. Both positions are assumed to be 1-based.
-srs FILE : File with genomic regions in which transcripts should
be selected. The format of this file should contain
three columns: Sequence name (chromosome), first
position in the sequence, and last position in the
sequence. Both positions are assumed to be 1-based.
-fgid FILE : File with ids of genes that should be filtered out.
The first column should have the gene ids. Other
columns are ignored.
-sgid FILE : File with ids of genes that should be selected. The
first column should have the gene ids. Other columns
are ignored.
This module takes a list of assembled genomes in fasta format and their corresponding transcriptomes in GFF3 format and runs whole genome comparisons. It calculates orthogroups including orthologs and paralogs. It also identifies synteny relationships between each pair of genomes. Finally, it calculates gene presence/absence matrices and classifies gene families as core or accessory.
USAGE:
java -jar NGSEPcore.jar GenomesAligner
or
java -jar NGSEPcore.jar GenomesAligner -d
OPTIONS:
-d STRING : Directory having the input genomes in fasta format
and the genome annotations in gff3 format.
-i STRING : Input file with genome identifiers. These identifiers
are used as prefixes for the fasta and gff3 files.
-o STRING : Prefix for output files. Default: genomesAlignment
-k INT : K-mer length to find orthologs. Default: 6
-p INT : Minimum percentage of k-mers to call orthologs.
Default: 11
-s : Skip the MCL clustering phase and return unfiltered
orthogroups.
-yh INT : Minimum number of consistent homology units to call a synteny
block. Default: 6
-yd INT : Maximum distance (in basepairs) between homology units to
include them within the same synteny block. Default: 100000
-f DOUBLE : Minimum frequency to classify soft core gene families.
Default: 0.9
-t INT : Number of threads. Default: 1
The options -d and -i are useful to process large numbers of genome assemblies. The file referred with the option -i should have one genome identifier for each line:
Genome1 Genome2 Genome3 ... GenomeN
For each genome, the module will look into the directory referred with the option -d for one fasta file and one gff3 file having as prefix the genome identifier. Possible suffixes for the fasta file include .fa, .fna, .fas and .fasta and their gzip compressed extensions .fa.gz, .fna.gz, .fas.gz and .fasta.gz Possible suffixes for the annotation file include .gff and .gff3 and their gzip compressed extensions .gff.gz and .gff3.gz
The output is a series of text files having the ids and physical coordinates of
the paralogs within each genome and the orthologs between the two genomes.
The ortholog files, called
The files with the paralogs, called