eclarke / komplexity

A method of assessing sequence complexity based on kmer frequencies
28 stars 9 forks source link

komplexity

Build Status

A command-line tool built in Rust to quickly calculate and/or mask low-complexity sequences from a FAST[A/Q] file. This uses the number of unique k-mers over a sequence divided by the length to assess complexity.

We've noticed that, for k=4, a normalized complexity score of < 0.55 suggests across a 64-120bp region strongly that the sequence is a low-complexity repeat sequence.

Installation

Assuming you have Rust and Cargo installed:

git clone https://github.com/eclarke/komplexity
cd komplexity
cargo install

Or, using Conda:

conda install -c eclarke komplexity

If Cargo's default bin directory is in your path (usually $HOME/.cargo/bin), you can run this by just typing kz.

Usage

This tool has two modes: measuring (default) and masking. Both modes read from standard in and write to standard out.

Measuring

Measuring mode reports the length of a sequence, the number of unique kmers in that sequence, and the normalized complexity (unique kmers / length). As this looks across the entire sequence, it is most appropriate for short reads (e.g. Illumina), since longer sequences will have low normalized complexity simply due to greater length.

# For fastq files:
kz < seqs.fq
# For fasta files:
kz --fasta < seqs.fa
# Vary length of k:
kz -k5 < seqs.fq

The output is a list of sequence IDs from the fastq file with the length of the sequence, the number of unique k-mers in the sequence, and the number of unique k-mers/length (normalized complexity) in a tab-delimited format to stout.

Masking

Masking mode (--mask) outputs the input sequences with low-complexity regions masked by Ns. The low-complexity regions are found using a sliding window (of size --window_size) over the sequence; the complexity of the subsequence is assessed as in the "measuring" mode. The threshold below which the sequence is masked is configurable through the --threshold parameter and should be a value between 0-1 (least to most masking).

# For fastq files:
kz --mask < seqs.fq
# For fasta files:
kz --mask --fasta < seqs.fa
# Vary threshold:
kz --mask --threshold 0.7 < seqs.fq
# Vary window size:
kz --mask --window_size 64 < seqs.fq

Filtering

Filtering mode (--filter) outputs the input sequences with low-complexity sequences omitted. The threshold below which the sequence is masked is configurable through the --threshold parameter and should be a value between 0-1 (least to most masking). Note that this is most useful for Illumina reads or other short sequences, as scores will approach 1 for longer sequences.

# For fastq files:
kz --filter < seqs.fq
# For fasta files:
kz --filter --fasta < seqs.fa
# Vary threshold:
kz --filter --threshold 0.7 < seqs.fq

Note that you cannot filter and mask at the same time.

Parallelization

I didn't write any parallelization code into this as it's mostly I/O bound. Instead I suggest using GNU Parallel as follows:

FASTQ files

Use -L4 to define a record as being four lines. The -j parameter defines the number of cores to use, and --round-robin passes records to each of the spawned jobs. Any options to kz can be specified as normal. The input order is not maintained using this method.

cat <input> | parallel --round_robin -L4 -j<cores> --pipe kz

FASTA files

As above, but the record is now defined using --recstart '>'.

cat <input> | parallel --round_robin --recstart '>' -j<cores> --pipe kz --fasta