bjmt / sequence-utils

Sequence shuffler and other sequence-related utilities.
GNU General Public License v3.0
1 stars 0 forks source link
sequence shuffling utils

sequence-utils

Installation

git clone https://github.com/bjmt/sequence-utils
cd sequence-utils
make

The following binaries are created:

bin/countfa       Count length of sequences inside a fasta file
bin/countlets     Count k-lets in a string
bin/countwin      Count k-lets by window in a string
bin/seqgen        Generate a random string
bin/shuffler      Shuffle a string or sequences inside a fasta file

Run these with the -h flag to see usage.

countfa

Counts the number of characters per sequence in a fasta file. For each sequence, the name followed by the character count are returned to stdout. The aim is to count sequence lengths without taking up too much memory. To this end, only one character is loaded into memory at a time.

Example usage:

echo ">1\nACAAG\n>2\nGCCCGGTTAT" | bin/countfa

    >1
    5
    >2
    10

countlets

This utility counts the total number of k-lets in the input sequence. Be aware that the total number of k-lets is n^k, where n is the alphabet length. For use cases involving memory constraints, providing the sequence alphabet ahead of time will allow countlets to count k-lets while only needing to load k + 1 letters into memory at a time. When the alphabet is provided, it will typically never take up more than several MBs of memory. Optionally, k-lets with counts of zero can be ommitted from the output.

Example usage:

bin/countlets -k 1 -i example/sequence.txt

    A  3301
    C  1572
    G  1619
    T  3508

bin/countlets -k 1 -a ACGT -i example/sequence.txt

    A  3301
    C  1572
    G  1619
    T  3508

echo ACGTGA | bin/countlets -k 4 -n

    ACGT  1
    CGTG  1
    GTGA  1

countwin

Count k-lets in the input string in windows. The window and step sizes can be controlled. The result is a tsv-formatted table with columns START, STOP, LET, and COUNT. Optionally, rows where the COUNT column is zero can be ommitted from the output.

Example usage:

echo ACGTGTGA | bin/countwin -a ACGT -k 3 -s 1 -n

    START   STOP    LET     COUNT
    1       3       ACG     1
    2       4       CGT     1
    3       5       GTG     1
    4       6       TGT     1
    5       7       GTG     1
    6       8       TGA     1

bin/countwin -i example/sequence.txt -a ACGT -k 1 -w 5000

    START   STOP    LET     COUNT
    1       5000    A       1607
    1       5000    C       815
    1       5000    G       828
    1       5000    T       1750
    5001    10000   A       1694
    5001    10000   C       757
    5001    10000   G       791
    5001    10000   T       1758

seqgen

Create random sequences from any alphabet. Letters can be made up of any number of characters. Weights can be provided to modify random generation. If any of the input letters contain spaces, use quotation marks. New letters are streamed directly to the output, meaning memory usage is independent of output sequence length. Use commas to separate letters with multiple characters.

Example usage:

bin/seqgen -a ACGT -l 10 -s 11

  CGAACTATTC

bin/seqgen -a ACGT -l 1000 -s 1 | bin/countlets

  A  246
  C  258
  G  233
  T  263

bin/seqgen -a "A,B,CD, " -l 10 -s 3

  BCD BBBAAB

bin/seqgen -a ACGT -l 10 -s 11 -w 1,0.5,0.5,1

  TTAGAATTTT

bin/seqgen -a ACGT -l 1000 -s 11 -w 1,0.5,0.5,1 | bin/countlets

  A  338
  C  150
  G  169
  T  343

shuffler

Shuffles an input sequence using one of three methods. The default, euler, preserves exact k-let counts by performing a random Eulerian walk through the sequence. The implementation is based on the concept proposed by Altschul and Erickson (1985) as well as the cycle-popping algorithm proposed by Propp and Wilson (1998) for finding randomised Eulerian paths. One warning regarding this method: the number of vertices is equal to n^k, where n is the alphabet length. This means that if trying to shuffle a DNA sequence with k = 15, there will be over one billion vertices; and every vertex present in the input sequence has to be walked to the ending vertex, and those walks could potential be quite long themselves. Once a new Eulerian path has been found, the process is quite fast regardless of k; but the program may stall for quite some time trying to find such a path.

The second method, linear, splits the sequence every k letters before shuffling these around.

See:

AAACAGATT

After splitting (k = 2):

AA AC AG AT T
1  2  3  4  r

The k-let indices are shuffled, and the shuffled sequence reassembled from the split k-lets by their new index position. Any remainder letters which do not fit in a full k-let are left at the end.

The third method, markov, works by first getting the total number of all possible k-lets. Following this, a new sequence (of equal length) is created. Every new letter is chosen based on the probability of that letter (and the last k - 1 letters) appearing in a k-let in the original sequence. This results in a new sequence with similar (but not identical) k-let counts. The idea for this method of (pseudo-)shuffling is discussed by Fitch (1983).

Note: these methods only apply for k > 1. Otherwise, a simple shuffle call is performed.

Example usage:

euler shuffling:

bin/shuffler -i example/sequence.txt -k 2 | bin/countlets

    A  3301
    C  1572
    G  1619
    T  3508

markov shuffling:

bin/shuffler -i example/sequence.txt -k 2 -s 1 -m | bin/countlets

    A  3326
    C  1500
    G  1613
    T  3560

linear + fasta-formatted shuffling:

echo ">seq\nASCASCASDASCASDASDASC" | bin/shuffler -k 2 -l -s 1 -f

    >seq
    CAASASASSCSCDAASSDDAC

repeatedly shuffle input:

echo AAAACCCCGGGGTTTT | bin/shuffler -n 4

    ACAATAGTGCTCTCGG
    CACCGAGATCTTGGAT
    TCATGCATACGGAGTC
    AGATAGGAGCCTTCCT
    TAAATCGAGGCGTTCC

A note on memory usage:

This program is not terribly memory efficient, usually requiring memory several times the size of the input sequence. This means shuffling billions of characters is not recommended unless you don't mind letting shuffler take up several GBs of memory.

Comparison with the command line version of uShuffle (Jiang et al. 2008):

Pros: Cons:

shuffler VS uShuffle benchmarks (using GNU Time):

A random DNA string with 100,000 characters was used as input. (e = euler, l = linear, m = markov)

k Program Peak memory (kB) Elapsed time (s)


1 uShuffle 1036 0.00 shuffler 1872 0.01

2 uShuffle 3796 0.00 shuffler (e) 3184 0.02 shuffler (l) 2272 0.01 shuffler (m) 2672 0.03

3 uShuffle 3796 0.01 shuffler (e) 3504 0.02 shuffler (l) 2144 0.01 shuffler (m) 2672 0.02

4 uShuffle 3780 0.01 shuffler (e) 3728 0.02 shuffler (l) 2080 0.00 shuffler (m) 2736 0.03

5 uShuffle 3792 0.01 shuffler (e) 3808 0.02 shuffler (l) 2032 0.01 shuffler (m) 2704 0.03

6 uShuffle 3820 0.01 shuffler (e) 3824 0.03 shuffler (l) 2016 0.00 shuffler (m) 2880 0.04

7 uShuffle 3908 0.02 shuffler (e) 4224 0.03 shuffler (l) 1984 0.00 shuffler (m) 3792 0.04

8 uShuffle 4296 0.05 shuffler (e) 5248 0.04 shuffler (l) 1984 0.01 shuffler (m) 8112 0.05

9 uShuffle 5384 0.08 shuffler (e) 12080 0.05 shuffler (l) 1872 0.00 shuffler (m) 28896 0.08

10 uShuffle 6380 0.10 shuffler (e) 39600 0.09 shuffler (l) 1872 0.00 shuffler (m) 107984 0.13

shuffler (euler) is quite competitive with uShuffle in terms of elapsed time and memory usage at low k (1-8). Above that shuffler becomes a bit inefficient; this is because shuffler creates a table of all possible k-lets (vertices) whereas uShuffle only keeps those found in the sequence in memory. Though realistically, most shuffling tasks do not require such high k values.

References

Altschul, S.F. and Erickson, B.W. (1985). Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage. Molecular Biology and Evolution, 2(6), 526-538.

Fitch, W.M. (1983). Random Sequences. Journal of Molecular Biology, 163(2), 171-176.

Jiang, M., Anderson, J., Gillespie, J. and Mayne, M. (2008). uShuffle: A Useful Tool for Shuffling Biological Sequences While Preserving the k-let Counts. BMC Bioinformatics, 9(192).

Propp, J.G. and Wilson, D.W. (1998). How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph. Journal of Algorithms, 27(2), 170-217.