git clone https://github.com/bjmt/sequence-utils
cd sequence-utils
make
The following binaries are created:
bin/countfa Count length of sequences inside a fasta file
bin/countlets Count k-lets in a string
bin/countwin Count k-lets by window in a string
bin/seqgen Generate a random string
bin/shuffler Shuffle a string or sequences inside a fasta file
Run these with the -h flag to see usage.
Counts the number of characters per sequence in a fasta file. For each sequence, the name followed by the character count are returned to stdout. The aim is to count sequence lengths without taking up too much memory. To this end, only one character is loaded into memory at a time.
Example usage:
echo ">1\nACAAG\n>2\nGCCCGGTTAT" | bin/countfa
>1
5
>2
10
This utility counts the total number of k-lets in the input sequence. Be aware that the total number of k-lets is n^k, where n is the alphabet length. For use cases involving memory constraints, providing the sequence alphabet ahead of time will allow countlets to count k-lets while only needing to load k + 1 letters into memory at a time. When the alphabet is provided, it will typically never take up more than several MBs of memory. Optionally, k-lets with counts of zero can be ommitted from the output.
Example usage:
bin/countlets -k 1 -i example/sequence.txt
A 3301
C 1572
G 1619
T 3508
bin/countlets -k 1 -a ACGT -i example/sequence.txt
A 3301
C 1572
G 1619
T 3508
echo ACGTGA | bin/countlets -k 4 -n
ACGT 1
CGTG 1
GTGA 1
Count k-lets in the input string in windows. The window and step sizes can be controlled. The result is a tsv-formatted table with columns START, STOP, LET, and COUNT. Optionally, rows where the COUNT column is zero can be ommitted from the output.
Example usage:
echo ACGTGTGA | bin/countwin -a ACGT -k 3 -s 1 -n
START STOP LET COUNT
1 3 ACG 1
2 4 CGT 1
3 5 GTG 1
4 6 TGT 1
5 7 GTG 1
6 8 TGA 1
bin/countwin -i example/sequence.txt -a ACGT -k 1 -w 5000
START STOP LET COUNT
1 5000 A 1607
1 5000 C 815
1 5000 G 828
1 5000 T 1750
5001 10000 A 1694
5001 10000 C 757
5001 10000 G 791
5001 10000 T 1758
Create random sequences from any alphabet. Letters can be made up of any number of characters. Weights can be provided to modify random generation. If any of the input letters contain spaces, use quotation marks. New letters are streamed directly to the output, meaning memory usage is independent of output sequence length. Use commas to separate letters with multiple characters.
Example usage:
bin/seqgen -a ACGT -l 10 -s 11
CGAACTATTC
bin/seqgen -a ACGT -l 1000 -s 1 | bin/countlets
A 246
C 258
G 233
T 263
bin/seqgen -a "A,B,CD, " -l 10 -s 3
BCD BBBAAB
bin/seqgen -a ACGT -l 10 -s 11 -w 1,0.5,0.5,1
TTAGAATTTT
bin/seqgen -a ACGT -l 1000 -s 11 -w 1,0.5,0.5,1 | bin/countlets
A 338
C 150
G 169
T 343
Shuffles an input sequence using one of three methods. The default, euler, preserves exact k-let counts by performing a random Eulerian walk through the sequence. The implementation is based on the concept proposed by Altschul and Erickson (1985) as well as the cycle-popping algorithm proposed by Propp and Wilson (1998) for finding randomised Eulerian paths. One warning regarding this method: the number of vertices is equal to n^k, where n is the alphabet length. This means that if trying to shuffle a DNA sequence with k = 15, there will be over one billion vertices; and every vertex present in the input sequence has to be walked to the ending vertex, and those walks could potential be quite long themselves. Once a new Eulerian path has been found, the process is quite fast regardless of k; but the program may stall for quite some time trying to find such a path.
The second method, linear, splits the sequence every k letters before shuffling these around.
See:
AAACAGATT
After splitting (k = 2):
AA AC AG AT T
1 2 3 4 r
The k-let indices are shuffled, and the shuffled sequence reassembled from the split k-lets by their new index position. Any remainder letters which do not fit in a full k-let are left at the end.
The third method, markov, works by first getting the total number of all possible k-lets. Following this, a new sequence (of equal length) is created. Every new letter is chosen based on the probability of that letter (and the last k - 1 letters) appearing in a k-let in the original sequence. This results in a new sequence with similar (but not identical) k-let counts. The idea for this method of (pseudo-)shuffling is discussed by Fitch (1983).
Note: these methods only apply for k > 1. Otherwise, a simple shuffle call is performed.
Example usage:
euler shuffling:
bin/shuffler -i example/sequence.txt -k 2 | bin/countlets
A 3301
C 1572
G 1619
T 3508
markov shuffling:
bin/shuffler -i example/sequence.txt -k 2 -s 1 -m | bin/countlets
A 3326
C 1500
G 1613
T 3560
linear + fasta-formatted shuffling:
echo ">seq\nASCASCASDASCASDASDASC" | bin/shuffler -k 2 -l -s 1 -f
>seq
CAASASASSCSCDAASSDDAC
repeatedly shuffle input:
echo AAAACCCCGGGGTTTT | bin/shuffler -n 4
ACAATAGTGCTCTCGG
CACCGAGATCTTGGAT
TCATGCATACGGAGTC
AGATAGGAGCCTTCCT
TAAATCGAGGCGTTCC
A note on memory usage:
This program is not terribly memory efficient, usually requiring memory several times the size of the input sequence. This means shuffling billions of characters is not recommended unless you don't mind letting shuffler take up several GBs of memory.
Comparison with the command line version of uShuffle (Jiang et al. 2008):
Pros: Cons:
shuffler VS uShuffle benchmarks (using GNU Time):
A random DNA string with 100,000 characters was used as input. (e = euler, l = linear, m = markov)
k Program Peak memory (kB) Elapsed time (s)
1 uShuffle 1036 0.00 shuffler 1872 0.01
2 uShuffle 3796 0.00 shuffler (e) 3184 0.02 shuffler (l) 2272 0.01 shuffler (m) 2672 0.03
3 uShuffle 3796 0.01 shuffler (e) 3504 0.02 shuffler (l) 2144 0.01 shuffler (m) 2672 0.02
4 uShuffle 3780 0.01 shuffler (e) 3728 0.02 shuffler (l) 2080 0.00 shuffler (m) 2736 0.03
5 uShuffle 3792 0.01 shuffler (e) 3808 0.02 shuffler (l) 2032 0.01 shuffler (m) 2704 0.03
6 uShuffle 3820 0.01 shuffler (e) 3824 0.03 shuffler (l) 2016 0.00 shuffler (m) 2880 0.04
7 uShuffle 3908 0.02 shuffler (e) 4224 0.03 shuffler (l) 1984 0.00 shuffler (m) 3792 0.04
8 uShuffle 4296 0.05 shuffler (e) 5248 0.04 shuffler (l) 1984 0.01 shuffler (m) 8112 0.05
9 uShuffle 5384 0.08 shuffler (e) 12080 0.05 shuffler (l) 1872 0.00 shuffler (m) 28896 0.08
10 uShuffle 6380 0.10 shuffler (e) 39600 0.09 shuffler (l) 1872 0.00 shuffler (m) 107984 0.13
shuffler (euler) is quite competitive with uShuffle in terms of elapsed time and memory usage at low k (1-8). Above that shuffler becomes a bit inefficient; this is because shuffler creates a table of all possible k-lets (vertices) whereas uShuffle only keeps those found in the sequence in memory. Though realistically, most shuffling tasks do not require such high k values.
Altschul, S.F. and Erickson, B.W. (1985). Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage. Molecular Biology and Evolution, 2(6), 526-538.
Fitch, W.M. (1983). Random Sequences. Journal of Molecular Biology, 163(2), 171-176.
Jiang, M., Anderson, J., Gillespie, J. and Mayne, M. (2008). uShuffle: A Useful Tool for Shuffling Biological Sequences While Preserving the k-let Counts. BMC Bioinformatics, 9(192).
Propp, J.G. and Wilson, D.W. (1998). How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph. Journal of Algorithms, 27(2), 170-217.