Closed nterhoeven closed 5 years ago
The problem is caused by the different handling of https://github.com/BioInf-Wuerzburg/SeqFilter/blob/d92ff27b810be76e4f9394c4f9e98648ebfbc915/bin/SeqFilter#L730 and https://github.com/BioInf-Wuerzburg/SeqFilter/blob/d92ff27b810be76e4f9394c4f9e98648ebfbc915/bin/SeqFilter#L995. As @nterhoeven already pointed out. I created a test set, and a fix and will create a pull request to solve that issue.
The test set is generated by the script generate_test_set.pl
.
$ cat generate_test_set.pl
#!/usr/bin/env perl
use strict;
use warnings;
my @nucl=qw(A C G T);
my $percent_N = 3;
srand(20190211);
for(my $i=1; $i<=10; $i++)
{
my $seq_length = rand(25000);
my $name = sprintf(">seq%05d len:%d", $i, $seq_length);
my $seq = "";
while (length($seq)<$seq_length)
{
my $new_nucl = $nucl[rand(@nucl)] ;
$new_nucl = "N" if (rand(100)<$percent_N);
$seq .= $new_nucl;
}
[testset.fa.gz](https://github.com/BioInf-Wuerzburg/SeqFilter/files/2856004/testset.fa.gz)
print $name,"\n",$seq,"\n";
}
This results in a reproducible sequence set.
Nucleotids | Number | Percentage |
---|---|---|
A | 25073 | 24.41 |
C | 24584 | 23.93 |
G | 25103 | 24.44 |
N | 3049 | 2.97 |
T | 24904 | 24.25 |
AT | 49977 | 48.66 |
GC | 49687 | 48.37 |
Just let me add the test :)
I chose 100 random permutations for the base content N,AT,A,G,GC,T,C
for the test:
permutation.txt
for i in $(cat permutation.txt)
do
SeqFilter --base-content $i testset.fa 2>/dev/null
done | sort | uniq -c
running the SeqFilter calls result in just one single output line:
#source state reads bases max min N50 N90 %A %AT %C %G %GC %N %T
testset.fa RAW 10 102713 23333 993 15516 5903 24.41 48.66 23.93 24.44 48.37 2.97 24.25
containing the expected numbers.
Just a small bug in the output of
--base-content
results. Apparently, the order in the header is the same as specified in the command and the order of the results is alphabetical.Example: