BioInf-Wuerzburg / SeqFilter

Versatile FASTA/FASTQ sequence file analysis and modification tool
MIT License
9 stars 6 forks source link

--base-content output is not ordered correctly #7

Closed nterhoeven closed 5 years ago

nterhoeven commented 7 years ago

Just a small bug in the output of --base-content results. Apparently, the order in the header is the same as specified in the command and the order of the results is alphabetical.

Example:

$ SeqFilter --base-content N,GC my-sequences.fa
#source state   reads   bases   max min N50 N90 %N  %GC
my-sequences.fa RAW 3981    268398063   958516  1560    91768   30966   38.96   0.00
$ SeqFilter --base-content GC,N my-sequences.fa
#source state   reads   bases   max min N50 N90 %GC %N
my-sequences.fa RAW 3981    268398063   958516  1560    91768   30966   38.96   0.00
greatfireball commented 5 years ago

The problem is caused by the different handling of https://github.com/BioInf-Wuerzburg/SeqFilter/blob/d92ff27b810be76e4f9394c4f9e98648ebfbc915/bin/SeqFilter#L730 and https://github.com/BioInf-Wuerzburg/SeqFilter/blob/d92ff27b810be76e4f9394c4f9e98648ebfbc915/bin/SeqFilter#L995. As @nterhoeven already pointed out. I created a test set, and a fix and will create a pull request to solve that issue.

greatfireball commented 5 years ago

The test set is generated by the script generate_test_set.pl.

$ cat generate_test_set.pl
#!/usr/bin/env perl
use strict;
use warnings;
my @nucl=qw(A C G T);
my $percent_N = 3;
srand(20190211);
for(my $i=1; $i<=10; $i++)
{
    my $seq_length = rand(25000);
    my $name = sprintf(">seq%05d len:%d", $i, $seq_length);
    my $seq = "";
    while (length($seq)<$seq_length)
    {
    my $new_nucl = $nucl[rand(@nucl)] ;
    $new_nucl = "N" if (rand(100)<$percent_N);
    $seq .= $new_nucl;
    }
[testset.fa.gz](https://github.com/BioInf-Wuerzburg/SeqFilter/files/2856004/testset.fa.gz)

    print $name,"\n",$seq,"\n";
}

This results in a reproducible sequence set.

greatfireball commented 5 years ago

testset.fa.gz

Nucleotids Number Percentage
A 25073 24.41
C 24584 23.93
G 25103 24.44
N 3049 2.97
T 24904 24.25
AT 49977 48.66
GC 49687 48.37
greatfireball commented 5 years ago

Just let me add the test :)

I chose 100 random permutations for the base content N,AT,A,G,GC,T,C for the test: permutation.txt

for i in $(cat permutation.txt)
do
   SeqFilter --base-content $i testset.fa 2>/dev/null
done | sort | uniq -c

running the SeqFilter calls result in just one single output line:

#source     state  reads  bases   max    min  N50    N90   %A     %AT    %C     %G     %GC    %N    %T
testset.fa  RAW    10     102713  23333  993  15516  5903  24.41  48.66  23.93  24.44  48.37  2.97  24.25

containing the expected numbers.