agordon / fastx_toolkit

FASTA/FASTQ pre-processing programs
Other
167 stars 60 forks source link

fastx_quality_stats bug in median calculation #15

Open winni2k opened 7 years ago

winni2k commented 7 years ago

Hi, when I run the command

fastx_quality_stats -i input.fastq -o output.stats

, where input.fastq consists of

@0 <unknown description>
A
+
]
@1 <unknown description>
A
+
]

, then output.stats consists of

column  count   min     max     sum     mean    Q1      med     Q3      IQR     lW      rW      A_Count C_Count G_Count T_Count N_Count Max_count
1       2       60      60      120     60.00   60      50      50      -10     75      35      2       0       0       0
       0       2

. Note the med column has a value of 50, whereas the mean column has a value of 60, and the two quality scores in input.fastq are identical (]).

I believe this is a bug?

PS fastx_quality_stats -h prints:

usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

   [-h] = This helpful help screen.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = TEXT output file. default is STDOUT.
   [-N]         = New output format (with more information per nucleotide/cycle).

The *OLD* output TEXT file will have the following fields (one row per column):
    column  = column number (1 to 36 for a 36-cycles read solexa file)
    count   = number of bases found in this column.
    min     = Lowest quality score value found in this column.
    max     = Highest quality score value found in this column.
    sum     = Sum of quality score values for this column.
    mean    = Mean quality score value for this column.
    Q1  = 1st quartile quality score.
    med = Median quality score.
    Q3  = 3rd quartile quality score.
    IQR = Inter-Quartile range (Q3-Q1).
    lW  = 'Left-Whisker' value (for boxplotting).
    rW  = 'Right-Whisker' value (for boxplotting).
    A_Count = Count of 'A' nucleotides found in this column.
    C_Count = Count of 'C' nucleotides found in this column.
    G_Count = Count of 'G' nucleotides found in this column.
    T_Count = Count of 'T' nucleotides found in this column.
    N_Count = Count of 'N' nucleotides found in this column.
    max-count = max. number of bases (in all cycles)

The *NEW* output format:
    cycle (previously called 'column') = cycle number
    max-count
    For each nucleotide in the cycle (ALL/A/C/G/T/N):
        count   = number of bases found in this column.
        min     = Lowest quality score value found in this column.
        max     = Highest quality score value found in this column.
        sum     = Sum of quality score values for this column.
        mean    = Mean quality score value for this column.
        Q1  = 1st quartile quality score.
        med = Median quality score.
        Q3  = 3rd quartile quality score.
        IQR = Inter-Quartile range (Q3-Q1).
        lW  = 'Left-Whisker' value (for boxplotting).
        rW  = 'Right-Whisker' value (for boxplotting).
winni2k commented 7 years ago

Also, if I run fastx_quality_stats with the -N flag enabled, then I get the following output:

cycle   max_count       ALL_count       ALL_min ALL_max ALL_sum ALL_mean        ALL_Q1  ALL_med ALL_Q3  ALL_IQR ALL_lW  ALL_rW  A_count A_min   A_max   A_sum   A_mean  A_Q1    A_med   A_Q3    A_IQR   A_lW    A_rW    C_count C_min   C_max   C_sum   C_mean  C_Q1    C_med   C_Q3    C_IQR   C_lW    C_rW    G_count G_min   G_max   G_sum   G_mean  G_Q1    G_med   G_Q3    G_IQR   G_lW    G_rW    T_count T_min   T_max   T_sum   T_mean  T_Q1    T_med   T_Q3    T_IQR   T_lW    T_rW    N_count N_min   N_max   N_sum   N_mean  N_Q1    N_med   N_Q3    N_IQR   N_lW    N_rW
1       2       2       60      60      120     60.00   60      50      50      -10     75      35      2       60      60      120     60.00   60      -45     -45     -105    217     -202    0       100     -100    0       nan     100     100     100     0       100
     -100    0       100     -100    0       nan     100     100     100     0       100     -100    0       100     -100    0       nan     100     100     100     0       100     -100    0       100     -100    0       nan     100     100     100     0       100     -100

The median and quartile values still appear to be off.