dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
752 stars 295 forks source link

Discrepancies between exact counts and approximate counts #1619

Closed standage closed 7 years ago

standage commented 7 years ago

Given the strange results I've been getting with kevlar, I felt I had no choice but to compute exact k-mer counts for one of my human samples, compare these exact counts to the approximate counts given by khmer, and find how often big differences occur.

The procedure has not yet completed, but I'm already seeing some alarming results. I hope these are all just some big mistake on my part, but...maybe not? Just in the first part of the data, I've observed:

I don't know what to believe anymore.


Step 1: exact counts with jellyfish

jellyfish count \
    --mer-len=31 \
    --size=32G \
    --threads=16 \
    --output=NA19238.k31.jellyfish.counts \
    --out-counter-len=1 \
    --canonical \
    --timing=NA19238.k31.jellyfish/timing \
    --generator=<(echo 'gunzip -c NA19238.trim.fq.gz') \
    > jellyfish.log \
    2>&1

Step 2: approximate counts with khmer

>>> from __future__ import print_function
>>> import khmer
>>>
>>> ct = khmer.Counttable(31, 64e9 / 4, 4)
>>> ct.consume_fasta('NA19238.trim.fq.gz')
>>> ct.save('NA19238.k31.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Step 3: compare counts

#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                    'appoximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0

cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)
current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)

    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} kmers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)
standage commented 7 years ago

I should note that I'm working off of branch research/banding_and_k32+. This is essentially a merge of refactor/hashing2 and feature/consume_bitsplit. The latter contains only additions, which should not affect the behavior of these tests.

betatim commented 7 years ago

No good idea off the top of my head. Is there a way to reproduce this with a smaller dataset/less RAM?

ctb commented 7 years ago

You would expect quite a few false positives from billions of k-mers, no?

The undercounts are MUCH more concerning to me.

standage commented 7 years ago

Is there a way to reproduce this with a smaller dataset/less RAM?

Yep, this is running on a large human sample. I'll do a similar comparison with E. coli and see what I come up with.

The undercounts are MUCH more concerning to me.

Yeah, this is very surprising. Either I'm doing something horribly wrong or we've got some serious work to do.

The procedure shown above has been running overnight and is not yet finished, but here is the current distribution of false negatives being reported: column 2 is approximate abundance (khmer) - exact abundance (jellyfish), column 1 is frequency (number of k-mers).

5564172 -1
 326417 -2
  19459 -3
   1889 -4
    363 -5
     82 -6
     26 -7
      5 -8
      1 -9
      2 -11
standage commented 7 years ago

You would expect quite a few false positives from billions of k-mers, no?

Indeed, but with a FPR ≈ 0.1 would you expect anything to be overcounted by 10? 20?

betatim commented 7 years ago

Indeed, but with a FPR ≈ 0.1 would you expect anything to be overcounted by 10? 20?

FPR -> false positive rate, not sure for counting bloom filters but if the naming is correct I would assume it means it gets it wrong at that rate but doesn't tell you anything about the size of the error?

standage commented 7 years ago

...but doesn't tell you anything about the size of the error?

Correct, as far as I understand. That is , FPR doesn't give you any direct information about the size of the error. Depending on the size of the table, some k-mers will have more collisions than average and some will have fewer than average (or none). It just seems extremely improbable that all N=4 buckets associated with a k-mer would be inflated to ≈20 when in reality the k-mer occurs once in the input. Either there's a mistake in the code, or there is some effect that violates our assumption of random collisions. Both @titus and @camillescott have suggested to look at handling of non-ACGT characters (N --> A could definitely explain something like this), but the consume_fasta functions currently discard any reads with non-ACGT.

standage commented 7 years ago

Ok, I just re-ran this procedure on SRR030252 (an E. coli data set). Used branch refactor/hashing2 this time.


Download and trim

fastq-dump \
        --split-files \
        --defline-seq '@$ac.$si.$sg/$ri' \
        --defline-qual '+' \
        -Z --gzip SRR030252 \
    > SRR030252.fq.gz

trim-low-abund.py \
        -k 17 -M 250M --variable-coverage \
        -o SRR030252.trim.fq.gz --gzip \
        SRR030252.fq.gz \
    2> SRR030252.trim.log &

Jellyfish

jellyfish count \
    --mer-len=19 \
    --size=1G \
    --threads=4 \
    --output=SRR030252.k19.jellyfish.counts \
    --out-counter=1 \
    --canonical \
    --timing=SRR030252.k19.jellyfish.timing \
    --generator=<(echo 'gunzip -c SRR030252.trim.fq.gz') \
    > SRR030252.k19.jellyfish.log \
    2>&1

khmer

>>> from __future__ import print_function
>>> import khmer
>>> 
>>> ct = khmer.Counttable(19, 2.5e8 / 4, 4)
>>> ct.consume_fasta('SRR030252.trim.fq.gz')
>>> ct.save('SRR030252.k19.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Comparison

Saved as compare.py, executed with:

./compare.py \
        --debug-threshold 1000000 \
        SRR030252.k19.jellyfish.counts \
        SRR030252.k19.counttable \
    > SRR030252.k19.compare.txt 2>&1
#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                    'appoximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0

cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)
current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)

    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} kmers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)

Results

Distribution of undercounts.

$ grep UNDERCOUNT SRR030252.k19.compare.txt  \
        | cut -f 5 -d ' ' \
        | sort | uniq -c \
        | grep -v UNDER \
        | sort -rn -k2,2 
  33460 -1
     69 -2
      2 -3

Distribution of overcounts.

grep OVERCOUNT SRR030252.k19.compare.txt \
        | cut -f 5 -d ' ' \
        | sort | uniq -c \
        | sort -n -k2,2
    784 3
     90 4
     27 5
     14 6
     23 7
     32 8
     34 9
     50 10
     52 11
     62 12
     87 13
     77 14
     67 15
     72 16
     64 17
     38 18
     31 19
     19 20
      9 21
      9 22
      5 23
      3 24
      3 25
      1 26
standage commented 7 years ago

Ok, looking at 10 randomly selected undercounted kmers (all undercounted by 1), as well as some cherry-picked kmers undercounted by two, the explanation is clear (and not surprising in hindsight): these k-mers are all included in reads that contain Ns and are thus discarded by khmer. When it's undercounted by 1, there is 1 read containing an N. When it's undercounted by 2, there are 2 reads with Ns. Seems consistent.

standage commented 7 years ago

Prefiltering reads to remove those containing non-ACGT, we get no undercounts. Phew! The distribution of overcounts is identical.

standage commented 7 years ago

The 250M memory used in the E. coli example above yielded an FPR of 0.025 (less than the ≈0.1 FPR I've been using for the human stuff). When I increased memory to 2.5 G the FPR dropped to 5.93e-06, and ALL of the overcounts disappeared.

I've been skeptical that low FPRs could result in k-mers being overcounted by as much as 20, but I guess considering total data volume this all seems fairly reasonable.

ctb commented 7 years ago

yay w00t

camillescott commented 7 years ago

Seeing the actual overcount data, I'm no longer surprised. Somehow, when you initially said FPR=0.1, I was thinking 0.1%, blerp.

On Tue, Feb 14, 2017 at 12:40 PM, C. Titus Brown notifications@github.com wrote:

yay w00t

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dib-lab/khmer/issues/1619#issuecomment-279829055, or mute the thread https://github.com/notifications/unsubscribe-auth/ACwxraie_mMjjVGne_v20cmdhHe0V3Pcks5rchE9gaJpZM4MAKHx .

-- Camille Scott

Graduate Group for Computer Science Lab for Data Intensive Biology University of California, Davis

camille.scott.w@gmail.com

ctb commented 7 years ago

In #1590, we are removing the code that ignores sequences with Ns in them.