I should note that I'm working off of branch research/banding_and_k32+. This is essentially a merge of refactor/hashing2 and feature/consume_bitsplit. The latter contains only additions, which should not affect the behavior of these tests.
No good idea off the top of my head. Is there a way to reproduce this with a smaller dataset/less RAM?
You would expect quite a few false positives from billions of k-mers, no?
The undercounts are MUCH more concerning to me.
> Is there a way to reproduce this with a smaller dataset/less RAM?

Yep, this is running on a large human sample. I'll do a similar comparison with E. coli and see what I come up with.

> The undercounts are MUCH more concerning to me.

Yeah, this is very surprising. Either I'm doing something horribly wrong or we've got some serious work to do.
The procedure shown above has been running overnight and is not yet finished, but here is the current distribution of false negatives being reported: column 1 is frequency (number of k-mers), column 2 is approximate abundance (khmer) minus exact abundance (jellyfish).
```
5564172 -1
 326417 -2
  19459 -3
   1889 -4
    363 -5
     82 -6
     26 -7
      5 -8
      1 -9
      2 -11
```
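For reference, this kind of difference tally can be sketched in a few lines of Python. The dict inputs here are toy stand-ins for the Jellyfish dump and the khmer count table, not real data:

```python
from collections import Counter

def diff_distribution(exact, approx):
    """Tally (approximate - exact) differences for undercounted k-mers.

    `exact` and `approx` are hypothetical dicts mapping k-mer -> count;
    in the real comparison these would come from a Jellyfish dump and a
    khmer Counttable, respectively.
    """
    dist = Counter()
    for kmer, true_count in exact.items():
        diff = approx.get(kmer, 0) - true_count
        if diff < 0:  # undercount: approximate count below the exact count
            dist[diff] += 1
    return dict(dist)

# Toy data: two k-mers undercounted by 1, one undercounted by 2.
exact = {'ACGTA': 5, 'CGTAC': 3, 'GTACG': 7, 'TACGT': 2}
approx = {'ACGTA': 4, 'CGTAC': 2, 'GTACG': 7, 'TACGT': 0}
print(diff_distribution(exact, approx))  # {-1: 2, -2: 1}
```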
> You would expect quite a few false positives from billions of k-mers, no?

Indeed, but with an FPR ≈ 0.1, would you expect anything to be overcounted by 10? 20?
> Indeed, but with an FPR ≈ 0.1, would you expect anything to be overcounted by 10? 20?

FPR -> false positive rate? I'm not sure how it's defined for counting bloom filters, but if the naming is correct, I would assume it means the filter gets a count wrong at that rate, without telling you anything about the size of the error?

> ...but doesn't tell you anything about the size of the error?
Correct, as far as I understand. That is, FPR doesn't give you any direct information about the size of the error. Depending on the size of the table, some k-mers will have more collisions than average and some will have fewer (or none). It just seems extremely improbable that all N=4 buckets associated with a k-mer would be inflated to ≈20 when in reality the k-mer occurs once in the input. Either there's a mistake in the code, or there is some effect that violates our assumption of random collisions. Both @titus and @camillescott have suggested looking at the handling of non-ACGT characters (N --> A could definitely explain something like this), but the consume_fasta functions currently discard any reads with non-ACGT.
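For context on why undercounts are so surprising: in a Count-Min-style count table, each k-mer increments one counter in each of the N tables, and the reported count is the minimum over those counters, so hash collisions can only inflate a count, never deflate it. A toy illustration (the single hash function and table sizes are simplified stand-ins, not khmer's actual implementation):

```python
# Toy Count-Min-style counter, illustrating why lookups can overcount but
# never undercount. Not khmer's actual implementation.

class TinyCountMin:
    def __init__(self, table_sizes):
        self.sizes = table_sizes
        self.tables = [[0] * size for size in table_sizes]

    def _buckets(self, kmer):
        # One hash value, reduced modulo each (pairwise coprime) table size.
        h = hash(kmer)
        return [h % size for size in self.sizes]

    def add(self, kmer):
        # Increment one counter in EVERY table.
        for table, bucket in zip(self.tables, self._buckets(kmer)):
            table[bucket] += 1

    def get(self, kmer):
        # Report the minimum over all tables. Collisions in some tables are
        # masked as long as at least one table is collision-free, so the
        # reported count is always >= the true count.
        return min(table[bucket]
                   for table, bucket in zip(self.tables, self._buckets(kmer)))

cm = TinyCountMin([97, 101, 103])
for _ in range(3):
    cm.add('GATTACA')
```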
Ok, I just re-ran this procedure on SRR030252 (an E. coli data set), using branch refactor/hashing2 this time.
```shell
fastq-dump \
    --split-files \
    --defline-seq '@$ac.$si.$sg/$ri' \
    --defline-qual '+' \
    -Z --gzip SRR030252 \
    > SRR030252.fq.gz
```
```shell
trim-low-abund.py \
    -k 17 -M 250M --variable-coverage \
    -o SRR030252.trim.fq.gz --gzip \
    SRR030252.fq.gz \
    2> SRR030252.trim.log &
```
```shell
jellyfish count \
    --mer-len=19 \
    --size=1G \
    --threads=4 \
    --output=SRR030252.k19.jellyfish.counts \
    --out-counter=1 \
    --canonical \
    --timing=SRR030252.k19.jellyfish.timing \
    --generator=<(echo 'gunzip -c SRR030252.trim.fq.gz') \
    > SRR030252.k19.jellyfish.log \
    2>&1
```
```python
>>> from __future__ import print_function
>>> import khmer
>>>
>>> ct = khmer.Counttable(19, 2.5e8 / 4, 4)
>>> ct.consume_fasta('SRR030252.trim.fq.gz')
>>> ct.save('SRR030252.k19.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)
```
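As a rough sanity check on the FPR numbers, a simplified stand-in for the collision estimate might look like this (assuming uniform hashing and that a k-mer's count is only inflated when it collides in all N tables; khmer's actual calc_expected_collisions may differ in detail):

```python
# Back-of-the-envelope collision-rate estimate for a Count-Min-style table.
# Illustrative only; not khmer's exact formula.
def expected_collision_rate(n_unique_kmers, table_size, n_tables):
    # Probability that a given slot is occupied after inserting
    # n_unique_kmers distinct k-mers into a table of table_size slots.
    occupancy = 1 - (1 - 1.0 / table_size) ** n_unique_kmers
    # A count is inflated only if the k-mer collides in ALL tables.
    return occupancy ** n_tables

# Roughly the E. coli configuration above: 250M memory split over 4 tables.
print(expected_collision_rate(10_000_000, 62_500_000, 4))
# Ten times more memory drives the rate down dramatically.
print(expected_collision_rate(10_000_000, 625_000_000, 4))
```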
Saved as compare.py, executed with:
```shell
./compare.py \
    --debug-threshold 1000000 \
    SRR030252.k19.jellyfish.counts \
    SRR030252.k19.counttable \
    > SRR030252.k19.compare.txt 2>&1
```
```python
#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import subprocess
import sys
import time

import khmer

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                         'approximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

# Load the khmer count table from disk.
start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0

# Stream exact k-mer counts from the jellyfish database.
cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)

current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start),
              file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)
    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} k-mers in {} seconds'.format(n, time.time() - allstart),
      file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)
```
Distribution of undercounts.
```shell
$ grep UNDERCOUNT SRR030252.k19.compare.txt \
    | cut -f 5 -d ' ' \
    | sort | uniq -c \
    | grep -v UNDER \
    | sort -rn -k2,2
  33460 -1
     69 -2
      2 -3
```
Distribution of overcounts.
```shell
$ grep OVERCOUNT SRR030252.k19.compare.txt \
    | cut -f 5 -d ' ' \
    | sort | uniq -c \
    | sort -n -k2,2
    784 3
     90 4
     27 5
     14 6
     23 7
     32 8
     34 9
     50 10
     52 11
     62 12
     87 13
     77 14
     67 15
     72 16
     64 17
     38 18
     31 19
     19 20
      9 21
      9 22
      5 23
      3 24
      3 25
      1 26
```
Ok, looking at 10 randomly selected undercounted k-mers (all undercounted by 1), as well as some cherry-picked k-mers undercounted by 2, the explanation is clear (and not surprising in hindsight): these k-mers all appear in reads that contain Ns and are thus discarded by khmer. When a k-mer is undercounted by 1, there is 1 read containing an N; when it's undercounted by 2, there are 2 reads with Ns. Seems consistent.
Prefiltering reads to remove those containing non-ACGT, we get no undercounts. Phew! The distribution of overcounts is identical.
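The prefiltering step can be sketched roughly as follows. The record format and helper here are illustrative (a real run would use a FASTQ parser such as screed rather than hand-rolled tuples):

```python
# Minimal prefilter sketch: drop records whose sequence contains any
# non-ACGT character before counting. Records are hypothetical
# (name, sequence, quality) tuples, not a real FASTQ parser's output.
def filter_acgt(records):
    """Yield only records whose sequence is pure ACGT."""
    allowed = set('ACGT')
    for name, seq, qual in records:
        if set(seq.upper()) <= allowed:
            yield name, seq, qual

reads = [('r1', 'ACGTACGT', 'IIIIIIII'),
         ('r2', 'ACGTNCGT', 'IIIIIIII')]  # contains an N -> dropped
kept = list(filter_acgt(reads))
print([name for name, seq, qual in kept])  # ['r1']
```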
The 250M of memory used in the E. coli example above yielded an FPR of 0.025 (less than the ≈0.1 FPR I've been using for the human data). When I increased memory to 2.5G, the FPR dropped to 5.93e-06 and ALL of the overcounts disappeared.
I've been skeptical that low FPRs could result in k-mers being overcounted by as much as 20, but I guess considering total data volume this all seems fairly reasonable.
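The back-of-the-envelope arithmetic behind that: if each distinct k-mer independently collides in all tables with probability roughly equal to the FPR, the expected number of inflated k-mers scales linearly with the number of distinct k-mers. The distinct-k-mer count below is illustrative, not measured from the runs above:

```python
# Expected number of k-mers whose count is inflated, assuming each distinct
# k-mer is affected independently with probability ~FPR. Illustrative only.
def expected_affected(n_distinct_kmers, fpr):
    return n_distinct_kmers * fpr

print(expected_affected(4_000_000, 0.025))    # on the order of 1e5 inflated k-mers
print(expected_affected(4_000_000, 5.93e-6))  # only a couple dozen at the lower FPR
```

So even a modest per-k-mer collision probability, applied across millions (or billions) of distinct k-mers, produces plenty of visibly inflated counts, and the inflation magnitude depends on the abundance of whatever happens to share the colliding buckets.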
yay w00t
Seeing the actual overcount data, I'm no longer surprised. Somehow, when you initially said FPR=0.1, I was thinking 0.1%, blerp.
-- Camille Scott
In #1590, we are removing the code that ignores sequences with Ns in them.
Given the strange results I've been getting with kevlar, I felt I had no choice but to compute exact k-mer counts for one of my human samples, compare these exact counts to the approximate counts given by khmer, and find how often big differences occur.
The procedure has not yet completed, but I'm already seeing some alarming results. I hope these are all just some big mistake on my part, but...maybe not? Just in the first part of the data, I've observed:
I don't know what to believe anymore.
Step 1: exact counts with jellyfish
Step 2: approximate counts with khmer
Step 3: compare counts