ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0
374 stars 67 forks source link

Trouble with --minFraction #63

Closed wwood closed 4 years ago

wwood commented 4 years ago

Hi,

I'm running fastANI 1.3 from bioconda between 2 short sequences, each in their own file - see https://gist.github.com/wwood/d47dfa28748edcbfcbede4453d76cf7f

These sequences are nearly identical, except one is ~14kb while the other is ~15kb.

Strangely, if I turn up the --minFraction, the alignment of the first against the second gives no result, while the reverse does give the alignment:

(galah-dev) uqbwoodc@deakin:20200430:/tmp/fastani_error$ fastANI -q sequence1.fna -r sequence2.fna -o /dev/stdout --minFraction 0.8
>>>>>>>>>>>>>>>>>>
Reference = [sequence2.fna]
Query = [sequence1.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /dev/stdout
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 1220
INFO [thread 0], skch::Sketch::index, unique minimizers = 1220
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1220) ... (1, 1220)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.00113045 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.00207075 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.2364e-05 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
(galah-dev) uqbwoodc@deakin:20200430:/tmp/fastani_error$ fastANI -q sequence2.fna -r sequence1.fna -o /dev/stdout --minFraction 0.8
>>>>>>>>>>>>>>>>>>
Reference = [sequence1.fna]
Query = [sequence2.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /dev/stdout
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 1122
INFO [thread 0], skch::Sketch::index, unique minimizers = 1122
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1122) ... (1, 1122)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.000953353 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.00242072 sec
INFO [thread 0], skch::main, Time spent post mapping : 7.913e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
sequence2.fna   sequence1.fna   99.9734 4   5

Any ideas? Thanks in advance. ben

cjain7 commented 4 years ago

I'm using the latest source code from master branch. At my end, I see the following behavior:

[jainc2@gry-compute050 small_seq]$ ../../fastANI -q seq1.fa -r seq2.fa -o /dev/stdout
>>>>>>>>>>>>>>>>>>
Reference = [seq2.fa]
Query = [seq1.fa]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /dev/stdout
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 1220
INFO [thread 0], skch::Sketch::index, unique minimizers = 1220
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1220) ... (1, 1220)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.00153976 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.00268185 sec
INFO [thread 0], skch::main, Time spent post mapping : 7.12e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
seq1.fa seq2.fa 99.9739 3   4

Reversed:

[jainc2@gry-compute050 small_seq]$ ../../fastANI -r seq1.fa -q seq2.fa -o /dev/stdout
>>>>>>>>>>>>>>>>>>
Reference = [seq1.fa]
Query = [seq2.fa]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = /dev/stdout
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling  = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 1122
INFO [thread 0], skch::Sketch::index, unique minimizers = 1122
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1122) ... (1, 1122)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.00141419 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.00310046 sec
INFO [thread 0], skch::main, Time spent post mapping : 7.211e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
seq2.fa seq1.fa 99.9734 4   5

Note that ANI values are slightly different. This is expected (see #54).

cjain7 commented 4 years ago

Oops I didn't see that you are turning up the default --minFraction

The difference comes from the fact that shared genome length (e.g., 3 vs. 4 fragments) differs slightly in both cases. Again, this is due to asymmetry in FastANI, order of two genomes can lead to small differences. One solution could be to generate output in .matrix format using --matrix parameter. That would give you consistent results.

wwood commented 4 years ago

OK, thanks, makes sense. I can rescue it by turning the fragLength to 1000bp, for instance.