ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Output is empty #49

Closed minjaekim45 closed 4 years ago

minjaekim45 commented 5 years ago

Hi Chirag, I just tried to run FastANI (both v1.1 and v1.2) on the two genomes in the data folder (E. coli and Shigella). The output file is empty, and this is the log I got:

Reference = [Shigella_flexneri_2a_01.fna]
Query = [Escherichia_coli_str_K12_MG1655.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = test.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 4918
INFO [thread 0], skch::Sketch::index, unique minimizers = 2589
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1792) ... (36, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.00427922 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0392048 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.1669e-05 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished

cjain7 commented 5 years ago

Hi Minjae, can you run some sanity checks on the two genomes, e.g., their total lengths, N50, etc.? I am wondering whether this has anything to do with the quality of the input genomes. I also recommend looking at the input parameters (available via the -h option) to see whether any of them is useful in your context.

If that doesn't help, it would be best to share the two genomes with me.
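For readers who want to run the sanity check suggested above, here is a minimal sketch that reports the number of sequences, total length, largest sequence, and N50 for a FASTA file. The function names (`read_fasta_lengths`, `n50`, `summarize`) are hypothetical helpers written for this illustration and are not part of FastANI.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in a FASTA-formatted stream."""
    lengths: List[int] = []
    current = None  # stays None until the first '>' header is seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no sequences at all


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect the basic assembly statistics in one dictionary."""
    return {
        "sequences": len(lengths),
        "total_bp": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

It could be used as `with open("genome.fna") as handle: print(summarize(read_fasta_lengths(handle)))`, where `genome.fna` stands for either input file. A result of zero sequences would suggest the file is not FASTA at all.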

minjaekim45 commented 5 years ago

Thanks! I will try to reduce the fragsize


E-co-syl commented 4 years ago

I also had the same problem with the two example files (E. coli and Shigella). I had tried to download the files from GitHub, but I don't think what I got was a FASTA file, and that may be what caused the run to fail.

It worked OK when I pulled the files from NCIMB.
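If a bad download is suspected, as described above, a quick check of the file contents can rule it in or out before running FastANI. The sketch below is a rough heuristic under stated assumptions: the function name `looks_like_nucleotide_fasta` and the 0.9 threshold are arbitrary choices for illustration, and only the characters A, C, G, T, and N are counted as valid bases.

```python
def looks_like_nucleotide_fasta(text: str, min_fraction: float = 0.9) -> bool:
    """Heuristically decide whether text is a nucleotide FASTA file.

    Assumptions (illustrative, not a formal validator): the first
    non-empty line must be a '>' header, and at least min_fraction of
    the sequence characters must be A, C, G, T, or N.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    residues = "".join(ln for ln in lines if not ln.startswith(">")).upper()
    if not residues:
        return False
    valid = sum(residues.count(base) for base in "ACGTN")
    return valid / len(residues) >= min_fraction
```

An HTML page saved in place of the raw file fails the first-line test, because it does not begin with a `>` header.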

MrTomRod commented 4 years ago

I can replicate this bug with these fastas (they have already been published on GenBank): https://cloud.roder.casa/s/cC2jMzWaNwrJCkM

$ fastANI -q FAM17927.fna -r FAM19036.fna -o out1.txt
...
$ fastANI -r FAM17927.fna -q FAM19036.fna -o out2.txt
...
$ cat out*
$
stats for FAM17927.fna
sum = 2757319, n = 49, ave = 56271.82, largest = 391390
N50 = 234031, n = 5
N60 = 230308, n = 6
N70 = 145185, n = 8
N80 = 76996, n = 10
N90 = 37143, n = 15
N100 = 305, n = 49
N_count = 7
Gaps = 7

Playing around with --fragLen didn't help. I also tried removing smaller scaffolds. Does it work on the computer FastANI was compiled on?

Recompiling didn't help either.

MrTomRod commented 4 years ago

@cjain7 - Sorry to bother you with this question, but I have a strategic decision to make about my own software. I'd like to integrate your tool because it's better and faster than the alternatives, but it has to work consistently.

Can you give me an approximate idea how long it will take you to fix this issue? (Can you reproduce the error with the files I provided?)

cjain7 commented 4 years ago

@MrTomRod Apologies for the delay in responding.

It appears that the genomes you are trying to compare are too divergent, whereas FastANI is designed for genome comparisons at ~80% identity or more.
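For anyone integrating FastANI into their own software, the explanation above suggests treating an empty output file as "no ANI reported for this pair" rather than as a crash. The following is a minimal sketch of that idea. It assumes the output is tab-delimited with five columns (query, reference, ANI, count of mapped fragments, total query fragments), which should be checked against the FastANI README for your version; `AniHit` and `parse_fastani_output` are hypothetical names.

```python
from typing import List, NamedTuple


class AniHit(NamedTuple):
    query: str
    reference: str
    ani: float
    mapped_fragments: int
    total_fragments: int


def parse_fastani_output(text: str) -> List[AniHit]:
    """Parse the contents of a FastANI output file.

    Assumed layout (verify for your FastANI version): one tab-delimited
    row per reported genome pair with the five columns named in AniHit.
    An empty file yields an empty list, which the caller can interpret
    as a pair too divergent for FastANI to report.
    """
    hits: List[AniHit] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"unexpected FastANI output row: {line!r}")
        hits.append(
            AniHit(fields[0], fields[1], float(fields[2]), int(fields[3]), int(fields[4]))
        )
    return hits
```

A caller could then branch on `if not hits:` and fall back to a slower method for that pair.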

I also checked the AAI (identity at the amino-acid level), but it looks like these two genomes are not comparable at the protein level either.

[jainc2@gry-compute050 MrTomRod]$ ../../Utility/enveomics/Scripts/aai.rb -N -1 FAM17927.fna -2 FAM19036.fna
Temporal directory: /tmp/d20200529-27816-mid2x0.
Creating databases.
  Reading FastA file: FAM17927.fna
    File contains 49 sequences.
  Reading FastA file: FAM19036.fna
    File contains 1 sequences.
Running one-way comparisons.
Insuffient hits to estimate one-way AAI: 1.
Insuffient hits to estimate one-way AAI: 1.
Insufficient hits to estimate two-way AAI: 1

[jainc2@gry-compute050 MrTomRod]$ ../../Utility/enveomics/Scripts/ani.rb -1 FAM17927.fna -2 FAM19036.fna
Temporal directory: /tmp/d20200529-27839-no6j5s.
Creating databases.
  Reading FastA file: FAM17927.fna
    Created 13581 fragments from 49 sequences, discarded 40814 bp.
  Reading FastA file: FAM19036.fna
    Created 18167 fragments from 1 sequences, discarded 0 bp.
Running one-way comparisons.
Insuffient hits to estimate one-way ANI: 9.
Insuffient hits to estimate one-way ANI: 36.
Insufficient hits to estimate two-way ANI: 4

cjain7 commented 4 years ago

If you wish to run the above comparison at your end, you are welcome to download the code here: https://github.com/lmrodriguezr/enveomics/tree/master/Scripts

The above ANI / AAI scripts use BLAST (they are slower than FastANI, but may be more accurate).

MrTomRod commented 4 years ago

Thanks a lot for the quick response!