ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0
374 stars 67 forks source link

--minFraction too stringent for fragmented genomes? #70

Closed wwood closed 4 years ago

wwood commented 4 years ago

Hi, thanks a report by @apcamargo at https://github.com/wwood/galah/issues/7 I came across an issue with --minFraction on these fragmented genomes. They seem to align well:

$ fastANI -q a1.fna -r 2.fna -o /dev/stdout --minFraction 0.2 2>/dev/null
1.fna   2.fna   97.4762 228 629
$ fastANI -r 1.fna -q 29.fna -o /dev/stdout 2>/dev/null
2.fna   1.fna   98.351  232 255

But when --minFraction is used the hit goes away. This is even though 232/255 > 0.5:

$ fastANI -q 1.fna -r 2.fna -o /dev/stdout --minFraction 0.5 2>/dev/null
$ fastANI -r 1.fna -q 2.fna -o /dev/stdout --minFraction 0.5 2>/dev/null
$ fastANI -q 1.fna -r 2.fna -o /dev/stdout  --minFraction 0.5 --fragLen 1000 2>/dev/null
1.fna   2.fna   98.2643 1113    2276

(galah-dev) ben@u2:~/git/galah$ fastANI --version
version 1.31
(galah-dev) ben@u2:~/git/galah/antonio$ seqstat 1.fna
seqstat - show some simple statistics on a sequence file
SQUID 1.9g (January 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Format:              FASTA
Type (of 1st seq):   DNA
Number of sequences: 369
Total # residues:    2456354
Smallest:            2028
Largest:             32450
Average length:      6656.8
(galah-dev) ben@u2:~/git/galah/antonio$ seqstat 2.fna
seqstat - show some simple statistics on a sequence file
SQUID 1.9g (January 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Format:              FASTA
Type (of 1st seq):   DNA
Number of sequences: 409
Total # residues:    1468919
Smallest:            2005
Largest:             16940
Average length:      3591.5

Filtering out this alignment by the minFraction seems incorrect to me. I wonder what the definition of the minFraction actually is. Is it the fraction of the total genome length or the fraction of the genome that is long enough to be included as a fragment, or something along those lines?

Thanks, ben

cjain7 commented 4 years ago

Check this line in the code, it should be clear from there.

cjain7 commented 4 years ago

Essentially it checks the ratio of shared genome length vs. the length of the smaller of two genomes being compared. In older versions, FastANI had an absolute cutoff on shared genome length before trusting the ANI value, which was not good as the cutoff value should ideally depend on genome lengths.

wwood commented 4 years ago

So shared over length of smallest genome, rather than shared over length of smallest genome's hashed fragments, is that right?. So if a significant portion of the smallest genome is in small contigs, a significant portion cannot count in the numerator but is included in the denominator. Does that make sense? Should it be changed somehow?

cjain7 commented 4 years ago

I think I understand your point. Are you able to attach the fna files here? I'll take a look.

wwood commented 4 years ago

I've sent the MAGs to you over email just now.

cjain7 commented 4 years ago

Apologies for the delay in my response. I've revised the master branch to fix this. It is now working on this example.

$EXE -q MAG52.fna -r MAG189.fna -o /dev/stdout --minFraction 0.5
MAG52.fna   MAG189.fna  97.4762 228 629

$EXE -r MAG52.fna -q 189.fna -o /dev/stdout --minFraction 0.5
MAG189.fna  MAG52.fna   98.351  232 255
wwood commented 4 years ago

Thanks. You can use the genomes as public test cases if that is helpful, as galah has done.


From: Chirag Jain notifications@github.com Sent: Thursday, August 20, 2020 3:22:41 PM To: ParBLiSS/FastANI FastANI@noreply.github.com Cc: Ben J Woodcroft donttrustben@gmail.com; Author author@noreply.github.com Subject: Re: [ParBLiSS/FastANI] --minFraction too stringent for fragmented genomes? (#70)

Apologies for the delay in my response. I've revised the master branch to fix this. It is now working on this example.

$EXE -q MAG52.fna -r MAG189.fna -o /dev/stdout --minFraction 0.5 BE_RX_R2_MAG52.fna BE_RX_R3_MAG189.fna 97.4762 228 629 $EXE -r MAG52.fna -q 189.fna -o /dev/stdout --minFraction 0.5 BE_RX_R3_MAG189.fna BE_RX_R2_MAG52.fna 98.351 232 255

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/ParBLiSS/FastANI/issues/70#issuecomment-677117909, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAADX5B7QRLIYROIXQNO7JTSBSXKDANCNFSM4O7A657Q.

simonrharris commented 4 years ago

Hi, would it be possible to add this fix to a release?

cjain7 commented 4 years ago

Done.