MrOlm / drep

Rapid comparison and dereplication of genomes

fastANI is not working #96

Open Biofarmer opened 3 years ago

Biofarmer commented 3 years ago

Hi Dr. Olm,

I am using version 2.6.2. When running dRep compare with --S_algorithm fastANI, I get this error:

Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
  File "/install/software/anaconda3.6.b/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
    self.compare_operation(vars(args))
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
    drep.d_workflows.compare_wrapper(kwargs['work_directory'],kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
    data_folder, wd=workDirectory, kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
    ndb = compare_genomes(bdb, algorithm, data_folder, kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
    df = run_pairwise_fastANI(genome_list, working_data_folder, kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
    exe_loc = drep.get_exe('fastANI')
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/__init__.py", line 100, in get_exe
    assert False, "{0} isn't working- make sure its installed".format(name)
AssertionError: fastANI isn't working- make sure its installed

Should I install fastANI separately? If so, how can I make sure dRep can call it? We already have FastANI 1.1 installed.

Thanks
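One quick way to see whether a separately installed fastANI is visible from the environment dRep runs in is to look it up on PATH. The following is a minimal sketch using only the Python standard library; `locate_tool` is a hypothetical helper for illustration, not part of dRep, and finding the file does not prove the program actually runs.

```python
import shutil
from typing import Optional


def locate_tool(name: str) -> Optional[str]:
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


# Executable names are case-sensitive on Linux: the binary is "fastANI".
for tool in ("mash", "fastANI"):
    path = locate_tool(tool)
    print(f"{tool}: {path if path else 'not found on PATH'}")
```

If the path printed here is not the installation you expect, the environment's PATH is the first thing to check.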

Biofarmer commented 3 years ago

Hi Matt,

I have run drep (v3.2.0) compare with fastani without multiround_primary_clustering and greedy_secondary_clustering, and it is running.

-pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI

I just noticed that during secondary clustering the cluster numbers do not run from 1 to n (in dRep 2.6.2 they run from cluster 1 to the end). May I ask what the order in which secondary clusters are run is based on? It looks random. Thanks, Wang

MrOlm commented 3 years ago

Hello,

In answer to your first questions: 1) The main difference in precision will come down to the fact that greedy_secondary_clustering requires the use of single clustering, whereas if you run it without greedy you can use other clustering algorithms like average. Here's some info on why that can matter - https://drep.readthedocs.io/en/latest/choosing_parameters.html#oddities-of-hierarchical-clustering

2) It shouldn't matter much. 10,000 seems good, but I wouldn't go over 20,000 or so (after that it will take a lot of RAM)

3) Correct; at the moment genome length is the only variable considered.

4) Yes you are correct- this should read "above or greater".

Yes, you're correct that centrifuge and checkm are not needed.

Yes, when doing greedy secondary clustering the order in which clusters are run is random I believe.

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Thanks for the useful reply. If not using greedy secondary clustering, is the order in which clusters are run also random? As you can see from the command above, greedy secondary clustering was not included.

Thanks

MrOlm commented 3 years ago

Oh yes- it looks like the code was just restructured such that it's always random now. My mistake

Biofarmer commented 3 years ago

Okay, thanks a million, Wang

Biofarmer commented 3 years ago

Hi Matt, Just out of curiosity, "querry" is the name of one column from Ndb.csv from Drep, why not "query"? Best, Wang

MrOlm commented 3 years ago

Lol, because I misspelled it when I first wrote dRep in ~2016, and it's now too intertwined in my code to easily fix it :)

Biofarmer commented 3 years ago

Okay, just take as 'query' compared to reference, right?

MrOlm commented 3 years ago

yeah exactly
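For anyone reading Ndb.csv in their own scripts, the misspelled column can simply be renamed on load. This is a minimal sketch using only the standard library; the sample rows and the `reference` and `ani` column names are assumptions for illustration, and `read_ndb_rows` is a hypothetical helper, not a dRep function.

```python
import csv
import io

# Illustrative rows only; a real Ndb.csv has more columns and real genome names.
SAMPLE_NDB = """reference,querry,ani
genomeA.fa,genomeB.fa,0.987
genomeB.fa,genomeA.fa,0.986
"""


def read_ndb_rows(handle):
    """Yield Ndb rows with the 'querry' column exposed as 'query'."""
    for row in csv.DictReader(handle):
        row["query"] = row.pop("querry")
        yield row


rows = list(read_ndb_rows(io.StringIO(SAMPLE_NDB)))
for row in rows:
    print(row["query"], "vs", row["reference"], "ANI =", row["ani"])
```

With a real file, `open("Ndb.csv", newline="")` would take the place of the in-memory sample.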

Biofarmer commented 3 years ago

Lol, thanks.

Biofarmer commented 3 years ago

Hi Matt,

As for the warning, I read 'secondary clusters that were almost different alerts the user to cases where genomes are on the edge between being considered “same” or “different”. That is, if a genome is close to one of the differentiating lines in the Primary and Secondary Clustering Dendrograms shown above.' which makes sense to me. I just want to know: for example in my warning log, 'CLUSTERING WARNING: Primary cluster 313 was almost not split' means the cluster was finally split in the result; and 'CLUSTERING WARNING: Primary cluster 373 was almost split' means the cluster was not split, right? Sorry about my poor English. In addition, the warning can be ignored anyway and the clustering result is still reliable, right?

Many thanks Wang

MrOlm commented 3 years ago

Yes, you understand correctly. And yes, you can still treat the clustering as reliable even when warnings are present; I ignore the warnings in my own research

-Matt

Biofarmer commented 3 years ago

Thanks a million, Matt. Best, Wang

Biofarmer commented 2 years ago

Hi Matt, I have sent one email to your gmail obtained from your github homepage, which is about the paper https://www.nature.com/articles/s41467-017-02018-w, may I ask whether you know it? Thanks, Wang

Biofarmer commented 2 years ago

Hi Matt, I just found a failure to make plot in dRep (v2.6.2) compare Step 4. Analyze as below:

making plots 1, 2, 3, 4 Plotting primary dendrogram Failed to make plot #1: Image size of 1000x607960 pixels is too large. It must be less than 2^16 in each direction.

I think this does not matter for the clustering results, right? Thanks Wang

MrOlm commented 2 years ago

Yes, that will not impact clustering results.

In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?

Biofarmer commented 2 years ago

Hi Matt, Thank you for your fast reply, that's great it does not impact the clustering results.

As for the other question, I have a question about the metadata for four samples and sent it to mattolm@gmail.com. Did you receive it?

Thanks Wang

Biofarmer commented 2 years ago

Sorry to follow up on this question: if the plot for the primary clusters failed, why do I still find 'Primary_clustering_dendrogram.pdf' in the 'figures' folder?

MrOlm commented 2 years ago

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.

I don't believe I got the email. Maybe try re-sending to mattolm@stanford.edu

Biofarmer commented 2 years ago

So the Primary_clustering_dendrogram.pdf in this case is not complete, right? And is this plotting failure due to the large number of genomes?

Thanks

Biofarmer commented 2 years ago

Hi Matt, do I understand correctly? Thanks, Wang

MrOlm commented 2 years ago

I think the figure is not complete, but I'm not entirely sure what happens when matplotlib encounters that problem

Biofarmer commented 2 years ago

Hi Matt, okay, thanks. I think this plotting failure is due to the large number of genomes, right? When I ran fewer genomes, the plot was made without failure.

MrOlm commented 2 years ago

Yes exactly
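The limit in the error message quoted earlier in the thread (each side must be less than 2^16 pixels) means the figure height in inches times the dpi has to stay below 65536. The sketch below shows that arithmetic in isolation. It is not how dRep sizes its dendrograms; `capped_height_inches`, the 0.2 inch per leaf, and the 100 dpi are all hypothetical values chosen for illustration.

```python
MAX_PIXELS = 2 ** 16  # limit from the error message: each side must be below this


def capped_height_inches(n_leaves: int, inches_per_leaf: float, dpi: float) -> float:
    """Height for a dendrogram figure, capped so height * dpi stays below MAX_PIXELS."""
    wanted = n_leaves * inches_per_leaf
    limit = (MAX_PIXELS - 1) / dpi
    return min(wanted, limit)


# Hypothetical numbers: 30,000 genomes, 0.2 inch per leaf, 100 dpi.
height = capped_height_inches(30_000, 0.2, 100)
print("height (inches):", height, "-> pixels:", height * 100)
```

Capping the height trades label readability for getting a figure at all; with tens of thousands of genomes the leaves would overlap heavily, so plotting a subset may be the more useful option.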

Biofarmer commented 2 years ago

Good to learn. Anyway, the unaffected clustering results are the most important. Many thanks.

Biofarmer commented 2 years ago

Hi Matt, May I ask a question about the memory used? If I run 30000 genomes at once (~90G in size) with 40 threads, how much memory and storage (considering the intermediate files) has to be used approximately? Thanks

brittanysuttner commented 1 year ago

Hello, I think I found a bug in dRep related to running it with the -l option set to less than 3000. I am trying to dereplicate a collection of viral genomes (many of them are less than 2k), so I run dRep like this:

dRep dereplicate --S_algorithm fastANI -nc .5 -l 1000 -d -sa 0.99 -N50W 0 -sizeW 1 --ignoreGenomeQuality --clusterAlg single --multiround_primary_clustering drep_pig_virus_sa99 -g genomes/*ffn

This results in the majority of my genomes failing the fastANI step, and I think it's because the default FragmentLength for fastANI is 3000 and should be lowered when running dRep with -l 1000 (maybe FragmentLength should be set equal to whatever is set for -l if it's less than 3000?). Or there should be an option in the dRep command to pass options through to the fastANI command. Thanks!
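The suggestion above (use the length cutoff as the fragment length whenever it is below fastANI's default of 3000) could be sketched roughly as follows. This only illustrates the proposal and is not dRep's implementation; `choose_frag_len` and `fastani_command` are hypothetical helpers, the file names are placeholders, and the command is built but not executed. `--fragLen` is the fastANI option that sets the fragment length.

```python
from typing import List

FASTANI_DEFAULT_FRAGLEN = 3000  # fastANI's default fragment length, as noted above


def choose_frag_len(min_genome_length: int) -> int:
    """Use the default fragment length unless the length cutoff is smaller."""
    return min(FASTANI_DEFAULT_FRAGLEN, min_genome_length)


def fastani_command(query: str, ref: str, out: str, min_genome_length: int) -> List[str]:
    """Build a fastANI command line as a list of arguments (not executed here)."""
    frag_len = choose_frag_len(min_genome_length)
    return ["fastANI", "-q", query, "-r", ref, "-o", out, "--fragLen", str(frag_len)]


# Placeholder file names; 1000 mirrors the -l 1000 setting in the command above.
print(" ".join(fastani_command("a.ffn", "b.ffn", "a_vs_b.txt", 1000)))
```

Whether ANI values computed from such short fragments are reliable for small viral genomes is a separate question that this sketch does not address.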

MrOlm commented 1 year ago

Hello,

Can you please confirm you’re running the most up-to-date versions of dRep and fastANI? I remember this being an issue that I believe I fixed

-Matt

brittanysuttner commented 1 year ago

Yes, I am running (I think) the most up-to-date versions of dRep (v3.4.0) and fastANI (v1.33). I installed dRep using pip install, so it just installed that version by default.

MrOlm commented 1 year ago

Ahhh OK I see. I used to have an option to set the fragment length, but it ended up being a problem with different versions of inStrain. I will try and address this in the next dRep update- thank you for bringing it to my attention

-MO

Gian77 commented 1 year ago

Hello, I know this is closed but I am having this weird output from dRep. dRep is telling me

...
...
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 298, in secondary_clustering
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 367, in compare_genomes
    df = drep.d_cluster.external.run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/external.py", line 93, in run_pairwise_fastANI
    exe_loc = drep.get_exe('fastANI')
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

but fastANI is there

dendropy                  4.5.2              pyh3252c3a_0    bioconda
drep                      3.4.0              pyhdfd78af_0    bioconda
expat                     2.4.4                h295c915_0    anaconda
fastani                   1.33                 h0fdf51a_0    bioconda
fftw                      3.3.9                h27cfd23_1  

and

mash.................................... all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/mash)
nucmer.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/nucmer)
checkm.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/prodigal)
centrifuge.............................. !!! ERROR !!!   (location = None)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. !!! ERROR !!!   (location = /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI)

What is happening? Maybe it doesn't like the fastANI version? Thanks much! G.

MrOlm commented 1 year ago

Hi @Gian77 -

What happens when you try the command /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h?

FastANI sometimes requires its own dependencies, which could be why this isn't working

-Matt

Gian77 commented 1 year ago

Hey @MrOlm

Thanks for your fast answer on this.

This is what comes out. What's libgsl.so.25?

(drep) [benucci@dev-intel18 DAS_Tool_bins]$ /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h
/mnt/home/benucci/anaconda2/envs/drep/bin/fastANI: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory

Gian

MrOlm commented 1 year ago

Hi Gian,

This is what I suspected. This has something to do with C++ and how fastANI is written, which is something I don't fully understand.

Someone else was able to fix the problem here - https://github.com/MrOlm/drep/issues/146

And a discussion of the problem on fastANI's gitHub can be found here - https://github.com/ParBLiSS/FastANI/issues/96

Best, Matt

Gian77 commented 1 year ago

Thanks much! I dug this out...

I was able to make fastANI work by installing the right gsl libraries in the conda environment

conda install -c conda-forge gsl=2.7=he838d99_0

dRep worked after this fix.

Gian
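To find out which shared libraries a binary such as fastANI cannot resolve, the output of `ldd` (Linux only) can be scanned for "not found" entries. A minimal sketch, assuming `ldd` is available; `missing_libraries` and `check_binary` are hypothetical helpers, and the example text is modelled on the error reported above rather than copied from a real run.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    return missing_libraries(result.stdout)


# Illustrative ldd output: one unresolved and one resolved library.
example = ("\tlibgsl.so.25 => not found\n"
           "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n")
print(missing_libraries(example))
```

Calling `check_binary` with the full path to the fastANI executable would then show whether a library such as libgsl is still missing after changing the environment.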

AlessioMilanese commented 9 months ago

Following on the discussion.

Main problem: fastANI -h returns 1, and I cannot run dRep.


I have fastANI installed and it works. I try to run:

$ fastANI -q test/2838728.3.fna -r test/2838728.3.fna -o test
$ cat test
test/2838728.3.fna      test/2838728.3.fna      100     353     357

But if I do:

$ fastANI -h
$ echo $?
1

Which basically means it returns an error when you call it with -h.


dRep cannot proceed, even if fastANI is working, error:

    exe_loc = drep.get_exe('fastANI')
  File "/home/ec2-user/Software/miniconda3/envs/drep/lib/python3.7/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

Tool version:

drep                      3.4.5              pyhdfd78af_0    bioconda
fastani                   1.1                  h4ef8376_0    bioconda

MrOlm commented 9 months ago

Hi @AlessioMilanese -

Thanks for tracking down this issue and posting here- I very much appreciate it! I will address in the next dRep update (or please feel free to submit a pull request if you're inclined). In the meantime, it seems that updating fastANI to new versions fixes this fastANI behavior.

Thanks again! Matt

AlessioMilanese commented 9 months ago

Thanks for the fast response, Matt. For some reason I cannot install a newer version of FastANI, but I will figure it out.

"I will address in the next dRep update (or please feel free to submit a pull request if you're inclined)"

I'm not sure what is the best solution. I think I would add an option (like -I) to skip the part where you check if the tool is installed and working.

MrOlm commented 9 months ago

Hi @AlessioMilanese - yeah that would be a fine solution, or (if possible) just doing a different check for fastANI that doesn't return a non-zero exit code
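One possible shape for such a check is to ignore the exit code of `-h` and only treat the tool as broken when it cannot be launched or the dynamic loader complains. This is a sketch of the idea under those assumptions, not dRep's actual `get_exe` logic; `tool_launches` is a hypothetical name, and the demonstration uses the Python interpreter as a stand-in for a tool whose help screen exits with status 1.

```python
import subprocess
import sys

LOADER_ERROR = "error while loading shared libraries"


def tool_launches(command):
    """True if the command starts and the dynamic loader does not complain.

    The exit code is deliberately ignored, since some tool versions
    return non-zero for their help screen.
    """
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, universal_newlines=True)
    except OSError:  # executable missing or not runnable
        return False
    return LOADER_ERROR not in result.stderr


# Stand-in for "fastANI -h" on a version that exits with status 1.
print(tool_launches([sys.executable, "-c", "raise SystemExit(1)"]))
# A command that does not exist is still reported as not working.
print(tool_launches(["no-such-tool-for-drep-example"]))
```

A check like this would accept the fastANI 1.1 behaviour described above while still catching the missing libgsl case, because that failure shows up in stderr rather than only in the exit code.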