Open Biofarmer opened 3 years ago
Hi Matt,
I have run dRep (v3.2.0) compare with fastANI, without multiround_primary_clustering and greedy_secondary_clustering, and it is running:
-pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI
I just noticed that when doing the secondary clustering, the cluster numbers do not run from 1 to n (in dRep 2.6.2 they run from cluster 1 to the end). May I ask what the order in which secondary clusters are run is based on? It looks random. Thanks, Wang
Hello,
In answer to your first questions: 1) The main difference in precision comes down to the fact that greedy_secondary_clustering requires the use of single clustering, whereas if you run without greedy you can use other clustering algorithms like average. Here's some info on why that can matter - https://drep.readthedocs.io/en/latest/choosing_parameters.html#oddities-of-hierarchical-clustering
2) It shouldn't matter much. 10,000 seems good, but I wouldn't go over 20,000 or so (beyond that it will take a lot of RAM)
3) Correct; at the moment genome length is the only variable considered.
4) Yes you are correct- this should read "above or greater".
Yes, you're correct that centrifuge and checkm are not needed.
Yes, when doing greedy secondary clustering the order in which clusters are run is random I believe.
Best, Matt
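The linkage difference behind point 1 can be sketched with a toy example (a scipy-based illustration, not dRep's code; the distance matrix here is invented):

```python
# Toy illustration (not dRep's code): how single vs. average linkage can
# cluster the same pairwise distances (1 - ANI) differently.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Four genomes: A-B close, C-D close, with a moderate "chain" between
# B and C. Distances are invented for illustration.
D = np.array([
    [0.00, 0.02, 0.04, 0.10],
    [0.02, 0.00, 0.04, 0.10],
    [0.04, 0.04, 0.00, 0.02],
    [0.10, 0.10, 0.02, 0.00],
])
condensed = squareform(D)

# Single linkage chains all four genomes into one cluster at a 0.05 cutoff,
# because the closest pair between {A,B} and {C,D} is only 0.04 apart...
single = fcluster(linkage(condensed, method='single'),
                  t=0.05, criterion='distance')
# ...while average linkage keeps {A,B} and {C,D} separate, because the
# average inter-cluster distance (0.07) exceeds the cutoff.
average = fcluster(linkage(condensed, method='average'),
                   t=0.05, criterion='distance')
print(len(set(single)), len(set(average)))
```

This chaining behavior is exactly the kind of oddity the linked docs page discusses.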
Hi Matt,
Thanks for the useful reply. If not using greedy secondary clustering, is the order in which clusters are run also random? As you can see from the command above, greedy secondary clustering is not included.
Thanks
Oh yes- it looks like the code was just restructured such that it's always random now. My mistake
Okay, thanks a million, Wang
Hi Matt, Just out of curiosity, "querry" is the name of one column from Ndb.csv from Drep, why not "query"? Best, Wang
Lol, because I misspelled it when I first wrote dRep in ~2016, and it's now too intertwined in my code to easily fix :)
Okay, so I should just read it as 'query' compared to reference, right?
yeah exactly
Lol, thanks.
Hi Matt,
As for the warning, I read: 'secondary clusters that were almost different alerts the user to cases where genomes are on the edge between being considered "same" or "different". That is, if a genome is close to one of the differentiating lines in the Primary and Secondary Clustering Dendrograms shown above.' This makes sense to me. I just want to confirm: in my warning log, for example, 'CLUSTERING WARNING: Primary cluster 313 was almost not split' means the cluster was split in the final result, and 'CLUSTERING WARNING: Primary cluster 373 was almost split' means the cluster was not split, right? Sorry about my poor English. In addition, the warnings can be ignored and the clustering result is still reliable, right?
Many thanks Wang
Yes, your understanding is correct. And yes, you can still treat the clustering as reliable even when warnings are present; I ignore the warnings in my own research.
-Matt
Thanks a million, Matt. Best, Wang
Hi Matt, I have sent an email to your Gmail address (obtained from your GitHub homepage) about the paper https://www.nature.com/articles/s41467-017-02018-w - may I ask whether you have seen it? Thanks, Wang
Hi Matt, I just found a failure to make a plot in dRep (v2.6.2) compare, Step 4 (Analyze), as below:
making plots 1, 2, 3, 4
Plotting primary dendrogram
Failed to make plot #1: Image size of 1000x607960 pixels is too large. It must be less than 2^16 in each direction.
I think this does not matter for the clustering results, right? Thanks, Wang
Yes, that will not impact clustering results.
In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?
Hi Matt, thank you for your fast reply - that's great that it does not impact the clustering results.
As for the other question, I have a question about the metadata for four samples and sent it to mattolm@gmail.com - did you receive it?
Thanks Wang
Sorry to follow up on this question: given that making the primary cluster plot failed, why do I still find 'Primary_clustering_dendrogram.pdf' in the 'figures' folder?
The figure in that .pdf will just be broken; when the .pdf creation crashes partway through (as it did in this case), it still leaves some junk behind.
I don't believe I got the email. Maybe try re-sending to mattolm@stanford.edu
So the Primary_clustering_dendrogram.pdf in this case is not complete, right? And is this failure to make the plot due to too large a number of genomes?
Thanks
Hi Matt, do I understand correctly? Thanks, Wang
I think the figure is not complete, but I'm not entirely sure what happens when matplotlib encounters that problem.
Hi Matt, okay, thanks. I think this failure to make the plot is due to too large a number of genomes, right? When I ran fewer genomes, this plot was made without failure.
Yes exactly
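For context, the failure comes from matplotlib's hard limit of 2^16 pixels per image dimension (quoted in the error above). A sketch of how a dendrogram's height outgrows that limit and how a cap would avoid the crash (the per-genome height and dpi are assumptions, not dRep's actual values):

```python
# Sketch (assumptions, not dRep's code): matplotlib refuses to render
# images of 2**16 pixels or more in either dimension. A dendrogram whose
# height scales with genome count quickly exceeds that, so a cap helps.
MAX_PIXELS = 2 ** 16  # matplotlib's per-dimension pixel limit

def dendrogram_height_inches(n_genomes, per_genome=0.3, dpi=100):
    """Scale figure height with genome count, but stay under the limit."""
    desired = n_genomes * per_genome
    cap = (MAX_PIXELS - 1) / dpi  # tallest height that still renders
    return min(desired, cap)

# With many genomes, the desired height in pixels blows past 2**16
# (20,000 genomes at 0.3 in/genome and 100 dpi would be 600,000 px),
# so the cap kicks in:
print(dendrogram_height_inches(20000))  # capped near 655 inches
print(dendrogram_height_inches(100))    # uncapped
```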
Good to know. Anyway, the unaffected clustering results are the most important thing. Many thanks.
Hi Matt, may I ask a question about memory usage? If I run 30,000 genomes at once (~90 GB in size) with 40 threads, approximately how much memory and storage (considering the intermediate files) will be used? Thanks
Hello, I think I found a bug in dRep related to running it with the -l option set to less than 3000. I am trying to dereplicate a collection of viral genomes (many of them less than 2 kb), so I run dRep as such:
> dRep dereplicate --S_algorithm fastANI -nc .5 -l 1000 -d -sa 0.99 -N50W 0 -sizeW 1 --ignoreGenomeQuality --clusterAlg single --multiround_primary_clustering drep_pig_virus_sa99 -g genomes/*ffn
This results in the majority of my genomes failing the fastANI step, and I think it's because the default fragment length for fastANI is 3000, which should be lowered when running dRep with -l 1000 (maybe the fragment length should be set equal to whatever is set for -l if it's less than 3000?). Or there should be an option in the dRep command to pass options to the fastANI command. Thanks!
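A sketch of the suggested workaround: cap the fragment length at dRep's -l value when building the fastANI call (--fragLen is fastANI's own flag; choose_frag_len and the file names are hypothetical):

```python
# Sketch of the workaround suggested above: never ask fastANI for
# fragments longer than the shortest genome dRep will accept (-l).
# choose_frag_len and the file names are hypothetical; --fragLen is
# a real fastANI option (default 3000).
import shlex

DEFAULT_FRAG_LEN = 3000  # fastANI's default fragment length

def choose_frag_len(min_genome_length: int) -> int:
    """Use a fragment no longer than the shortest genome allowed in."""
    return min(DEFAULT_FRAG_LEN, min_genome_length)

def fastani_cmd(query: str, ref: str, out: str,
                min_genome_length: int) -> str:
    frag = choose_frag_len(min_genome_length)
    return shlex.join([
        'fastANI', '-q', query, '-r', ref, '-o', out,
        '--fragLen', str(frag),
    ])

# With -l 1000, 1000 bp fragments are requested instead of 3000:
print(fastani_cmd('a.fna', 'b.fna', 'out.tsv', 1000))
# -> fastANI -q a.fna -r b.fna -o out.tsv --fragLen 1000
```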
Hello,
Can you please confirm you’re running the most up-to-date versions of dRep and fastANI? I remember this being an issue that I believe I fixed
-Matt
Yes, I think I am running the most up-to-date versions of dRep (v3.4.0) and fastANI (v1.33). I installed dRep using pip, so it just installed that version by default.
Ahhh, OK, I see. I used to have an option to set the fragment length, but it ended up causing problems with different versions of inStrain. I will try to address this in the next dRep update - thank you for bringing it to my attention.
-MO
Hello, I know this is closed, but I am getting this weird output from dRep. dRep is telling me:
...
...
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 298, in secondary_clustering
ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 367, in compare_genomes
df = drep.d_cluster.external.run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/external.py", line 93, in run_pairwise_fastANI
exe_loc = drep.get_exe('fastANI')
File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/__init__.py", line 100, in get_exe
raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed
but fastANI is there:
dendropy 4.5.2 pyh3252c3a_0 bioconda
drep 3.4.0 pyhdfd78af_0 bioconda
expat 2.4.4 h295c915_0 anaconda
fastani 1.33 h0fdf51a_0 bioconda
fftw 3.3.9 h27cfd23_1
and
mash.................................... all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/mash)
nucmer.................................. all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/nucmer)
checkm.................................. all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /mnt/home/benucci/anaconda2/envs/drep/bin/prodigal)
centrifuge.............................. !!! ERROR !!! (location = None)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. !!! ERROR !!! (location = /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI)
What is happening? Maybe it doesn't like the fastANI version? Thanks much! G.
Hi @Gian77 -
What happens when you try the command /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h?
FastANI sometimes requires its own dependencies, which could be why this isn't working
-Matt
Hey @MrOlm
Thanks for your fast answer on this.
This is what comes out - what's libgsl.so.25?
(drep) [benucci@dev-intel18 DAS_Tool_bins]$ /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h
/mnt/home/benucci/anaconda2/envs/drep/bin/fastANI: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory
Gian
Hi Gian,
This is what I suspected. This has something to do with C++ and how fastANI is written, which is something I don't fully understand.
Someone else was able to fix the problem here - https://github.com/MrOlm/drep/issues/146
And a discussion of the problem on fastANI's gitHub can be found here - https://github.com/ParBLiSS/FastANI/issues/96
Best, Matt
Thanks much! I dug this out...
I was able to make fastANI work by installing the right gsl libraries in the conda environment:
conda install -c conda-forge gsl=2.7=he838d99_0
dRep worked after this fix.
Gian
Following on the discussion.
Main problem: fastANI -h returns 1, and I cannot run dRep.
I have fastANI installed and it works. I try to run:
$ fastANI -q test/2838728.3.fna -r test/2838728.3.fna -o test
$ cat test
test/2838728.3.fna test/2838728.3.fna 100 353 357
But if I do:
$ fastANI -h
$ echo $?
1
Which basically means it returns an error when you call it with -h.
dRep cannot proceed, even if fastANI is working, error:
exe_loc = drep.get_exe('fastANI')
File "/home/ec2-user/Software/miniconda3/envs/drep/lib/python3.7/site-packages/drep/__init__.py", line 100, in get_exe
raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed
Tool version:
drep 3.4.5 pyhdfd78af_0 bioconda
fastani 1.1 h4ef8376_0 bioconda
Hi @AlessioMilanese -
Thanks for tracking down this issue and posting here- I very much appreciate it! I will address in the next dRep update (or please feel free to submit a pull request if you're inclined). In the meantime, it seems that updating fastANI to new versions fixes this fastANI behavior.
Thanks again! Matt
Thanks for the fast response, Matt. For some reason I cannot install a newer version of fastANI, but I will figure it out.
I will address in the next dRep update (or please feel free to submit a pull request if you're inclined)
I'm not sure what the best solution is. I think I would add an option (like -I) to skip the part where you check whether the tool is installed and working.
Hi @AlessioMilanese - yeah, that would be a fine solution, or (if possible) just doing a different check for fastANI that doesn't rely on a 0 exit code
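A sketch of such a lenient check (hypothetical, not dRep's actual code): accept the executable as long as -h produces any output, even when the exit code is nonzero, as with fastANI 1.1:

```python
# Hypothetical sketch (not dRep's actual get_exe): treat a tool as
# working if invoking it produces any output, even on a nonzero exit
# code - which is how fastANI 1.1 behaves with -h.
import subprocess

def exe_works(argv) -> bool:
    """Return True if running argv prints anything, regardless of exit code."""
    try:
        result = subprocess.run(argv, capture_output=True,
                                text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    # Heuristic: usage text on stdout OR stderr counts as "working",
    # so a nonzero exit code alone does not fail the check.
    return bool(result.stdout.strip() or result.stderr.strip())

# Simulate fastANI 1.1: prints usage to stderr but exits 1.
print(exe_works(['sh', '-c', 'echo "usage: fastANI ..." >&2; exit 1']))  # True
# A genuinely missing executable is still caught:
print(exe_works(['definitely-not-installed-xyz']))  # False
```

The trade-off of this heuristic is that a tool which runs successfully but prints nothing would be flagged as broken, so it suits help-text probes rather than general commands.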
Hi Dr. Olm,
I am using version 2.6.2. When running dRep compare with --S_algorithm fastANI, there is an error:
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
File "/install/software/anaconda3.6.b/bin/dRep", line 33, in
controller.parseArguments(args)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
self.compare_operation(vars(args))
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
drep.d_cluster.d_cluster_wrapper(wd, kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
data_folder, wd=workDirectory, kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
ndb = compare_genomes(bdb, algorithm, data_folder, kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
df = run_pairwise_fastANI(genome_list, working_data_folder, kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
exe_loc = drep.get_exe('fastANI')
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/__init__.py", line 100, in get_exe
assert False, "{0} isn't working- make sure its installed".format(name)
AssertionError: fastANI isn't working- make sure its installed
May I ask whether I should install fastANI separately? If so, how can I make sure that dRep can call it? We already have fastANI 1.1 installed.
Thanks