MrOlm / drep

Rapid comparison and dereplication of genomes

fastANI is not working #96

Open Biofarmer opened 3 years ago

Biofarmer commented 3 years ago

Hi Dr. Olm,

I am using version 2.6.2. When running dRep compare with --S_algorithm fastANI, I get this error:

    Clustering Step 1. Parse Arguments
    Clustering Step 2. Perform MASH (primary) clustering
    2a. Run pair-wise MASH clustering
    2b. Cluster pair-wise MASH clustering
    3355 primary clusters made
    Step 3. Perform secondary clustering
    Running 8999390 fastANI comparisons- should take ~ 1200.5 min
    Traceback (most recent call last):
      File "/install/software/anaconda3.6.b/bin/dRep", line 33, in <module>
        controller.parseArguments(args)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
        self.compare_operation(vars(args))
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
        drep.d_workflows.compare_wrapper(kwargs['work_directory'], **kwargs)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
        drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
        data_folder, wd=workDirectory, **kwargs)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
        ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
        df = run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
        exe_loc = drep.get_exe('fastANI')
      File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/__init__.py", line 100, in get_exe
        assert False, "{0} isn't working- make sure its installed".format(name)
    AssertionError: fastANI isn't working- make sure its installed

May I ask whether I should install fastANI separately? If yes, how can I make sure dRep can call it? We already have FastANI 1.1 installed.

Thanks

MrOlm commented 3 years ago

Yes, you need to install it separately and make sure it's in your system PATH. Running dRep bonus test --check_dependencies will tell you which programs dRep is able to confirm are working correctly.
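For example, something along these lines would make a separately installed fastANI visible to dRep (the install path here is just an example; adjust it to wherever fastANI actually lives on your system):

    # Add the directory containing the fastANI binary to PATH (example path only)
    export PATH=/install/software/fastani/1.32:$PATH
    which fastANI                              # should now print the fastANI location
    dRep bonus test --check_dependencies       # re-check what dRep can find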

Biofarmer commented 3 years ago

Hi Dr. Olm, thanks for the reply. I just ran that command and see:

    mash.................................... all good        (location = /install/software/anaconda3.6.b/bin/mash)
    nucmer.................................. all good        (location = /install/software/anaconda3.6.b/bin/nucmer)
    checkm.................................. all good        (location = /install/software/anaconda3.6.b/bin/checkm)
    ANIcalculator........................... !!! ERROR !!!   (location = None)
    prodigal................................ all good        (location = /install/software/anaconda3.6.b/bin/prodigal)
    centrifuge.............................. all good        (location = /install/software/anaconda3.6.b/bin/centrifuge)
    nsimscan................................ !!! ERROR !!!   (location = None)
    fastANI................................. !!! ERROR !!!   (location = None)

May I ask, if I want to use fastANI as the S_algorithm with dRep compare or dereplicate, whether all three "ERROR" tools need to be installed? fastANI, for sure. But what about the other two? If they are needed, are the following links the correct places to download them? For ANIcalculator: https://ani.jgi.doe.gov/html/download.php? For nsimscan: https://github.com/abadona/qsimscan?

Many thanks for your reply.

MrOlm commented 3 years ago

Hello,

Nope, to use fastANI you only need to fix the fastANI error. The other two are used for other S_algorithms.

Biofarmer commented 3 years ago

Hi Dr. Olm, it is great to know that. I have now fixed only the fastANI issue, and the check looks good:

    mash.................................... all good        (location = /install/software/anaconda3.6.b/bin/mash)
    nucmer.................................. all good        (location = /install/software/anaconda3.6.b/bin/nucmer)
    checkm.................................. all good        (location = /install/software/anaconda3.6.b/bin/checkm)
    ANIcalculator........................... !!! ERROR !!!   (location = None)
    prodigal................................ all good        (location = /install/software/anaconda3.6.b/bin/prodigal)
    centrifuge.............................. all good        (location = /install/software/anaconda3.6.b/bin/centrifuge)
    nsimscan................................ !!! ERROR !!!   (location = None)
    fastANI................................. all good        (location = /install/software/fastani/1.32/fastANI)

I have run dRep compare with fastANI and noticed that fastANI is being called with '--minFraction 0', even though the default for '--minFraction' is 0.2. May I ask whether '--minFraction' is indeed meant to be 0, or am I misunderstanding something?

    11-09 17:51 DEBUG running cluster 1
    11-09 17:51 DEBUG /install/software/fastani/1.32/fastANI --ql /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList --rl /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList -o /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/fastANI_out_tqcrxtgsrr --matrix -t 50 --minFraction 0 tqcrxtgsrr

Thanks

MrOlm commented 3 years ago

Hello,

Yes, dRep's --minFraction is not used at the stage of the program shown in that log line (fastANI itself is run with --minFraction 0). It's applied at a later stage, when the clustering happens. The default is indeed 0.2, not 0.

Biofarmer commented 3 years ago

Hi, thanks for the reply. It is good to know the default value is used. I will check it when the run is finished. Thanks again.

Biofarmer commented 3 years ago

Hi Dr. Olm, I have run dRep compare with fastANI using 50 processors, and the log indicates that 1200.5 min is needed, as below:

    11-09 17:51 INFO 3355 primary clusters made
    11-09 17:51 INFO Step 3. Perform secondary clustering
    11-09 17:51 INFO Running 8999390 fastANI comparisons- should take ~ 1200.5 min

However, it has already been running for 45 h (2700 min) and the job is still going. May I ask whether the time indicated in the log is just an estimate and the real time may be longer than that?

Thanks

MrOlm commented 3 years ago

Hello,

Yes, it is indeed just an estimate. Depending on factors like genome size, the specifics of the genomes being compared, and the number of cores being used, the actual time can vary quite a bit. The parallelization also doesn't scale perfectly: 50 cores will not be twice as fast as 25 cores, but the simple formula used to estimate how long the job will take doesn't account for this.

Best, Matt

Biofarmer commented 3 years ago

Hi Dr. Olm, good news: I have finished my run with 20,000 genomes, using -pa 0.9 -sa 0.95 -nc 0.30 -cm larger.

  1. However, there is an error for one cluster:

    Running 8999390 fastANI comparisons- should take ~ 1200.5 min
    CRITICAL ERROR WITH SECONDARY CLUSTERING CODE hdzdcskeha; SKIPPING
    CRITICAL ERROR WITH PRIMARY CLUSTER 592; TRYING AGAIN
    CRITICAL ERROR WITH SECONDARY CLUSTERING CODE geyhqnsofz; SKIPPING
    DOUBLE CRITICAL ERROR AGAIN WITH PRIMARY CLUSTER 592; SKIPPING

In Cdb.csv, there is no record for cluster 592. May I ask why?

  2. For the dRep compare Step 4 (Analyze) part, there are many lines like the below in my error log:

    Plotting MDS plot
    /install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning: divide by zero encountered in double_scalars
      old_stress = stress / dis
    /install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning: invalid value encountered in double_scalars
      if(old_stress - stress / dis) < eps:
    /install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning: invalid value encountered in double_scalars
      old_stress = stress / dis
    /install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning: invalid value encountered in double_scalars
      if(old_stress - stress / dis) < eps:
    ...

May I ask why this happens? I think these warnings have no impact on the clustering, and the files in data_tables are fine for subsequent analysis, right?

  3. Is the value (0.30) I set with -nc applied to --minFraction in fastANI? I ask because I need the alignment fraction (AF) for subsequent analysis. Where can I check the value used for --minFraction in fastANI?

Sorry about the many questions, and thanks for your time.

MrOlm commented 3 years ago

Hello,

1) This is because FastANI is a somewhat finicky program, and it didn't like something about one of those genomes. Try and run fastANI on those genomes outside of dRep to see what the problem is. dRep runs FastANI like below, where genomeList is a text file listing the genome locations.

$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Alternatively, you could run dRep again in debug mode (add -d), which will store all the logs that FastANI creates so you can see what the error is (stored in log/cmd_logs/). I must warn you though, fastANI often crashes without an easily interpretable error, which makes it difficult to know what the problem is.

2) Yeah, don't worry about those; it's just scikit-learn complaining during the MDS plotting about things that might be problems but really aren't.

3) No; -nc is applied during the clustering step. You can see the AF values for all comparisons in the Ndb.csv file in the data_tables folder. During the clustering step, if the AF of a comparison is less than -nc, dRep replaces the actual ANI value with 0. This 0 isn't stored in Ndb.csv (so you can still see the reported ANI), but when the clustering happens that comparison is treated as an ANI of 0.
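If it helps, here is a rough way to pull those AF values out of Ndb.csv from the command line (run from inside your work directory; the AF column name is an assumption, so check the header of your own Ndb.csv first):

    # Inspect the real column names first
    head -n 1 data_tables/Ndb.csv
    # List comparisons whose alignment fraction falls below the -nc threshold (0.30 here);
    # "alignment_coverage" is an assumed column name
    awk -F',' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; next}
               $col["alignment_coverage"] + 0 < 0.30' data_tables/Ndb.csv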

Best, Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

Thanks for the reply. May I ask a few follow-up questions?

  1. Where can I find the genomes that belong to cluster 592? I only see one genomeList in the tmp folder. With that I could run fastANI independently to see where the problem is.

  2. That's great, thanks.

  3. Based on my understanding, the AF used for secondary clustering comes from -nc, right? When running fastANI, --minFraction is set to 0, and the AF is then calculated from the fastANI output for each comparison. dRep then uses -nc during secondary clustering based on the calculated AF, right?

Thanks.

MrOlm commented 3 years ago

Hello,

1) It should just be the genomes that are in the datatable Bdb.csv (which contains all filtered genomes) but are not in Cdb.csv (which has the cluster assignments of all genomes).
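A quick sketch of how to pull those genomes out on the command line (this assumes the genome name is the first comma-separated field in both tables; check the headers of your own files first):

    cut -d',' -f1 data_tables/Bdb.csv | tail -n +2 | sort > bdb_genomes.txt
    cut -d',' -f1 data_tables/Cdb.csv | tail -n +2 | sort > cdb_genomes.txt
    comm -23 bdb_genomes.txt cdb_genomes.txt   # genomes only in Bdb.csv = the ones that failed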

3) That is mostly correct. You are right that AF is reported and calculated by fastANI with --minFraction set to 0. However, dRep uses ANI to do the secondary clustering, not AF; any comparison with an AF below -nc simply has its ANI set to 0 by dRep.

-Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

Thank you for reply.

  1. Yes, but all the genomes are in the datatable Bdb.csv, right? How can I know which genomes belong to cluster 592? I think I only need to re-run fastANI on this cluster to check where the problem is, am I right?

  2. Thanks, it makes sense to me now.

Sorry, may I introduce a new question?

3.1 Is the final number of clusters given by the "secondary_cluster" column in Cdb.csv? The number of unique clusters in "secondary_cluster" is 4546, and in "primary_cluster" it is 3354. It is expected that "secondary_cluster" gives a higher number of clusters, right?

3.2 I ran the same genomes with -pa 0.95 -sa 0.95 -nc 0.30 -cm larger, because I was curious how many clusters Mash produces at 0.95; it results in 4081 clusters. I understand the problem with Mash is that it overestimates the distance between incomplete genomes and splits same-species genomes into multiple clusters, which should mean a higher number of clusters from Mash itself. So why is the number of clusters I got with -pa 0.95 from Mash (the first step of dRep) lower than the number of clusters I got from dRep with -sa 0.95? Or can I not compare them directly like this?

Thanks

MrOlm commented 3 years ago

Hello,

1) The genomes that ARE in Bdb.csv and ARE NOT in Cdb.csv will be the ones where fastANI failed.

3.1) Yes, your understanding is correct.

3.2) I'm a little confused by the question but can explain a few things. The purpose of primary clustering (done with Mash, which is fast) is to reduce the number of comparisons that have to be done with fastANI (which is slow). The only genomes that are ever compared with FastANI are those that are in the same primary cluster by Mash; genomes in different primary clusters will never be in the same secondary clusters. Mash is not very accurate however (especially with incomplete genomes), so that's why by default dRep casts a wide net with Primary clustering, and then makes secondary clusters with fastANI.

If you'd like to compare Mash and fastANI directly, you'll need to run dRep once with -pa 0.95 --SkipSecondary (Mash only) and once with -sa 0.95 --SkipMash (fastANI only). I'll warn you though, this second command will be very slow (because it will not have the primary clustering speeding things up).
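Written out in full, the two runs would look something like this (output directory names and the genome list are placeholders):

    # Mash only: primary clustering at 95%, no secondary (fastANI) step -- fast
    dRep compare mash_only_out -p 50 -pa 0.95 --SkipSecondary -g genome_path.txt
    # fastANI only: every genome pair compared with fastANI at 95% -- very slow
    dRep compare fastani_only_out -p 50 -sa 0.95 --S_algorithm fastANI --SkipMash -g genome_path.txt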

A bit more detail on this can be found here: https://drep.readthedocs.io/en/latest/choosing_parameters.html

-Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

  1. Sorry, I understand now. I found that cluster 592 has only one genome, and I ran fastANI on it as below:

    fastANI --ql genomeList --rl genomeList -o fastANI_out_592 --matrix -t 50 --minFraction 0

It ran without error:

    Reference = [bin.7.fna]
    Query = [bin.7.fna]
    Kmer size = 16
    Fragment length = 3000
    Threads = 50
    ANI output file = fastANI_out_592

    INFO [thread 0], skch::main, Count of threads executing parallel_for : 50
    INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
    INFO [thread 18], skch::main, ready to exit the loop
    ... (one "ready to exit the loop" line per thread) ...
    INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 118228
    INFO [thread 0], skch::Sketch::index, unique minimizers = 115202
    INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 112667) ... (14, 1)
    INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
    INFO [thread 0], skch::main, Time spent sketching the reference : 0.161255 sec
    INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0048895 sec
    INFO [thread 0], skch::main, Time spent post mapping : 1.052e-06 sec
    INFO [thread 0], skch::main, ready to exit the loop
    INFO, skch::main, parallel_for execution finished

However, there is nothing in fastANI_out_592 output. I think there should be one comparison inside.

I ran another cluster that also has only one genome and that worked fine with dRep; there is one comparison in its fastANI_out, and the log is the same as above. I am confused: is it due to the bin file itself? Could you have a look at where the problem is?

Thanks

MrOlm commented 3 years ago

...interesting. Are you sure that cluster 592 only has a single genome in it? The size of Bdb.csv (measured with the command cat Bdb.csv | wc -l) is only 1 line longer than Cdb.csv (assessed with the command cat Cdb.csv | wc -l)? The reason I ask is that secondary comparisons are not usually run on primary clusters that only have a single genome in them, so it's a bit surprising to me.

Thank you, Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

Yes, there is only a one-genome difference between Bdb.csv (20,001 lines) and Cdb.csv (20,000 lines). I ran 20,000 genomes together, so with the header line Bdb.csv has 20,001 lines. Indeed, I have many fastANI outputs from single-genome clusters in the /data/fastANI_files/ folder.

The first rows of Cdb.csv:

genome | secondary_cluster | threshold | cluster_method | comparison_algorithm | primary_cluster
-- | -- | -- | -- | -- | --
GCF_000711905.1_ASM71190v1_genomic.fna | 1_0 | 0.05 | average | fastANI | 1
GCF_002002665.1_ASM200266v1_genomic.fna | 2_0 | 0.05 | average | fastANI | 2
GCF_902158755.1_SYNGOMJ08_genomic.fna | 3_0 | 0.05 | average | fastANI | 3
GCF_000008665.1_ASM866v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_000734035.1_ASM73403v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_001610875.1_ASM161087v1_genomic.fna | 5_0 | 0.05 | average | fastANI | 5
GCF_001654775.1_ASM165477v1_genomic.fna | 6_0 | 0.05 | average | fastANI | 6
GCF_002212725.1_ASM221272v1_genomic.fna | 7_0 | 0.05 | average | fastANI | 7
GCF_006740685.1_ASM674068v1_genomic.fna | 8_0 | 0.05 | average | fastANI | 8
GCF_014078535.1_ASM1407853v1_genomic.fna | 9_0 | 0.05 | average | fastANI | 9
GCF_013407165.1_ASM1340716v1_genomic.fna | 10_0 | 0.05 | average | fastANI | 10

The version of fastANI I used is 1.32 and the version of dRep is 2.6.2. Running dRep bonus test --check_dependencies gives:

Loading work directory
Checking dependencies
mash.................................... all good        (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good        (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good        (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good        (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. all good        (location = /install/software/fastani/1.32/fastANI)

Sorry about the new problem, and thanks a million for your time.

MrOlm commented 3 years ago

Hello,

OK interesting- thank you for pointing this out to me. You are indeed correct and apologies for not understanding how my own program works.

Unfortunately though I don't think I'll be able to help with this problem, since it seems to be an issue with FastANI. You can try and post an issue on the FastANI page (https://github.com/ParBLiSS/FastANI/issues), but they don't seem to be answered very often.

However, in future versions of dRep I will have it assign genomes that are in a primary cluster by themselves to a cluster without needing to run fastANI.

Best, Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

  1. So do you mean that secondary comparisons are indeed run on primary clusters that have only a single genome in them, and that what I have run and its output are correct?

  2. As for the problem I met with cluster 592, I should ask on the FastANI page, right? I will do that. However, if only one genome is in this cluster, I actually do not need the secondary comparison at all and can just keep it as its own cluster 592. Am I right?

Thanks

MrOlm commented 3 years ago

Hello,

Yes, you are correct on both points. FastANI failing doesn't matter in this case, because, no matter what, that genome would be cluster 592.

-Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

Thank you very much for the fast reply; it makes sense to me now.

Thanks again.

Biofarmer commented 3 years ago

Hi Dr. Olm,

FYI, I searched for the answer to my fastANI problem and found it here: https://github.com/ParBLiSS/FastANI/issues/21. After changing fragLen to something smaller, like 200, this genome worked with fastANI.
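For reference, a re-run along these lines (the output name is a placeholder; --fragLen is fastANI's fragment-length option, default 3000) is what made this genome work:

    fastANI --ql genomeList --rl genomeList --fragLen 200 -o fastANI_out_592_fragLen200 --matrix -t 50 --minFraction 0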

A small question about the fastANI command used by dRep (shown below): the last 10 letters are just there to distinguish the fastANI outputs in the fastANI_files folder and have no effect on the fastANI process itself, right?

$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Thanks

MrOlm commented 3 years ago

Interesting. Thank you for pointing this out to me.

Yes, the last 10 letters are randomly generated and just used to distinguish output files.

-Matt

Biofarmer commented 3 years ago

Hi Dr. Olm,

Thank you for the reply. Anyway, this is not a problem in the case of a single-genome cluster. That's great; I will now run all my genomes.

Thanks again.

Biofarmer commented 3 years ago

Hi Dr. Olm,

One question about the output of running 30,000 genomes together, with the command:

dRep compare output -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g genome_path.txt

There is no error reported for Step 1 (Cluster):

***************************************************
    ..:: dRep compare Step 1. Cluster ::..
***************************************************

Loading genomes from a list
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
4362 primary clusters made
Step 3. Perform secondary clustering
Running 21641432 fastANI comparisons- should take ~ 2887.0 min
Step 4. Return output
***************************************************
    ..:: dRep compare Step 2. Bonus ::..
***************************************************

However, in Cdb.csv one genome is missing compared to Bdb.csv (which contains all 30,000 genomes). I also checked the number of unique primary clusters in Cdb.csv, which is correct at 4362, and the number of comparisons for this genome in Mdb.csv, which is correct at 59,999; however, there is no record for it in Ndb.csv.

  1. Do you know why? Maybe it is still the FastANI issue discussed above. In addition, is it possible to check which cluster this genome comes from?

  2. If I leave this genome out, is the output of current clustering in Cdb still reliable?

Thanks Wang

Biofarmer commented 3 years ago

Hi Dr. Olm,

I have updated my question; could you take a look when you are available? Thanks

MrOlm commented 3 years ago

Hello,

This is an interesting problem and not something that I've encountered before. I don't see any way that the problem genome could impact anything else, so I believe the rest of the clustering output is still correct.

I think this is probably an issue with FastANI, though it's not clear to me what the issue is. The only way I can think of to tell which primary cluster the missing genome is in is to re-run the program with the --debug flag. This will create an additional file called CdbF.csv in the data_tables folder, which contains the clustering information for all genomes with primary clustering only. You can also run this additional step with the --SkipSecondary flag to make it run much faster if you'd like.
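As a sketch, that re-run could look something like this (the output directory is a placeholder; replace the genome name with the one missing from Cdb.csv):

    # Keep the intermediate CdbF.csv (primary clustering only) and skip the slow fastANI step
    dRep compare debug_out -p 50 -pa 0.90 --SkipSecondary --debug -g genome_path.txt
    # Then look up the missing genome's primary cluster
    grep "missing_genome.fna" debug_out/data_tables/CdbF.csv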

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Thanks for the reply. It is good to know that all the remaining clusters are fine.

Best, Wang

Biofarmer commented 3 years ago

Hi Matt,

When I run ~50,000 genomes, there is an error:

File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/pandas/core/reshape/reshape. py", line 124, in __init__
    raise ValueError('Unstacked DataFrame is too big, '
ValueError: Unstacked DataFrame is too big, causing int32 overflow

I think this is due to the large number of genomes, right? Is there any way to solve this problem without reducing the number of genomes? In addition, I could shorten my genome names to half their length if that would help.

Thanks Wang

MrOlm commented 3 years ago

Unfortunately this error cannot be overcome with the public version of dRep. The only option is to run it in two chunks of ~25,000 genomes, and then de-replicate the results of those two runs.
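A rough sketch of that chunked workflow, using the same settings as your earlier runs (directory names and genome lists are placeholders; dereplicated_genomes is where dRep dereplicate writes the winning genomes):

    # Dereplicate each ~25,000-genome chunk separately
    dRep dereplicate chunk1_out -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g chunk1_genomes.txt
    dRep dereplicate chunk2_out -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g chunk2_genomes.txt

    # Then dereplicate the winners of both chunks together
    ls chunk1_out/dereplicated_genomes/* chunk2_out/dereplicated_genomes/* > winners.txt
    dRep dereplicate combined_out -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g winners.txt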

I am working on the next version of dRep that will fix the problem. It's not quite finished yet, but if you'd like to give it a shot you can install it with the command pip install drep --upgrade --pre. The latest development version is v3.0.0.dev6.

When on the development version, you can add the flag --multiround_primary_clustering, which will make it run much faster and work much better overall for large genome sets like yours.

Biofarmer commented 3 years ago

Hi Matt,

Thank you very much.

Biofarmer commented 3 years ago

Hi Matt,

May I ask you a question? Suppose, for example, I have 1000 genomes and run dRep compare with

-pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI

Then suppose I remove a number of genomes from the 1000 and run the same command on the remaining genomes. In the output, the total number of resulting clusters for the remaining genomes should be the same as the number of clusters they formed when clustered together with all 1000 genomes, right?

In addition, in Ndb.csv I found, for example, genomes A and B from the same primary cluster, with ANI | AF values of 0.973602 | 0.4794189 and 0.966475 | 0.5440000 for the two comparisons. I would expect them to be in the same secondary cluster, so why are they given different secondary cluster numbers? The cluster_method in Ndb.csv is "average".

Thanks

MrOlm commented 3 years ago

Hello,

I think the cause of both of these problems is probably coming from the "average" clustering method. By default, hierarchical clustering is performed in dRep using average linkage (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) which can result in clustering changing depending on the genome set used. If you want all cases where two genomes are over your thresholds to be in the same cluster, you can run it in single mode (--clusterAlg single). The problem is that this can create big clusters- for example if A is similar to B, and B is similar to C, but A and C are not similar, what do you do? In single mode A, B, and C will be in the same cluster, and in average mode some averaging is done to try and handle this.

dRep can use any method of linkage listed at the following webpage by using the --clusterAlg argument: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
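For example, a run with single-linkage secondary clustering might look like this (the directory name and genome list are placeholders):

    dRep compare output_single -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI \
        --clusterAlg single -g genome_path.txt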

I would also encourage you to look at the secondary clustering plot for the primary cluster in question. That will show you the relationship between all of the genomes in that primary cluster and visualize the average linkage that led to those genomes being in different secondary clusters.

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Thank you for your reply; it makes sense to me. Another practical question about very large genome sets, for example more than 50,000 genomes: suppose I split the genomes into two parts, cluster each part independently to get its representatives, and afterwards cluster the representatives of the two parts together to get the final clusters for the whole set. That is still reliable, right? I am just wondering whether the final clusters would be slightly different from those obtained by clustering the whole genome set at once (I know you mentioned that is not possible at the moment).

Thanks

MrOlm commented 3 years ago

Hello,

Yes, doing what you suggested still leads to reliable results. Due to the complications of hierarchical clustering (as described in my last post), the final clustering will likely be slightly different from what you would get by running on the whole genome set, but it will largely be the same and still valid in my opinion. It's also worth noting that other high-profile studies have split their genomes in this way as well (for example https://www.nature.com/articles/s41587-020-0603-3).

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Thanks a million; you are very kind and fast. I also noticed that paper, which is very nice.

Thanks again. Wang

Biofarmer commented 3 years ago

Hi Matt,

May I ask you a question about annotating the clusters with GTDB-Tk v1.3.0 (https://github.com/Ecogenomics/GTDBTk)? After clustering with -pa 0.9 -sa 0.95 -nc 0.30 --S_algorithm fastANI, I found that some clusters (typically those sharing a primary cluster but with different secondary clusters) are annotated as the same species by GTDB-Tk. GTDB-Tk generally uses 95% ANI and 65% AF to cluster for annotation. Do you ever use GTDB, and do you know why this happens?

Many thanks Wang

MrOlm commented 3 years ago

Hi Wang,

Yes, I do use GTDB and like it in general. Their species clusters use ANI radii ranging from 95% to 97%, but if you were to cluster a lot of genomes using 95% ANI, you would generally get the same species calls as GTDB. The reason for 95% is that it roughly corresponds to species-level groups (https://msystems.asm.org/content/5/1/e00731-19.abstract).

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Many thanks for your reply. I now understand why 95% is selected as the boundary for clustering species, and it is a nice paper to read. As you said, GTDB also uses 95-97% ANI radii to cluster species, and we likewise use 95% in dRep to cluster a lot of MAGs. Why, then, are different MAG clusters made at 95% ANI assigned the same species by GTDB, which uses the same threshold (95-97% ANI)? Is it due to the very high number of MAGs (thousands or more), or are there other reasons?

Best, Wang

MrOlm commented 3 years ago

Hi Wang,

dRep and GTDB use almost the same threshold and method (95% ANI + fastANI), so they give almost the same results. Any differences are likely the result of random differences in which genomes end up being the "representative genome" around which the ANI radius is drawn with GTDB, and also some oddities regarding hierarchical clustering with dRep (see here for more info - https://drep.readthedocs.io/en/latest/choosing_parameters.html#oddities-of-hierarchical-clustering).

Best, Matt

Biofarmer commented 3 years ago

Hi Matt,

Many thanks for your reply. In that case, those clusters from dRep can still be considered different clusters even though they are annotated as the same species by GTDB, right?

MrOlm commented 3 years ago

Yes, I think so. That is what I do in my own research at least.

Best, Matt

Biofarmer commented 3 years ago

Hi Matt, many thanks, it makes sense to me as well. Best, Wang

Biofarmer commented 3 years ago

Hi Matt,

I just noticed that version 3.0.0 is available. May I ask what the improvements are in this new version, and what the maximum number of genomes is that can be compared together? In addition, approximately how much time would it take with that maximum number of genomes? Many thanks.

Best, Wang

MrOlm commented 3 years ago

Hello,

Here's a list of the improvements with each version - https://github.com/MrOlm/drep/blob/master/CHANGELOG.md

There's no hard-set "maximum" number of genomes; it depends on the computational resources available, how similar the genomes in your dataset are, etc. As a ball-park estimate, though, you could probably run 500,000 genomes on a machine with 1 TB of RAM and 40 cores over the course of a few days if you used both of the greedy clustering algorithms: https://drep.readthedocs.io/en/latest/choosing_parameters.html#using-greedy-algorithms

-Matt

Biofarmer commented 3 years ago

Hi Matt,

Many thanks for the reply. That means I can run all my genomes, for example 300,000, at once with 40 cores. I have carefully read about the greedy clustering algorithms. If I use fastANI as the S_algorithm, will the speed be almost the same with or without --greedy_secondary_clustering? If so, it does not matter whether I include --greedy_secondary_clustering, right? The flags --multiround_primary_clustering and --run_tertiary_clustering will for sure be included.

Thanks Wang

MrOlm commented 3 years ago

Hello,

When you're working with datasets this big, you'll probably have to use --greedy_secondary_clustering. The paper describing fastANI says it can do pairwise comparisons really fast, but in my hands it's too slow to be practical for very large clusters. It all depends on how big your primary clusters end up being, but once you have clusters of over ~500 genomes or so, fastANI becomes really slow. --greedy_secondary_clustering helps a lot in those cases.
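Putting the flags from this thread together, a large-scale run might look something like this (the directory name and genome list are placeholders; thresholds as in your earlier runs):

    dRep compare big_run_out -p 40 -pa 0.90 -sa 0.95 -nc 0.30 --S_algorithm fastANI \
        --multiround_primary_clustering --greedy_secondary_clustering \
        --run_tertiary_clustering -g genome_path.txt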

-Matt

Biofarmer commented 3 years ago

Hi Matt,

  1. Thank you very much for your advice. May I ask how the precision compares when including all of --multiround_primary_clustering, --greedy_secondary_clustering and --run_tertiary_clustering versus dividing the genomes into several chunks, like 150,000 and 150,000, as discussed above?

  2. As for --primary_chunksize, if clustering, for example, 300,000 genomes, should the value be increased a bit, e.g. to 10,000? Would that be better than the default, or does it not matter?

  3. In --greedy_secondary_clustering, the completeness, contamination, etc. of a genome are not taken into account when it is chosen to represent a cluster, right?

  4. I am a bit confused by the wording in the documentation: "If it's below ANI thresholds with that genome, it will be put in that cluster. If it's not, it will be put into a new cluster and made the representative genome of the new cluster." Should "below" be "above" (i.e. greater than the threshold)?

Thanks Wang

Biofarmer commented 3 years ago

Hi Matt,

I have installed the newest version of dRep, 3.2.0. Running dRep check_dependencies gives the output below:

mash.................................... all good        (location = /install/software/mincondas/3.7/instance3/bin/mash)
nucmer.................................. all good        (location = /install/software/mincondas/3.7/instance3/bin/nucmer)
checkm.................................. !!! ERROR !!!   (location = None)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /install/software/mincondas/3.7/instance3/bin/prodigal)
centrifuge.............................. !!! ERROR !!!   (location = None)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. all good        (location = /install/software/mincondas/3.7/instance3/bin/fastANI)

May I ask, if I run dRep compare with fastANI for secondary clustering, whether the tools marked "ERROR" are needed? I know ANIcalculator and nsimscan are not needed for fastANI, but what about checkm and centrifuge?

Thanks Wang

I have since read that centrifuge has been deprecated and that checkm is not needed for the dRep compare module. Thanks.