Closed weiguanyue closed 1 year ago
Yes, the only issue you may encounter has to do with genome incompleteness. You can find more info on that here: https://drep.readthedocs.io/en/latest/choosing_parameters.html#importance-of-genome-completeness. Another thing you could do is skip the primary clustering and just cluster them all with fastANI, for example
-Matt
Hi Matt,
Thank you very much for your answer. I have a follow-up question.
I now have over 5,000 fungal genomes that I want to dereplicate. First, I want to dereplicate them at the strain level. Here is my command:
dRep dereplicate drep99 -g .fna -p 25 -d -nc 0.3 -pa 0.9 -sa 0.99 --ignoreGenomeQuality --multiround_primary_clustering --skip_plots
Then I want to cluster at the species level, using:
dRep dereplicate drep99 -g .fna -p 25 -d -nc 0.3 -pa 0.9 -sa 0.95 --ignoreGenomeQuality --multiround_primary_clustering --skip_plots
Are these commands OK? CheckM is designed for prokaryotes, so I skip that step. Is there also a faster way to do this?
Best wishes Guanyue Wei
Hi @weiguanyue - those commands look good to me. The only thing to try to speed this up would be to add --S_algorithm fastANI. This is the default in newer versions of dRep, but was not the default in previous versions.
Hopefully this shouldn't take too long; it really depends on how big the fungal genomes are.
Best, Matt
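For reference, the two runs discussed above with --S_algorithm fastANI added could be sketched as follows. This is only an illustration: the run_drep helper, the DRY_RUN switch, the ./*.fna glob, and the drep95 directory name are assumptions that do not come from the thread. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: build the dRep command from this thread with
# --S_algorithm fastANI appended.
# run_drep, DRY_RUN, ./*.fna and drep95 are hypothetical names.

run_drep() {
    workdir=$1   # dRep work directory
    sa=$2        # secondary ANI threshold (-sa)
    shift 2      # remaining arguments are genome FASTA files
    set -- dRep dereplicate "$workdir" -g "$@" -p 25 -d -nc 0.3 -pa 0.9 \
        -sa "$sa" --ignoreGenomeQuality --multiround_primary_clustering \
        --skip_plots --S_algorithm fastANI
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"    # print the command only
    else
        "$@"         # actually run dRep
    fi
}

# Strain-level run (99% ANI), then species-level run (95% ANI).
# A separate work directory per threshold keeps the outputs apart.
run_drep drep99 0.99 ./*.fna
run_drep drep95 0.95 ./*.fna
```

Setting DRY_RUN=0 in the environment would execute the commands rather than print them.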
Hi Matt,
Thank you very much for your reply. I have one more question:
When I use the greedy algorithm (--greedy_secondary_clustering), do I need to add --run_tertiary_clustering as well? I am not sure whether adding or omitting --run_tertiary_clustering will affect my results when using the greedy algorithm, so I would like to ask for your help.
Best wishes Guanyue Wei
Hi @weiguanyue -
Yeah, when using the parameter --greedy_secondary_clustering it's also recommended to add --run_tertiary_clustering. This just ensures that you don't accidentally over-split clusters.
-MO
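One way to apply this recommendation consistently across many batch runs is a small wrapper that appends --run_tertiary_clustering whenever --greedy_secondary_clustering is present. The sketch below is a hypothetical helper (add_tertiary_if_greedy is not part of dRep). It only prints the resulting command line, and the work directory and genome paths are placeholders.

```shell
#!/bin/sh
# Sketch only: add_tertiary_if_greedy is a hypothetical helper, not a dRep feature.
# It prints its arguments, adding --run_tertiary_clustering when
# --greedy_secondary_clustering is present and the tertiary flag is missing.

add_tertiary_if_greedy() {
    greedy=0
    tertiary=0
    for arg in "$@"; do
        case $arg in
            --greedy_secondary_clustering) greedy=1 ;;
            --run_tertiary_clustering) tertiary=1 ;;
        esac
    done
    if [ "$greedy" = 1 ] && [ "$tertiary" = 0 ]; then
        set -- "$@" --run_tertiary_clustering
    fi
    echo "$*"
}

# Placeholder work directory and genome path.
add_tertiary_if_greedy dRep dereplicate drep99 -g genomes.fna -sa 0.99 \
    --greedy_secondary_clustering
```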
Hi Matt,
Sorry to bother you again. I am now using dRep to dereplicate a batch of genomes. Because the genomes are large and numerous, I am running them in batches. A few batches completed successfully, but others failed, for example with:
FileNotFoundError: [Errno 2] No such file or directory: '/06-1000drep99/data/greedy_clustering/fastANI_out_uypngsdoua'
or, when running with --run_tertiary_clustering:
..:: dRep dereplicate Step 1. Filter ::..
Will filter the genome list
645 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
..:: dRep dereplicate Step 2. Cluster ::..
Running primary clustering
Running pair-wise MASH clustering
136 primary clusters made
Running secondary clustering
Running 70819 fastANI comparisons- should take ~ 43.1 min
Step 4. Return output
..:: dRep dereplicate Step 3. Choose ::..
Loading work directory
Calculating centrality using Mash
..:: dRep dereplicate Step 4. Evaluate ::..
Running tertiary clustering on genome representatives
Running primary clustering
Running pair-wise MASH clustering
123 primary clusters made
Running secondary clustering
Running 421 fastANI comparisons- should take ~ 10.6 min
FileNotFoundError: [Errno 2] No such file or directory: '/08-645drep99/data/tertiary_clustering/data/fastANI_files/fastANI_out_zdxfyzlrad
These runs have been going for several days. In the first case no dereplicated genomes are produced; in the second case dereplicated genomes are reported. I have spent a long time trying to find the cause without success, so I would appreciate any help you can provide.
Best wishes Guanyue
Hi @weiguanyue - huh, that's a bug I haven't encountered before. It could be a problem with fastANI not being able to handle large genomes? The two things I can think to do are 1) update fastANI and 2) if that doesn't solve it, use another algorithm (like ANImf) instead.
Best, Matt
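The two suggestions above could be tried roughly as follows. This is a sketch under assumptions: rerun_with_algorithm is a hypothetical helper that only prints a command, the drep99_animf directory and ./*.fna glob are placeholders, and the remaining flags are copied from the commands earlier in this thread.

```shell
#!/bin/sh
# Sketch only. Step 1: report which fastANI is installed, so it can be
# compared against the latest release before updating.
if command -v fastANI >/dev/null 2>&1; then
    fastANI --version || true
else
    echo "fastANI not found on PATH"
fi

# Step 2: print a rerun command that uses a different secondary algorithm.
# rerun_with_algorithm is a hypothetical helper; it does not run dRep.
rerun_with_algorithm() {
    algorithm=$1   # e.g. ANImf, as suggested above
    workdir=$2     # a fresh work directory (placeholder name)
    shift 2        # remaining arguments are genome FASTA files
    echo "dRep dereplicate $workdir -g $* -p 25 -nc 0.3 -pa 0.9 -sa 0.99 --ignoreGenomeQuality --S_algorithm $algorithm"
}

rerun_with_algorithm ANImf drep99_animf ./*.fna
```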
I want to remove redundancy from eukaryotic genomes. Can I skip secondary clustering and only run MASH clustering?