MrOlm / drep

Rapid comparison and dereplication of genomes

Remove redundancy from the eukaryotic genomes #195

Closed weiguanyue closed 1 year ago

weiguanyue commented 1 year ago

I want to remove redundancy from eukaryotic genomes. Can I skip secondary clustering and run only Mash clustering?

MrOlm commented 1 year ago

Yes. The only issue you may encounter has to do with genome incompleteness. You can find more info on that here: https://drep.readthedocs.io/en/latest/choosing_parameters.html#importance-of-genome-completeness. Another option is to skip the primary clustering and cluster all of the genomes with fastANI, for example.

-Matt
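For reference, the two options mentioned above might look roughly like the dry-run sketch below. The flag names `--SkipSecondary` and `--SkipMash` are how I understand dRep's options, so please confirm them with `dRep dereplicate -h` for your installed version. The genome path and output directory names are hypothetical.

```shell
# Dry run: print the two variants instead of executing them.
# Assumption: genomes live under genomes/ with a .fna extension.
GENOMES='genomes/*.fna'

# Variant A: Mash-only clustering, no secondary ANI comparison.
mash_only="dRep dereplicate out_mash_only -g ${GENOMES} --SkipSecondary --ignoreGenomeQuality"

# Variant B: no Mash pre-clustering, every genome compared with fastANI.
ani_only="dRep dereplicate out_ani_only -g ${GENOMES} --SkipMash --S_algorithm fastANI --ignoreGenomeQuality"

echo "$mash_only"
echo "$ani_only"
```

Variant B compares every genome against every other one, so it will likely be much slower than Variant A on large collections.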

weiguanyue commented 1 year ago

Hi Matt,

Thank you very much for your answer. I have a follow-up question.

I now have over 5,000 fungal genomes that I want to dereplicate. First, I want to dereplicate at the strain level. Here is my command:

dRep dereplicate drep99 -g .fna -p 25 -d -nc 0.3 -pa 0.9 -sa 0.99 --ignoreGenomeQuality --multiround_primary_clustering --skip_plots

Then I want to cluster at the species level with:

dRep dereplicate drep99 -g .fna -p 25 -d -nc 0.3 -pa 0.9 -sa 0.95 --ignoreGenomeQuality --multiround_primary_clustering --skip_plots

Are these commands OK? CheckM is designed for prokaryotes, so I skip that step. I would also like to ask whether there is a faster way to do this.

Best wishes Guanyue Wei

MrOlm commented 1 year ago

Hi @weiguanyue - those commands look good to me. The only thing to try to speed this up would be to add --S_algorithm fastANI. This is the default in newer versions of dRep, but was not the default in previous versions.

Hopefully this shouldn't take too long; it really depends on how big the fungal genomes are.

Best, Matt
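As a rough sketch of that suggestion, the two passes could be written as below with the algorithm set explicitly. All flags are the ones quoted in this thread. The `GENOME_GLOB` variable, the `drep_cmd` helper, and the separate `drep95` work directory for the second pass are assumptions made for illustration. The script only prints the commands.

```shell
# Dry-run sketch: prints the dRep commands rather than running them.
# Assumption: genome FASTA files match genomes/*.fna (override via GENOME_GLOB).
GENOME_GLOB="${GENOME_GLOB:-genomes/*.fna}"

drep_cmd() {
  # $1 = work directory, $2 = secondary ANI threshold (-sa)
  echo "dRep dereplicate $1 -g ${GENOME_GLOB} -p 25 -d -nc 0.3 -pa 0.9 -sa $2" \
       "--S_algorithm fastANI --ignoreGenomeQuality --multiround_primary_clustering --skip_plots"
}

drep_cmd drep99 0.99   # strain-level pass
drep_cmd drep95 0.95   # species-level pass (hypothetical separate work directory)
```

Once the printed commands look right, piping one to the shell (for example `drep_cmd drep99 0.99 | sh`) would execute it.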

weiguanyue commented 1 year ago

Hi, Matt

Thank you very much for your reply. Here, I have one more question:

When I use the greedy algorithm (--greedy_secondary_clustering), do I also need to add --run_tertiary_clustering? I am not sure whether including or omitting --run_tertiary_clustering will affect my results when using the greedy algorithm, so I would like to ask for your help.

Best wishes Guanyue Wei

MrOlm commented 1 year ago

Hi @weiguanyue -

Yeah, when using the parameter --greedy_secondary_clustering it's also recommended to add --run_tertiary_clustering. This just ensures that you don't accidentally over-split clusters.

-MO
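One way to avoid forgetting the second flag is a small pre-flight check on the argument string before launching a long run. This is only a sketch: `check_greedy_flags` is a hypothetical helper, not part of dRep, and it does nothing beyond string matching on the two flag names mentioned above.

```shell
# Hypothetical helper: checks that greedy secondary clustering is paired
# with tertiary clustering in a dRep argument string.
check_greedy_flags() {
  case " $1 " in
    *" --greedy_secondary_clustering "*)
      case " $1 " in
        *" --run_tertiary_clustering "*) echo "ok" ;;
        *) echo "warn: add --run_tertiary_clustering"; return 1 ;;
      esac ;;
    *) echo "ok" ;;   # greedy clustering not requested, nothing to check
  esac
}

ARGS="-sa 0.99 --S_algorithm fastANI --greedy_secondary_clustering --run_tertiary_clustering"
check_greedy_flags "$ARGS"
```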

weiguanyue commented 1 year ago

Hi, Matt

Sorry to bother you again. I am now using dRep to dereplicate a set of genomes. Because the genomes are large and numerous, I am processing them in batches. Some batches completed successfully, but others failed, for example with:

FileNotFoundError: [Errno 2] No such file or directory: '/06-1000drep99/data/greedy_clustering/fastANI_out_uypngsdoua'

or, when run with --run_tertiary_clustering:


..:: dRep dereplicate Step 1. Filter ::..

Will filter the genome list
645 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering

..:: dRep dereplicate Step 2. Cluster ::..

Running primary clustering
Running pair-wise MASH clustering
136 primary clusters made
Running secondary clustering
Running 70819 fastANI comparisons- should take ~ 43.1 min
Step 4. Return output

..:: dRep dereplicate Step 3. Choose ::..

Loading work directory
Calculating centrality using Mash

..:: dRep dereplicate Step 4. Evaluate ::..

Running tertiary clustering on genome representatives
Running primary clustering
Running pair-wise MASH clustering
123 primary clusters made
Running secondary clustering
Running 421 fastANI comparisons- should take ~ 10.6 min
FileNotFoundError: [Errno 2] No such file or directory: '/08-645drep99/data/tertiary_clustering/data/fastANI_files/fastANI_out_zdxfyzlrad

These runs took several days. In the first case no dereplicated genomes are produced, while in the second case dereplicated genomes are reported. I have spent a long time trying to find the cause without success, so I would appreciate any help you can provide.

Best wishes Guanyue

MrOlm commented 1 year ago

Hi @weiguanyue - huh, that's a bug I haven't encountered before. It could be that fastANI is not able to handle large genomes. The two things I can think to do are 1) update fastANI and 2) if that doesn't solve it, use another algorithm (like ANImf) instead.

Best, Matt
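The second suggestion could be scripted as a simple fallback: try fastANI first and, if dRep exits with an error, retry with ANImf. The sketch below is an illustration only. `dereplicate_with_fallback` and the `DREP` variable are hypothetical, and writing each attempt to its own work directory is an assumption made to keep a failed run from interfering with the retry. By default it only prints the command.

```shell
# Dry run by default: DREP="echo dRep" prints the command.
# Set DREP=dRep to actually execute it.
DREP="${DREP:-echo dRep}"

dereplicate_with_fallback() {
  # $1 = work directory prefix; remaining arguments are passed to dRep unchanged
  outdir="$1"; shift
  for alg in fastANI ANImf; do
    if $DREP dereplicate "${outdir}_${alg}" --S_algorithm "$alg" "$@"; then
      return 0
    fi
    echo "dRep failed with ${alg}" >&2
  done
  return 1
}

dereplicate_with_fallback 08-645drep99 -p 25 -sa 0.99 --ignoreGenomeQuality
```

ANImf is generally slower than fastANI, so the retry may take considerably longer on large fungal genomes.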