brendanf / optimotu

Optimize OTU clustering thresholds

ASV preparation #3

Open LukeLikesDirt opened 3 weeks ago

LukeLikesDirt commented 3 weeks ago

Hi brendanf

I have a dataset of ~140,000 ASVs that I want to run through optimotu. The ASVs have been assigned taxonomy. Do you have any examples or recommendations for preparing the ASVs before running them through optimotu?

Cheers Luke

brendanf commented 3 weeks ago

What exactly are you planning to do by running them through optimotu? Optimize thresholds or do the final clustering? If final clustering, what thresholds will you use?

LukeLikesDirt commented 3 weeks ago

For context, I am working on a publication for a soil fungal database. One reviewer suggested that the 97% threshold I used represents a compromise across different taxonomic groups, which is both obvious and fair. They recommended adopting your approach, and I am keen to integrate this dynamic clustering method into my workflow, as it is a valuable contribution to fungal bioinformatics.

However, due to compatibility issues with USEARCH on my Mac, I am unable to experiment with optimotu locally and need to run it on the HPC, which I had planned to do eventually. Additionally, because there currently is no manual for optimotu, I don’t think I can get this running without your help.

I have assigned taxonomy using dnabarcoder and UNITE v10.0, so I suppose I can use the clustering thresholds that have been developed here. Are you able to provide the steps I need to take to do the final clustering using the dnabarcoder thresholds? Or do you recommend I apply the thresholds with optimotu first?

brendanf commented 3 weeks ago

Ok, I see. Unfortunately, this package doesn't put together everything needed for the full dynamic clustering pipeline; it only contains the core workhorse functions. The full stand-alone pipeline is still in the process of being cleaned up for general release. Several papers have been published using development versions, and each includes the version it used in its supplementary info, e.g. Saine et al. 2023, Saine et al. 2024, Burg et al. 2024, and Ovaskainen et al. 2024. You could look at the code there, but if you aren't already pretty familiar with targets, the workflow manager used for the pipeline, it may be difficult to learn from. Also, I admit the code is pretty messy.

I am planning to get a public release of the pipeline ready this fall, and also to integrate it as an option in PipeCraft, but that may be too slow if you are already working on a revision.

Regarding thresholds, I have used a "bootstrap" method in Ovaskainen et al., where the thresholds are optimized based on the taxonomic identifications from the data. For the other projects, which are more limited in geographic scope, I have used the thresholds from Ovaskainen et al. However, for soil data I would recommend deriving new thresholds, because a lot of taxonomic groups are prevalent in soil but not well represented in Ovaskainen et al. (which is air samples). Also, if you have used a different amplicon than us, the thresholds will be somewhat different. I have unfortunately not had good luck optimizing thresholds based on the full UNITE database, probably due to some combination of mislabeled sequences, unnatural taxonomy, and long branches. I would be hesitant to use thresholds from dnabarcoder in optimotu, since they use different alignment algorithms and will thus calculate somewhat different distances; but I expect the differences to be minor.
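To make the idea of optimizing a threshold against taxonomic identifications concrete, here is a minimal toy sketch in Python. It is not optimotu's implementation or API. Everything in it is an assumption made for illustration: the function names (`single_linkage`, `pairwise_f1`, `best_threshold`), the choice of single-linkage clustering, the pairwise F-measure as the agreement score, and the made-up distances and species labels. It clusters a few identified sequences at several candidate distance thresholds and keeps the threshold whose clusters best match the known species.

```python
from itertools import combinations


def single_linkage(n, dist, threshold):
    """Cluster items 0..n-1 by joining every pair with distance <= threshold."""
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in dist.items():
        if d <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def pairwise_f1(clusters, labels):
    """F-measure over sequence pairs: same cluster vs. same taxon."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_taxon = labels[i] == labels[j]
        if same_cluster and same_taxon:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_taxon:
            fn += 1
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(dist, labels, candidates):
    """Return (threshold, score) for the best-scoring candidate threshold."""
    n = len(labels)
    scored = [
        (pairwise_f1(single_linkage(n, dist, t), labels), t) for t in candidates
    ]
    # Highest score wins; ties go to the smaller (stricter) threshold.
    score, threshold = max(scored, key=lambda s: (s[0], -s[1]))
    return threshold, score


# Hypothetical data: four sequences, two species, pairwise distances
# expressed as fractions (0.03 = 3% dissimilarity).
labels = ["sp_A", "sp_A", "sp_B", "sp_B"]
dist = {
    (0, 1): 0.01, (2, 3): 0.02,
    (0, 2): 0.05, (0, 3): 0.06, (1, 2): 0.05, (1, 3): 0.07,
}
candidates = [0.005, 0.015, 0.03, 0.10]

threshold, score = best_threshold(dist, labels, candidates)
print(threshold, score)
```

In this toy case the strictest threshold splits every sequence into its own cluster and the loosest one merges both species, so an intermediate value scores best. Repeating the search separately within each taxonomic group is what would yield group-specific thresholds instead of a single global compromise such as 97%.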

I would be willing to meet on Zoom to offer some pointers about what is needed to implement the dynamic clustering algorithm, and to try to clear up any confusion you have after looking at the pipeline code. brendan dot r dot furneaux at jyu dot fi.

LukeLikesDirt commented 3 weeks ago

Thanks for your thoughtful reply. My first step was to look at the code associated with Ovaskainen et al. 2024. However, because I am not familiar with targets, I found it hard to follow. I appreciate your offer to help and will send you an email now to organise a Zoom. Please let me know if you don't get it.

Cheers Luke