Open jolespin opened 6 months ago
I am not sure if I understand the question - do you want to use custom genomes as inputs instead of metagenomes?
If yes, off the top of my head, I think this should still work. As long as there are k-mers representing a gene in the sketch, the gene and the KO should be identified in the pipeline’s output. Although in that case, the relative abundances may not be very useful, rather the presence/absence of the KOs may be the more reliable result.
If no, could you kindly elaborate?
I haven’t thought about using sylph or skani, perhaps @dkoslicki may give us some insights.
> I am not sure if I understand the question - do you want to use custom genomes as inputs instead of metagenomes?
Yes, basically I have a bunch of genomes and I'm interested in whether or not certain KEGG orthologs are present without doing a large protein alignment or HMMSearch.
> If yes, off the top of my head, I think this should still work. As long as there are k-mers representing a gene in the sketch, the gene and the KO should be identified in the pipeline’s output.
In the manuscript, did you by any chance check out false positives?
> Although in that case, the relative abundances may not be very useful, rather the presence/absence of the KOs may be the more reliable result.
Excellent. That's what I need anyways.
> I haven’t thought about using sylph or skani, perhaps @dkoslicki may give us some insights.
I would give it a try. AFAIK sylph and skani are built using the same k-mer methodology from the same developer. sylph does k-mer-based profiling and skani is for ANI calculations. They are extremely fast and I use them quite a bit in my research.
Just to chime in, what @mahmudhera said is correct: you could just treat the genome(s) as a (very simplified) metagenome. If you want better recall, I would suggest a smaller scale factor (so the scale=500 version of the pre-built data). Personally, I'd lean towards an even smaller scale factor if your compute system can handle it. Let us know if you'd like a pre-built database with a scale factor closer to 10.
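For readers unfamiliar with why a smaller scale factor helps recall, here is a minimal sketch of the FracMinHash idea: a k-mer hash is kept only if it falls below `2**64 / scaled`, so roughly one in `scaled` k-mers survives. This is a toy illustration, not sourmash's implementation. The hash function (`blake2b`), the DNA alphabet, `k=21`, and all function names are assumptions chosen for the example.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k, scaled):
    """Keep only hashes below MAX_HASH / scaled, i.e. about 1/scaled of all k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < threshold}


# A random 20 kb "genome" purely for demonstration.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))

sketches = {s: frac_minhash(genome, k=21, scaled=s) for s in (1000, 100, 10)}
for scaled, sketch in sketches.items():
    print(f"scaled={scaled}: {len(sketch)} hashes retained")
```

Because the threshold grows as `scaled` shrinks, the sketch at a smaller scale factor contains every hash of the sketch at a larger one, plus more. A short gene contributes few k-mers, so at a large scale factor it may leave no hashes in the sketch at all, which is one way to see the recall/compute trade-off mentioned above.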
If you think it's worth a shot I can try it out. I have a bunch of genomes where I ran KOFAMSCAN so those can be considered true positives.
I'd be interested in hearing about what you observe. It definitely wasn't the original use case we designed the pipeline for, but it might still work (with lower recall, though)!
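A minimal sketch of how such a comparison could be scored, assuming the KofamScan calls are treated as ground truth and each method's output has already been reduced to a set of KO identifiers per genome. The KO sets below are made up for illustration, and `presence_metrics` is a hypothetical helper, not part of either tool.

```python
def presence_metrics(predicted, reference):
    """Compare predicted KO presence calls against a reference set."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # KOs found by both
    fp = len(predicted - reference)   # KOs called only by the sketch-based run
    fn = len(reference - predicted)   # KOs the sketch-based run missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if reference else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical example data for one genome.
kofamscan_kos = {"K00001", "K00002", "K00003", "K00004"}
sketch_kos = {"K00001", "K00002", "K00005"}

metrics = presence_metrics(sketch_kos, kofamscan_kos)
print(metrics)
```

Running this per genome and per scale factor would give a direct view of how recall changes as the scale factor shrinks. Note that KofamScan itself is not error-free, so "false positives" here only means "not called by KofamScan".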
I'd like to dive deeper into this. Ideally, I would like to be able to use this as a faster alternative to HUMAnN with the following usage:
I can use the following script to calculate the latter: https://github.com/jolespin/veba/blob/main/bin/scripts/module_completion_ratios.py
If this can work well, this would be insanely useful. I can help implement/test if necessary.
> Let us know if you'd like a pre-built database with a scale factor closer to 10
Would you mind trying out 10, 100, and 250?
Is the sketch database loaded entirely into memory?
on the sourmash/branchwater side, we'd love to help enable this, obviously.
> Is the sketch database loaded entirely into memory?
we have both in-memory and on-disk approaches implemented in the branchwater plugin for sourmash. I'd have to look to make sure both are compatible with the desired output, but IIRC this is using `sourmash gather`, so it should be fine.
I'll have to check about the loading into memory, but we're actually using `sourmash prefetch`, as we found that `gather` just further decreased the sensitivity without much gain in specificity.
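To illustrate why a greedy, gather-style assignment can report fewer hits than a prefetch-style overlap scan, here is a toy model. It is not sourmash's code: sketches are plain Python sets of integers, and `prefetch_like`, `gather_like`, the KO names, and the threshold are all hypothetical. The only behaviour assumed is that prefetch reports every reference whose overlap with the query passes a threshold, while gather repeatedly picks the best match and removes its hashes from the query before looking for the next one.

```python
def prefetch_like(query, references, threshold):
    """Report every reference sharing at least `threshold` hashes with the query."""
    overlaps = {name: len(query & hashes) for name, hashes in references.items()}
    return {name: n for name, n in overlaps.items() if n >= threshold}


def gather_like(query, references, threshold):
    """Greedily pick the best match, then remove its hashes from the query."""
    remaining = set(query)
    results = {}
    while True:
        candidates = {name: len(remaining & hashes)
                      for name, hashes in references.items()
                      if name not in results}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        if candidates[best] < threshold:
            break
        results[best] = candidates[best]
        remaining -= references[best]  # these hashes are now "explained"
    return results


# Two hypothetical KOs that share hashes 5, 6 and 7.
query = set(range(10))
references = {"KO_A": set(range(0, 8)), "KO_B": set(range(5, 10))}

prefetch_hits = prefetch_like(query, references, threshold=3)
gather_hits = gather_like(query, references, threshold=3)
print("prefetch-like:", prefetch_hits)
print("gather-like:  ", gather_hits)
```

In this toy case the prefetch-style scan reports both KOs, while the greedy pass assigns the shared hashes to `KO_A` and leaves `KO_B` with only two unexplained hashes, below the threshold. Genes with shared k-mers would behave similarly, which is consistent with the lower sensitivity described above.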
> Would you mind trying out 10, 100, and 250?
Sure! I'll let you know when they're formed
> I'll have to check about the loading into memory, but we're actually using `sourmash prefetch`, as we found that `gather` just further decreased the sensitivity without much gain in specificity.
k. Our on-disk branchwater stuff will work for prefetch too, but we should make sure we have the output formats right! Most of our work with different output formats has been on gather.
Circling back just in case it's useful. Regarding supporting sylph in addition to sourmash, it could be worth looking into for later versions. From the sylph preprint:
> Although sketching methods tend to be efficient, previous implementations such as Mash screen [19] or sourmash [21] have accuracy issues for low-abundance genomes. This is due to k-mer content being missing [19] as a result of sequenced reads not fully covering the genome, obfuscating ANI calculation. Thus, arbitrary thresholds are also required to identify present genomes, an unsatisfactory solution given the importance of detecting low-abundance microbes [22].
https://www.biorxiv.org/content/10.1101/2023.11.20.567879v2.full
Note: Not associated with this work at all but I've been using it quite a bit and the developer is very helpful with accommodating new features and explaining concepts.
I'll definitely look into that; been on my reading list for a bit and this is a good excuse to prioritize it!
Related: Going after low-abundance organisms, considering coverage, and determining what it means for two organisms to be "the same" (as in, in your sample and reference) was exactly our motivation for making YACHT: https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873 (note the increased completeness for YACHT in that link's figure 5 compared to sourmash gather). May or may not be useful to you, but thought I would mention it given the similar problem it's going after.
More sketches have been uploaded, please see below. They are NOT in SBT format though (this wouldn't affect results). https://github.com/KoslickiLab/fmh-funprofiler?tab=readme-ov-file#more-pre-built-sketches-with-different-scaling-factors
Is there a way to use this with custom genomes that are acquired de novo?
Also, would it be possible to support sylph or skani in future versions as an alternative to `sourmash`?