How can I use this with custom genomes?

jolespin commented 6 months ago

Is there a way to use this with custom genomes that are acquired de novo?

Also, would it be possible to support sylph or skani in future versions as an alternative to sourmash?

mahmudhera commented 6 months ago

I am not sure if I understand the question - do you want to use custom genomes as inputs instead of metagenomes?

If yes, off the top of my head, I think this should still work. As long as there are k-mers representing a gene in the sketch, the gene and the KO should be identified in the pipeline’s output. Although in that case, the relative abundances may not be very useful, rather the presence/absence of the KOs may be the more reliable result.

If no, could you kindly elaborate?

I haven’t thought about using sylph or skani, perhaps @dkoslicki may give us some insights.

jolespin commented 6 months ago

> I am not sure if I understand the question - do you want to use custom genomes as inputs instead of metagenomes?

Yes, basically I have a bunch of genomes and I'm interested in whether certain KEGG orthologs are present without doing a large protein alignment or HMMSearch.

> If yes, off the top of my head, I think this should still work. As long as there are k-mers representing a gene in the sketch, the gene and the KO should be identified in the pipeline’s output.

In the manuscript, did you by any chance check out false positives?

> Although in that case, the relative abundances may not be very useful, rather the presence/absence of the KOs may be the more reliable result.

Excellent. That's what I need anyways.

> I haven’t thought about using sylph or skani, perhaps @dkoslicki may give us some insights.

I would give it a try. AFAIK sylph and skani are built on the same k-mer methodology by the same developer. sylph does k-mer-based profiling and skani does ANI calculations. They are extremely fast and I use them quite a bit in my research.

dkoslicki commented 6 months ago

Just to chime in, what @mahmudhera said is correct: you could just treat the genome(s) as a (very simplified) metagenome. If you want better recall, I would suggest a smaller scale factor (so the scale=500 version of the pre-built data). Personally, I'd lean towards an even smaller scale factor if your compute system can handle it. Let us know if you'd like a pre-built database with a scale factor closer to 10.
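To illustrate why a smaller scale factor can help recall, here is a toy FracMinHash sketch in Python. It is not the sourmash implementation: the hash function, the k-mer size of 21, the random "genome", and the 1 kb "gene" are all assumptions made for illustration. The point is only that a sketch keeps roughly 1/scaled of the distinct k-mers, so a short gene may contribute very few (or zero) hashes at scaled=1000 but many more at scaled=10.

```python
import hashlib
import random

MAX_HASH = 2**64


def kmer_hash(kmer: str) -> int:
    # Toy 64-bit hash taken from SHA-256; sourmash uses a different hash function.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def fracminhash(seq: str, k: int, scaled: int) -> set:
    # Keep only hashes below MAX_HASH / scaled, i.e. roughly 1/scaled of all k-mers.
    threshold = MAX_HASH // scaled
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200_000))  # made-up sequence
gene = genome[50_000:51_000]  # hypothetical 1 kb gene inside the genome

for scaled in (1000, 500, 100, 10):
    n_hashes = len(fracminhash(gene, 21, scaled))
    print(f"scaled={scaled}: {n_hashes} hashes retained for the gene")
```

Because the threshold only grows as the scale factor shrinks, the scaled=1000 sketch is always a subset of the scaled=10 sketch; the cost is a proportionally larger sketch and database.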

jolespin commented 6 months ago

If you think it's worth a shot I can try it out. I have a bunch of genomes where I ran KOFAMSCAN so those can be considered true positives.
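A minimal sketch of how such a comparison could be scored, assuming the KofamScan calls are treated as ground truth and both tools are reduced to a set of KO identifiers per genome. The function name and the KO sets below are made up for illustration; they are not real results.

```python
def presence_absence_metrics(predicted: set, truth: set) -> dict:
    # Compare predicted KO presence calls against a reference set.
    tp = len(predicted & truth)   # KOs called by both
    fp = len(predicted - truth)   # KOs called only by the profiler
    fn = len(truth - predicted)   # KOs missed by the profiler
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical KO calls for a single genome (illustrative IDs only).
kofamscan_kos = {"K00001", "K00002", "K00003", "K00004"}
profiler_kos = {"K00001", "K00002", "K00005"}

metrics = presence_absence_metrics(profiler_kos, kofamscan_kos)
print(metrics)
```

Running this per genome and per scale factor would show how recall and false positives trade off as the scale factor changes.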

dkoslicki commented 6 months ago

I'd be interested in hearing about what you observe. It definitely wasn't the original use case we designed the pipeline for, but it might still work (with lower recall, though)!

jolespin commented 6 months ago

I'd like to dive deeper into this. Ideally, I would like to be able to use this as a faster alternative to HUMAnN with the following usage:

Input:
- Genome fasta
Output:
- Table of KO Abundance
- Table of KEGG Module Completion Ratio

I can use the following script to calculate the latter: https://github.com/jolespin/veba/blob/main/bin/scripts/module_completion_ratios.py
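For readers unfamiliar with the idea, here is a deliberately simplified sketch of a module completion ratio. It assumes a module is a flat list of steps, each satisfied by any one of several alternative KOs; real KEGG module definitions are nested expressions, and the linked script may compute the ratio differently. The module and KO names are placeholders, not real KEGG entries.

```python
def module_completion_ratio(steps: list, detected_kos: set) -> float:
    # A step counts as complete if at least one of its alternative KOs was detected.
    if not steps:
        return 0.0
    satisfied = sum(1 for alternatives in steps if alternatives & detected_kos)
    return satisfied / len(steps)


# Hypothetical 4-step module; each set lists interchangeable KOs for one step.
toy_module = [{"K_A1", "K_A2"}, {"K_B"}, {"K_C1", "K_C2"}, {"K_D"}]
detected = {"K_A2", "K_C1", "K_D"}  # placeholder KO presence calls for one genome

ratio = module_completion_ratio(toy_module, detected)
print(f"completion ratio: {ratio:.2f}")
```

Since this only needs presence/absence of KOs, it fits the genome use case even if the relative abundances turn out not to be meaningful.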

If this can work well, this would be insanely useful. I can help implement/test if necessary.

> Let us know if you'd like a pre-built database with a scale factor closer to 10

Would you mind trying out 10, 100, and 250?

Is the sketch database loaded entirely into memory?

ctb commented 6 months ago

on the sourmash/branchwater side, we'd love to help enable this, obviously.

> Is the sketch database loaded entirely into memory?

we have both in-memory and on-disk approaches implemented in the branchwater plugin for sourmash. I'd have to look to make sure both are compatible with the desired output, but IIRC this is using sourmash gather so it should be fine.

dkoslicki commented 6 months ago

I'll have to check about the loading into memory, but we're actually using sourmash prefetch, as we found that gather just further decreased the sensitivity without much gain in specificity.
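A toy model of why the two strategies can differ in sensitivity, assuming prefetch reports every reference whose sketch overlaps the query and gather greedily picks the best match and removes its hashes before looking for the next one. This is plain Python over integer sets, not sourmash code; the reference names, hash values, and `min_overlap` threshold are hypothetical.

```python
def prefetch(query: set, refs: dict, min_overlap: int = 1) -> dict:
    # Report every reference that shares at least min_overlap hashes with the query.
    overlaps = {name: len(query & hashes) for name, hashes in refs.items()}
    return {name: n for name, n in overlaps.items() if n >= min_overlap}


def gather(query: set, refs: dict, min_overlap: int = 1) -> dict:
    # Greedy: take the best match, remove its hashes from the query, repeat.
    remaining = set(query)
    results = {}
    while True:
        best_name, best_overlap = None, 0
        for name, hashes in refs.items():
            if name in results:
                continue
            overlap = len(remaining & hashes)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None or best_overlap < min_overlap:
            break
        results[best_name] = best_overlap
        remaining -= refs[best_name]
    return results


query = {1, 2, 3, 4, 5, 6}
refs = {
    "geneA": {1, 2, 3, 4},
    "geneB": {3, 4},        # all of its hashes are also in geneA
    "geneC": {5, 6, 7},
}

print("prefetch:", prefetch(query, refs))
print("gather:  ", gather(query, refs))
```

In this toy case geneB is reported by prefetch but dropped by gather, because its hashes were already claimed by geneA. For genes that share k-mers, that is one way a greedy assignment can lose true hits.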

> Would you mind trying out 10, 100, and 250?

Sure! I'll let you know when they're built.

ctb commented 6 months ago

> I'll have to check about the loading into memory, but we're actually using sourmash prefetch, as we found that gather just further decreased the sensitivity without much gain in specificity.

k. Our on-disk branchwater stuff will work for prefetch too, but we should make sure we have the output formats right! Most of our work with different output formats has been on gather.

jolespin commented 6 months ago

Circling back in case it's useful: regarding supporting sylph in addition to sourmash, it could be worth looking into for later versions.

> Although sketching methods tend to be efficient, previous implementations such as Mash screen [19] or sourmash [21] have accuracy issues for low-abundance genomes. This is due to k-mer content being missing [19] as a result of sequenced reads not fully covering the genome, obfuscating ANI calculation. Thus, arbitrary thresholds are also required to identify present genomes, an unsatisfactory solution given the importance of detecting low-abundance microbes [22].

https://www.biorxiv.org/content/10.1101/2023.11.20.567879v2.full

Note: Not associated with this work at all but I've been using it quite a bit and the developer is very helpful with accommodating new features and explaining concepts.
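A back-of-the-envelope sketch of the coverage effect described in the quote, assuming the number of times each k-mer is sequenced follows a Poisson distribution with mean equal to the k-mer coverage. That is a simplification chosen for illustration and not necessarily the exact model any of these tools use. Under it, the expected fraction of a genome's k-mers seen at least once is 1 - exp(-coverage).

```python
import math


def expected_kmer_fraction(coverage: float) -> float:
    # Poisson assumption: P(k-mer observed at least once) = 1 - P(zero observations).
    return 1.0 - math.exp(-coverage)


for coverage in (0.1, 0.5, 1.0, 3.0, 10.0):
    fraction = expected_kmer_fraction(coverage)
    print(f"coverage {coverage:>4}: ~{fraction:.1%} of k-mers expected to be observed")
```

At low coverage most of a genome's k-mers are simply absent from the reads, so containment computed directly from the sketch underestimates how similar the genome really is.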

dkoslicki commented 6 months ago

I'll definitely look into that; been on my reading list for a bit and this is a good excuse to prioritize it!

Related: Going after low abundance organisms, considering coverage, and determining what it means for two organisms to be "the same" (as in, in your sample and reference) was exactly our motivation for making YACHT: https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873 (note the increased completeness for YACHT in that link's figure 5 compared to sourmash gather). May or may not be useful to you, but thought I would mention it given the similar problem it's going after.

ShaopengLiu1 commented 5 months ago

More sketches have been uploaded; please see the link below. They are NOT in SBT format, though (this wouldn't affect results). https://github.com/KoslickiLab/fmh-funprofiler?tab=readme-ov-file#more-pre-built-sketches-with-different-scaling-factors
