AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0
197 stars 25 forks source link

Expand HMM searching beyond Pfam? #24

Closed halexand closed 5 months ago

halexand commented 3 years ago

Hi @AstrobioMike,

I have been loving GToTree! I was just wondering if you have any plans to allow HMMER searching of target genomes using something other than PFam accessions (e.g. custom HMM profiles or perhaps the KOFam databases)? Asking as PFam doesn't have all the genes I would like.

Referencing this point from your wiki:

GToTree uses HMMER3 to search each genome for any additional target genes of interest based on PFam accessions, and produces an Interactive Tree of Life-compatible mapping file for easy visualization.

Apologies if this is already a feature and I missed it in the documentation!

AstrobioMike commented 3 years ago

Hey there, Harriet!

Awesome, thanks! :)

And thanks for writing in about this, and I don't think you're missing anything. I'm a bit behind on a couple things regarding additional HMM capabilities (whether for making the actual tree, or for searching for extra stuff), mostly because it will require implementing something new for handling cutoff thresholds in the different ways people have them defined and want to provide them, and it can get kind of expansive trying to cover all the different types of cutoffs people might have and where they are stored (as in, in the HMM or separate file). So i think I've just sub-consciously defaulted to not increasing this functionality until specific use-cases arise, haha, so this is perfect :)

I was excited thinking about adding in KOFams in the same way PFams are there – so that we could just provide a file with KO IDs we'd want to find, and the rest could be taken care of for us. But I'm seeing now that's gonna be a bit trickier for 2 reasons: 1) the KO HMMs aren't available individually, so we can't just build a link to download the individual ones we want, rather we'd have to download them all (not a show-stopper though); and 2) the more complicated problem is that for some KOs they use a domain-level cutoff and for others they use the full-protein score cutoff, so the HMM command is going to have to be built differently for each, and I want to spend some time better understanding how exactly kofamscan handles doing this and their filtering (unfortunately i don't know ruby at all, so trying to find my way through the kofamscan codebase to see exactly how they are doing things is not immediately going as smoothly as i'd like, ha). Packaging kofamscan with GToTree is an idea that would make all of this straightforward. But that's kind of a hefty addition to tack on, so i'm not sure if i'm there yet. I think i need to spend some time looking at/thinking about this before adding a fuller and standard integration like you got me excited about

In the meantime, there could and should be a way to use any custom HMMs, which I'd be happy to add in. I'm just still not sure the best way to communicate the cutoff thresholds for each individual HMM from the user to the program (including the type and where they are located). Are you thinking of some specific KO HMMs only, or are you also thinking about doing it with other HMMs that might have their cutoff scores stored differently?

The last thing I'd suggest is if I wanted to do this now, and so needed to outside of GToTree, I'd probably use my bit-dl-ncbi-assemblies program (available in this toolkit) to download all the amino-acid files of the target genomes, as that can just take the same input genome accession file as the one given to GToTree, and then do the HMM scanning however is easiest.

halexand commented 3 years ago

Hey Mike!

Thanks for the detailed responses! I am not surprised that it was already on your radar. Taking a look at the kofam list now and I see what you mean regarding cutoffs being a bit tricky. I am afraid I don't know ruby either... but I am curious to follow a long as you move forward with integration.

To answer your question:

Are you thinking of some specific KO HMMs only, or are you also thinking about doing it with other HMMs that might have their cutoff scores stored differently?

Both! There are some genes I am curious about that exist outside of the realm of KEGG. But, for the immediate future this is specific to KO HMMs. I had been doing this a bit manually (building alignments, creating profiles, iteratively searching, and then determining cutoffs). That being said, I think that having KO HMMs available would be amazing!

For now, I will totally check out your bit toolkit (super cool! didn't know about it) and use kofamscan (as right now I am mostly thinking about some pesky KOs.

Excited to see how it progresses!

AstrobioMike commented 3 years ago

Stellar, thanks, Harriet :)

I'll focus on integrating KOFams first. And thanks for sharing your workflow, looking it over reminded me that we could also search and just report everything, and allow the user to do whatever filtering we want (this is obvious now, but i was apparently kind of stuck in thinking it needed to be all or nothing, and output filtered and finished count tables and seqs for some reason). But seeing how you're dealing with filtering and looking at things later anyway, i see it could still be useful to be able to do the searches on the fly with all genomes being pulled, even if it means doing some parsing of the additional HMM search results afterwards.

I'll keep you posted :)

AstrobioMike commented 2 years ago

KO searching is finally implemented as of v1.6.31 :)

I ended up just incorporating kofamscan itself; all the data is downloaded and stored if/when it's used for the first time; a list of target KOs is passed as a single-column file to the -K parameter; and at the start of the run, we parse down the full KO data to just the targets and scan each genome as we go 👍

Hopefully it might still be useful to you sometime despite it being 16 months later, haha

Regarding custom HMMs, if you have one or two you have made that you can share with me, and let me know if you typically use e-value or score cutoffs (or both), i think I can implement something pretty quickly this time :)

ckeuschn commented 1 year ago

Hi @AstrobioMike,

Great software, really helpful for my work! I was also interested in using my own HMMs to look into gene distribution in genomes of of specific branches where I find my MAGs fit into. I attached two HMMs I used next to others currently and they are in the folder structure required to run them on ANVI'O databases. Would that structure maybe an option to implement it in GToTree? The score and some information is stored in text files next to the model. I am sure I am missing something here, since I am a pure user of software.

Anyways maybe you can point me to a workaroud here. What I need in the end is (obviously) a tree file and some output which can be used to highlight the genomes with hits of the HMM in the tree (like you implemented so nicely for PFAMs).

Any suggestions are highly appreciated! Reading this thread I will have a look at your bit-gen-iToL-map maybe that gets me where I need to be :) Thanks Christoph

cymA.zip bssA.zip

AstrobioMike commented 1 year ago

Hey there, Christoph!

Thanks for the kind words :)

And i'm glad you brought this up again (the ability to count targets from any given hmms in all input genomes). I mentioned in my last post above that once i knew the scoring cutoff we want to use, i think I could implement something pretty quickly. But that was probably only true then, when i had just been deep in the GToTree weeds and looking at this, ha. It might take me a little bit to get this integrated

But I think a much easier path for me to do so, with how things are done currently, would be if we provide a tab-delimited file where the first column holds the path to the hmm file, and the second column holds the E-value cutoff. So in this case, it might look something like this (if the hmm files were in the current working directory):

hmm_path    e_value_cutoff
cymA.hmm    1e-5
bssA.hmm    1e-5

And then we'd provide that tab-delimited file to a new input parameter with the main GToTree call.

Do you think that kind of setup would work for you?

(also, in the meantime, like you mentioned bit-gen-iToL-map above, i think that should certainly help you quickly make the corresponding file for coloring things in iToL if you have already screened the genomes for your functions of interest 👍)

Thanks again!

ckeuschn commented 1 year ago

Hi, yes the soluiton with a tab delimited file would be most fantastico :). Do you think it might be worth also including a column specifing the sequence bit score which is passed to the -T flag in hmmsearch?

hmm_path e_value_cutoff bit_score_cutoff cymA.hmm 1e-5 100 bssA.hmm 1e-5 150 Although, if the model is searched on each genome individually, you should have similar database sizes so e-value might be all one needs.

Cheers!

AstrobioMike commented 1 year ago

Oh I didn't see there was a bitscore cutoff included in the examples you attached, I only found the e values in the noise cutoff file, so thought that's how you had established the filtering for these (maybe that's just all anvio includes)

Generally speaking, I prefer bitscores over e-values. But with hmms, I've only used gathering thresholds before. Do you know if we can specify both cutoffs (evalue and bitscore) to be applied with one HMMER search? I can of course figure that out when I get to working on this, but figured I'd ask in case you know for sure already (edit: looking through the hmmer manual, it sounds like it's one or the other)

Also, I feel like it's difficult enough to determine one cutoff for a new model someone is constructing. Do you typically make new ones and then establish a suitable evalue and bitscore cutoff, rather than just say a bitscore cutoff?

Sorry for all the questions, just want this to be useful AND as straightforward as reasonably possible. So I don't want to require more in the input file than what would be suitable for the typical use-case :)

ckeuschn commented 1 year ago

Yes, usually that is what I do. I make a model and then evaluate a suitable e-value (and bit-score if applicable) threshold where I trust the outcome. But in the case of GToTree it might be really sufficient to go with the 'sequence e-value' as suggested by you. Generally you can specifiy both in an hmmsearch the e-value and bit-score as a threshold, yes. Thanks again for your interest to implement that!

AstrobioMike commented 1 year ago

Thanks, Christoph

It looks like in the manual that you can't filter on both evalue and bit score, e.g.:

image

Is that out of date maybe? (Or maybe it's just odd wording and setting both will actually use both, I'd have to test)

I only put e-value at first because that's what I saw in the files you attached. I'd actually be inclined towards bit score rather than e-value (if not having a gathering threshold as an option). Do you have a preference for e-value vs bit score if only one can be used by hmmsearch?

ckeuschn commented 1 year ago

Yes you are right, only one of them is possible: Failed to parse command line: Option -E is incompatible with option(s) -E,-T,--cut_ga,--cut_nc

Actually I would also go for bit-score rather than e-value then!

AstrobioMike commented 5 months ago

This hasn't come up enough yet to force me to find the time to implement some other way to use custom HMMs. So I'm closing this for now.

I'll leave a reminder/note here though that "gathering" cutoffs already work. So if folks want to use their own HMMs with GToTree, they just need to have GA cutoffs in the actual HMM files (like discussed in this comment) 👍