Creating HMM profiles - Githubissues

tenguzame commented 5 years ago

Greetings, I'd like to use this piece of software on a custom HMM profile (pVOGs for viruses). I tried to simply copy the .hmm profile into the hmm_sets directory, but the program doesn't recognize any genes within the collection. What did I do wrong? Best regards

AstrobioMike commented 5 years ago

Hi there!

Thanks so much for posting this. Not you at all, all me. This was a consequence of my relying on accessions from hmm profiles (denoted as "ACC" in the header material for a particular profile), whereas some might not have an accession. This didn't come up in my testing, nor was it something i was thinking about (but should have been!).

I'm working on fixing this and will let you know as soon as it's taken care of.

Thanks again for posting the issue!

AstrobioMike commented 5 years ago

So from v1.1.8 on, this won't be a problem anymore. Just to note in case it helps you, or anyone else who sees this if you were already on top of it, peeking at the pVOG hmms it seems they all come packaged individually. Meaning if i download the Corticoviridae hmm files from this page, which has 5 hmm profiles in it, it comes in a tar.gz with all 5 hmms separate. To work with GToTree these will need to be concatenated together into one file. For example:

# downloading 
wget http://dmk-brain.ecn.uiowa.edu/VOG/downloads/Corticoviridae/CorticoviridaevogHMMprofiles.tar.gz

# decompressing/unpacking
tar -xzvf CorticoviridaevogHMMprofiles.tar.gz

# catting together
cat CorticoviridaevogHMMprofiles/*.hmm > Corticoviridae.hmm

Then that combined file (here called "Corticoviridae.hmm") can be moved to the hmm_sets directory of GToTree, or pointed to directly in the program call. If moved to the hmm_sets directory, the gtt-hmms program should report it properly:

gtt-hmms

But also, just to note I love the idea of viral phylogenomics, but i also have no idea about how it works. By default, GToTree doesn't keep genes if there are more than one hit found in a genome. This may not be appropriate for doing it with viruses, i don't know. But if you want, you can change the behavior to keep genes from genomes with multiple hits and use the best hit in the alignment by adding the -B flag. Also, of course ignore all the % completion/redundancy estimations and messages GToTree spits out, they are mostly guided by prokaryotic genomes. I don't personally know if GToTree is appropriate for viral work like this as i've never done any of it before. Please let me know if you find that this is or isn't suitable for your viral needs!

And thanks again for reporting the problem!

drish91 commented 4 years ago

Hi Mike,

On the same lines as this question, is there a minimum gene requirement for GToTree to consider a custom hmm? I'm trying to build a gene-specific phylogeny for a few Shewanella genomes based on amylase, which was extracted from the reference Shewanella ISTPL2. I built a custom hmm for this purpose and scanned for it across all genomes (ran hmmscan independent of the pipeline because gtotree errored out since it couldn't find any SC hits). Hmmscan was able to detect the presence of the gene in my genomes, though not to 100% identity. Just to be sure, I also tried blasting the query amylase sequence I used to build the hmm, against its reference genome (which I included in the gtotree run and should have been captured), and that worked as expected as well.

My question here is: is the hmmscan in the pipeline more stringent than usual? If so, is there a way to tweak it on the user end, or is it hard-coded in the pipeline?

I've attached the error log below:

`############################################################################## ##############################################################################

GToTree is exiting without completing :(

############################################################################## ##############################################################################

** REASON FOR TERMINATION ** After filtering out genes that had either 0 hits in any genome OR only multiple hits, no genes remained. This typically shouldn't happen unless maybe there were very few genes being targeted, or very few genomes. You can consider running GToTree in "best-hit" mode (by providing the '-B' flag with no arguments), which will retain genes from genomes with multiple hits - but keep in mind that is less conservative. This is also a good time to make sure nothing weird is going on.

You can see the number of target-gene hits per genome by checking out
"test_shewanella/SCG_hit_counts.tsv".

`

AstrobioMike commented 4 years ago

Hi there, @drish91,

Thanks for writing in :)

There may a couple things going on here I want to look into. I will be able to dig in this evening.

Could you possibly send me some things so I can make sure I’m working on the same problem? If you could, please send me at Michael.Lee0517@gmail.com: 1) a fasta of the amylase sequence(s) you used to make the hmm; and 2) the custom hmm you built (and the exact command you used to build it if you have it handy)

I’m also curious, what happens if you run the GToTree command with the -B flag?

And what did you mean by “though not to 100% identity” here:

Hmmscan was able to detect the presence of the gene in my genomes, though not to 100% identity.

Thanks!

drish91 commented 4 years ago

Wow thanks for the prompt response, Mike! Yes, I've tried running the pipeline using -B, hoping it would solve the issue. I've attached the log for it (forgot to do that earlier, sorry about that!) gtotree-runlog.txt

What I meant by that was that looking at the query coverage vs the target length, I could tell it was not 100% coverage. Please also see attached hmmscan result.
amylase_hmmscan.txt

I also just sent an email with the amylase fasta, the sequence alignments and the custom hmm.

-Drishti

AstrobioMike commented 4 years ago

Hey there!

So I think I see what's going on :)

GToTree works with amino acid sequences ultimately, it doesn't support searching for/aligning/or treeing nucleotide sequences. I made it more for large-scale phylogenomic-type things rather than higher resolution of specific genes. While it accepts inputs in both forms, it is identifying coding sequences in anything provided as nucleotides and pulling the translated amino acid sequences for the hmm searching and the rest. I'm realizing now that despite my best efforts at good documentation, this isn't actually explicitly stated in any one place! haha, sorry for the confusion there

So that was the heart of the problem with it not finding anything. It was trying to scan amino acid sequences against the nucleotide HMM you were giving it. I'm going to have to add a check for this whenever custom hmms are provided so there'd be an actually useful error message given. It's on my list now, so thank you for writing in about this!

Couple more points to hit on:

Moving forward with some other main things about your goal though, even working with a custom built amino acid file, you made me also realize i forgot to integrate a way to run the search if there are no "GA" score cutoffs stored in the HMM profile. GA score cutoffs are "gathering" cutoffs that come stored in lots of large database HMM profiles, like those that come from Pfam. This can be seen at the top of the HMM file, like so:

This is a manually curated score, as noted in this article in the "Thresholds and compatibility" section:

The gathering threshold is the bit score threshold that a sequence must match or exceed in order for it to be deemed significant and therefore belonging to that family. In Pfam, we manually define our thresholds such that no known false positives are permitted in any family.

Because GToTree is usually working with multiple hmms at a time, for people to be able to set individual score cutoffs for each one will take me adding in an option for a new input file that can hold each hmm and the specific cutoff score for each one in it, and then pull that info when it comes time to searching. That's gonna take me a bit to write and implement, but that's on my list now too thanks to you :)

However, you can do this now still with GToTree using amino acid sequences if you'd like, without me needing to implement anything. It's going to be the case that when working with a single gene from closely related organisms, you would get better resolution with the nucleotide sequences like you were trying to do, but here's how it could be done with GToTree.

We'd just need to add a GA cutoff score to the hmm file. I don't know what that cutoff should be, as it's different for every HMM, but i did a quick few things just to come up with something reasonable in order to test this:

I translated the 3 genes you sent me into their amino acid sequences (just used this site cause there were just a few), and saved them in a file called "amylase.faa"
aligned them with muscle muscle -in amylase.faa -out amylase-aligned.faa
built the hmm hmmbuild --amino -n amylase amylase-amino.hmm amylase-aligned.faa

did a protein blast of the first sequence at ncbi in order to find the Shewanella genome with the closest blast hit, then grabbed all the protein sequences from it

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/519/065/GCF_000519065.1_ASM51906v1/GCF_000519065.1_ASM51906v1_protein.faa.gz 
gunzip GCF_000519065.1_ASM51906v1_protein.faa.gz

did an hmmsearch of that genome's genes against this hmm profile built from the 3 genes (this is to see what the hits look like, e.g. how many, what their scores look like, what their annotations are) hmmsearch --tblout amylase-hmm-out.tsv amylase-amino.hmm GCF_000519065.1_ASM51906v1_protein.faa
looked at the hit results (in the amylase-hmm-out.tsv file), there were 5 hits, with the "full sequence" scores 729.2, 548.8, 125.2, 46.1, and 23.2. Looking at that and the associated annotations at the right, i decided to use 540 as the GA cutoff value (the minimum whole-sequence score that would be kept):
then i manually added that line to the hmm profile, just following how it looked in one of the pfam-sourced hmm files
now if we run that above hmmsearch command but adding the flag to use the GA cutoff, we only get those 2 hits that were above that score hmmsearch --cut_ga --tblout amylase-hmm-with-GA-cutoff-out.tsv amylase-amino.hmm GCF_000519065.1_ASM51906v1_protein.faa

So now that hmm profile can be used with GToTree like so for example: GToTree -a ncbi_accessions.txt -H amylase-amino.hmm -o amylase-gtotree-run -B -G 1

And you will get at the end a single-gene amino-acid tree of the best hits found in each genome that had the gene detected. I'll note again there's nothing magical about that 540 number i chose as the GA cutoff, and it might be way too high for the things you'd want to catch. But that is the gist of what I would do if i were designing a new HMM, i would basically do that on a larger scale and empirically come up with a value that seems to catch what i know are the correct genes/what i'm interested in, but doesn't keep what i'm not. And as mentioned, doing this with amino acids might be fine depending on the diversity of the gene in the organisms you are looking at, but doing it with the nucleotide sequences like you were trying to do would give greater resolution if they are too closely related at the amino-acid level. GToTree just wasn't really built to do things at that level and isn't currently equipped for it :/

Last thing I want to note is that it is kind of tricky trying to make a phylogenetic tree with a gene that exists in multiple copies within genomes. One of the built-in assumptions in phylogenetics in general is that the genes being compared are under similar evolutionary pressures, which is tenuous to begin with even with single-copy genes, but gets a little worse when there are paralogs (multiple copies) within the same genome. It's certainly not "wrong" to do or anything, and there can be many good reasons for wanting to tree such genes individually and look at them more closely. But just wanted to mention that's something to be aware of. GToTree right now is just picking the one with the best score (when run with -B for best-hit mode).

Hope that helps clear some of the confusion up! Thanks again for helping me find more areas to improve GToTree :)

camilogarciabotero commented 4 years ago

Hi there Mike,

I wanted to build an .hmm for Bacillus subtilis, so I found the PhyloFacts page from the Berkeley phylogenomics group. I downloaded the specific data for Bacillus subtilis on the section:

Downloads for PhyloFacts families targeting specific species

This contains ~10312 .hmm files that I could concatenate on a single one. However, there are two problems I got at this point.

The number of CDS genes present on Bacillus subtilis genome is ~4100. Then the .hmm file is highly probable to display copies from the genes or might not be curated though.
After reading the previous comments on the present Issue, I wanted to inspect the hmm profiles I concatenated and found that they don't display the GA score.

So I was wondering whether you can give me some suggestions on this problem.

AstrobioMike commented 4 years ago

Hi there, @camilogarciabotero!

Happy to try to help, and improve GToTree if possible/practical :)

I need some more help getting on the same page first though. Looking at that site real quick (I’m not familiar with it yet, thanks for sharing), I don’t see exactly what that’s supposed to be. +10,000 HMMs makes me think it might be based off the pangenome of B. subtilis? Is that the case?

GToTree is good for making a phylogenomic tree based on genes that are conserved across all the genomes being considered (meaning present in mostly all of them), so I’m not sure I follow yet about why we might want to be searching for more things than are present in any one typical genome.

But happy to help sort this out, I may just be confused :)

camilogarciabotero commented 4 years ago

Hey @AstrobioMike,

I'm sorry for the confusion... As I mentioned before I want to build an .hmm profile for Bacillus subtilis with all the CDS genes it could have. So, on the page I found (PhyloFacts) there is a link to download Bacillus subtilis Alignments, trees and HMMS files:

Screen Shot 2020-01-30 at 9 23 25 PM

Given that this link showed a specific HMMS for Bacillus subtilis I decided to explore it and see if I could add a Bacillus_subtilis.hmm to the get-hmms of my local GToTree dir in order to reconstruct a phylogenomic tree using as many CDS genes from Bacillus subtilis genome (which it has been suggested that contains ~4100 CDS genes - Not sure how many SCG there are, though). When I opened the file It actually showed many files in different formats (.newick, .hmm, etc.) I counted the .hmm files on the directory and they summed up to ~10000 files. And after inspecting on the .hmm files I was concerned about what I numbered before:

The number of CDS genes present on Bacillus subtilis genome is ~4100. Then, some of the ~10000 .hmm files in the directory from the page might be copies... Or I really don't know why there are so many .hmm files 😅.

After reading the previous comments on the present Issue, I wanted to inspect the hmm profiles and found that they don't display the GA score. and I think that making the manually curation you did with the amylase gene could be hard on more than 10000 files...

Now, If you could help me on this issue (hopefully I was more clear on state it) it would be very cool.

Thank you for quick response, by the way!

AstrobioMike commented 4 years ago

Hi there, Camilo,

Sorry for the slow response this time :)

So there are a couple things going on here. When GToTree has clade-specific marker sets, they are specifically for conserved, single-copy genes across that clade (more details on that in numbers 1-4 under the table here). That's different than these HMM sets that are available at PhyloFacts. It seems those are HMMs of all CDSs found in all Bacillus subtilis incorporated reference genomes (as you've pointed out). But we most often wouldn't want to use all CDSs when trying to make a phylogenomic tree, because one of the assumptions built into phylogenetics is that when considering the same gene from multiple organisms, it is presumed be under similar evolutionary pressures in those organisms. This is already a pretty hard thing to ever know for certain, and is probably never actually true, but to be safe, it is best to use genes that are present only in a single copy across the genomes being considered, and not use any that are in multiple copies within individual genomes.

So a GToTree HMM set specifically designed for B. subtilis wouldn't include all genes found in all B. subtilis, but would scan all B. subtilis genomes for genes present in single-copy in almost all of them. I hope i'm not being more confusing than helpful here, ha

I put up all the code for how i did this with all bacteria for GToTree here, to serve as an example of how I made all of them. That code is specifically for bacteria, but it is the same process for whatever the target clade is. Maybe that can help you if you want to build one specific for B. subtilis that is based on single-copy genes.

To your second question, as for setting the GA cutoff value, I don't know of a good way to automate that for tons of HMMs. There needs to be something for filtering out spurious hits, and even if using something else like e-values to handle that, we would still really need to figure out what is the appropriate evalue cut off for each target HMM i think, and it would still take some manual figuring.

Last thing I'll note, since you're focusing on a single species, is that if all of the things you're looking at are extremely closely related, amino-acid alignments of conserved single-copy genes might not provide enough resolution to separate them out. GToTree is great for a wide-range of things all the way from really broad-level, like a tree of life, all the way down to single-genus or even single-species trees (depending on how similar the input genomes are). But if you are finding that the level of resolution provided by amino acid alignments of a few hundred single-copy genes is not enough to separate out the genomes you are looking at because they are just too closely related, you might want to consider building a SNP tree with something like snippy or parsnp that does whole-genome core nucleotide alignments and then can make trees of the varying positions.

camilogarciabotero commented 4 years ago

Hi Mike,

Thank you very much for your thoroughly response, tool recommendation and almost a complete lesson of biology.... I definitely will follow your suggestions on using a more precise and appropriate tool for my problem like those you mentioned above. Hopefully, will use GtoTree later and can't wait to test its updates and new features.

AstrobioMike / GToTree

Creating HMM profiles #3

GToTree is exiting without completing :(