Questions regarding the proper use of phold

bhagavadgitadu22 commented 5 months ago

Thanks for the tool, it sounds very useful! I want to use it to annotate the viruses I find in my metagenomes, but I have a few questions about how to use it:

1) In the literature (for example, the article "Structure-guided discovery of anti-CRISPR and anti-phage defense proteins" from last month), a TM-score > 0.6 between an unknown protein and a known one is used to predict the function of the unknown one. The default thresholds in Phold are an e-value of 1e-3 and a sensitivity of 9.5 for Foldseek. How do the default thresholds of phold compare to this score? My guess is that they are less sensitive because the aim is to get true annotations rather than extreme novelty. Also, you use a stricter e-value cutoff for CARD hits and I am not sure why.

2) Phold makes great use of sequence and structure alignments to annotate as many proteins as possible. Do you feel that large language models might improve the results of Phold by providing at least the PHROG category of some unknown genes? The results obtained in "Large language models improve annotation of prokaryotic viral proteins" in 2023 sounded promising.

3) To further improve the annotations, I feel like using the colocalization of viral genes might work. PHROG incorporates a network of colocalized genes: do you think it might be leveraged to decide between several hits that would otherwise be equally likely?

4) Overlapping genes are not provided by any viral annotation tool I know. What I do so far is look for potential overlapping genes by running blastp searches within the viral genes to find potential additional genes. Would there be a way to look for and add likely overlapping genes to Phold output?

gbouras13 commented 5 months ago

Hi @bhagavadgitadu22,

These are awesome questions - and I will consider them (especially 1) as I write up a manuscript.

To answer one by one:

1. I don't know the relation between TM-score and these thresholds (yet); it is something I am benchmarking. The default thresholds are 1e-3 and 9.5 because my benchmarking has shown these correspond most closely to running Foldseek on structures with 1e-2 and 9.5, which were the thresholds benchmarked thoroughly in the Foldseek work.

My feeling is that phold is probably a bit more sensitive than a TM-score of 0.6 (perhaps with more false positives), but I can't be sure until I benchmark it. Regarding the stricter e-value for CARD, that is to reduce false positives for these types of genes, based on this paper (https://www.nature.com/articles/ismej201690). Also, I consider reducing false positives for AMR and virulence factor genes very important for phage therapy users, hence the stricter threshold.
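
To make the two-tier cutoff concrete, here is a minimal sketch of filtering hits by e-value with a stricter cutoff for AMR and virulence databases. The record layout, the database labels and the 1e-10 value are placeholders for illustration; they are not taken from phold's code or documentation, so check the phold docs for the real defaults.

```python
# Hypothetical hit records: (query_id, target_db, evalue).
# This is NOT phold's real output format, just an illustration.
DEFAULT_EVALUE = 1e-3          # default cutoff mentioned in this thread
STRICT_EVALUE = 1e-10          # placeholder value for the stricter cutoff (assumption)
STRICT_DBS = {"card", "vfdb"}  # assumed labels for AMR / virulence factor databases


def keep_hit(target_db, evalue, default=DEFAULT_EVALUE, strict=STRICT_EVALUE):
    """Return True if a hit passes the e-value cutoff for its database."""
    cutoff = strict if target_db.lower() in STRICT_DBS else default
    return evalue <= cutoff


hits = [
    ("gene_1", "phrog", 1e-5),   # passes the default cutoff
    ("gene_2", "card", 1e-5),    # fails the stricter cutoff
    ("gene_3", "card", 1e-20),   # passes the stricter cutoff
    ("gene_4", "phrog", 0.5),    # fails the default cutoff
]

kept = [hit for hit in hits if keep_hit(hit[1], hit[2])]
print([hit[0] for hit in kept])
```

The same hit (1e-5) is kept for a general database but dropped for CARD, which is the trade-off described above: fewer false positives for the genes where a wrong call matters most.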

2. Phold does make use of protein language models (pLMs). You can think about phold as:

i. generate pLM embedding for query protein per residue (ProstT5 encoder)
ii. 20-class CNN predicting the 3Di state for the residue from the embedding per residue (ProstT5 CNN head)
iii. Foldseek using these 3Di (plus AA) for each protein vs Phold DB
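
The three steps above can be sketched as a toy example. Everything here is a stand-in: random vectors replace the ProstT5 encoder, a random linear layer replaces the CNN head, and the function names are hypothetical. The only point is the shape of the data: one embedding per residue goes in, one of 20 3Di states per residue comes out, giving a 3Di string of the same length as the protein that a structural search can then use.

```python
import random

# Assumption: the 20 3Di states are written with the same 20 letters
# as the amino-acid alphabet, as in Foldseek's text output.
THREE_DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8  # toy size; real pLM embeddings are far larger


def toy_embed(protein, dim=EMBED_DIM, seed=0):
    """Stand-in for step i: one vector per residue (random, not a real pLM)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in protein]


def toy_weights(dim=EMBED_DIM, n_classes=20, seed=1):
    """Random linear layer standing in for the trained CNN head."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]


def predict_3di(embeddings, weights):
    """Stand-in for step ii: score 20 classes per residue, keep the best one."""
    states = []
    for vec in embeddings:
        scores = [sum(w * x for w, x in zip(row, vec)) for row in weights]
        best = max(range(len(scores)), key=scores.__getitem__)
        states.append(THREE_DI_ALPHABET[best])
    return "".join(states)


protein = "MKTAYIAKQR"
three_di = predict_3di(toy_embed(protein), toy_weights())
print(protein)   # amino-acid sequence
print(three_di)  # predicted 3Di string, same length; step iii would search with both
```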

There is potentially scope to use the embeddings in another way, but I think I'd go along the lines of comparing them directly (e.g. https://github.com/Rostlab/EAT) rather than constructing a 10-fold classifier on PHROGs like that paper, mostly because I like the extra specificity of annotation over the broad PHROG category. Unknown genes are hard!

3. Yes - my colleague Susie has developed Phynteny (https://github.com/susiegriggo/Phynteny), which does this. Check it out! I would recommend running it after Phold (it accepts Phold output as input).

4. This is a really tough question. Phold assumes gene prediction has already been done. Overlapping genes are really hard to detect, because any database-centric approach probably relies on gene calls that are not overlapping. If you have a solution for this, let me know and I'll add it to phold :)

George

bhagavadgitadu22 commented 5 months ago

Thank you for the detailed answers!

About 2 and 3. I agree categories are not sufficient; full annotations are very desirable. Phynteny sounds great for getting more categories.
About 4. I looked more into overlapping genes and I think it would be great to have an option to incorporate them. Many tools have been published to identify potential overlapping genes, but none is very sensitive, specific and easy to use. The most recent one seems to be https://github.com/chasewnelson/OLGenie. I will give some thought to how to incorporate that search into viral metagenomic annotation.

Follow-up questions about performance:

Concerning the use of Phold on metagenomes, I am worried about performance given the number of viruses I have (about 2000). Would it make sense to speed up the Phold workflow by dereplicating the genes before annotating them?
Do you have a suggestion for the amount of resources I need to run Phold on 2000 viruses? How many cores and how much memory would I need if I work on GPUs vs CPUs?
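
For reference, the dereplication idea could look something like the sketch below. It only collapses exact duplicate protein sequences and copies each representative's annotation back to the other members; the function names and the input layout (a dict of protein id to sequence) are made up for illustration. Whether this fits into a Phold run, and how much time it would save, is not something established in this thread.

```python
from collections import defaultdict


def dereplicate(proteins):
    """Collapse identical protein sequences.

    proteins: dict mapping protein id -> amino-acid sequence.
    Returns (representatives, clusters):
      representatives: dict rep id -> sequence (one per unique sequence)
      clusters: dict rep id -> list of all ids sharing that sequence
    """
    by_seq = defaultdict(list)
    for pid, seq in proteins.items():
        # Normalise case and drop a trailing stop symbol before comparing.
        by_seq[seq.upper().rstrip("*")].append(pid)

    representatives, clusters = {}, {}
    for seq, ids in by_seq.items():
        rep = ids[0]  # first id seen acts as the representative
        representatives[rep] = seq
        clusters[rep] = ids
    return representatives, clusters


def propagate(annotations, clusters):
    """Copy each representative's annotation to every member of its cluster."""
    return {
        member: annotations[rep]
        for rep, members in clusters.items()
        if rep in annotations
        for member in members
    }


proteins = {"v1_g1": "MKTAYIAK", "v2_g7": "MKTAYIAK*", "v3_g2": "MSLLNQ"}
reps, clusters = dereplicate(proteins)
annotations = {"v1_g1": "tail fiber protein", "v3_g2": "unknown function"}
print(propagate(annotations, clusters))
```

Collapsing near-identical (rather than exactly identical) proteins would need a clustering tool and a decision about how much divergence is acceptable before an annotation is transferred.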

gbouras13 commented 5 months ago

Thanks for the link @bhagavadgitadu22, please let me know how you go with the overlapping gene question.

To answer the performance Qs, Phold is approximately linear in terms of compute. So I would try 5 or 10 viruses first and see if you think you have the compute to scale from there. Without knowing your setup, if you have access to anything above a laptop, 2000 viruses should be doable.

Certainly split the run into 2 steps - phold predict with GPU and phold compare with CPU (as many threads as you can afford). Compare will likely take longer than predict, unless you have a really old/slow GPU.
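
A minimal sketch of the two-step split is below. The flags (`-i`, `-o`, `--predictions_dir`, `-t`) are written as I recall them from the phold README, so treat them as an assumption and confirm with `phold predict --help` and `phold compare --help`. The file and directory names are placeholders. The script only builds and prints the commands, so each one can be submitted to the right machine (GPU for predict, many CPU threads for compare).

```python
import shlex


def build_phold_commands(input_gbk, predictions_dir, output_dir, threads):
    """Build the two command lines; flags assumed from the phold README."""
    predict_cmd = ["phold", "predict", "-i", input_gbk, "-o", predictions_dir]
    compare_cmd = [
        "phold", "compare",
        "-i", input_gbk,
        "--predictions_dir", predictions_dir,
        "-o", output_dir,
        "-t", str(threads),
    ]
    return predict_cmd, compare_cmd


# Placeholder paths, not real files.
predict_cmd, compare_cmd = build_phold_commands(
    "viruses.gbk", "phold_predictions", "phold_output", threads=32
)
print(shlex.join(predict_cmd))  # step 1: run where a GPU is available
print(shlex.join(compare_cmd))  # step 2: run with as many CPU threads as possible

# To execute from Python instead of printing:
#   import subprocess
#   subprocess.run(predict_cmd, check=True)
#   subprocess.run(compare_cmd, check=True)
```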

To give you some context, on a machine with RTX4090 and Intel i9-13900 (32 threads), phold took 59 minutes on 249 phages with 22k CDS.
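
As a back-of-the-envelope check, the benchmark above can be scaled linearly to 2000 viruses. This assumes similar hardware and that runtime really is proportional to input size; "22k CDS" is taken as 22,000, and the function name is just for illustration. Metagenomic viruses may carry more or fewer genes per genome than these phages, so scaling by CDS count is probably the better guide if you know it.

```python
# Benchmark figures quoted in this thread (RTX4090 + i9-13900, 32 threads).
BENCH_MINUTES = 59
BENCH_GENOMES = 249
BENCH_CDS = 22_000  # "22k CDS", taken as approximate


def estimate_minutes(n_genomes=None, n_cds=None):
    """Linear extrapolation from the benchmark, by CDS count if given."""
    if n_cds is not None:
        return BENCH_MINUTES * n_cds / BENCH_CDS
    if n_genomes is not None:
        return BENCH_MINUTES * n_genomes / BENCH_GENOMES
    raise ValueError("provide n_genomes or n_cds")


minutes = estimate_minutes(n_genomes=2000)
print(f"{minutes:.0f} min (~{minutes / 60:.1f} h) for 2000 viruses")
```

That works out to a bit under 8 hours on comparable hardware, which is a rough guide rather than a guarantee.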

If you don't have a GPU, well, phold predict on CPU will take forever with 2000 viruses as it is single threaded. If you send them to me I could run it for you.

George

gbouras13 / phold

Questions regarding the proper use of phold #33