KennthShang / CHERRY

Host prediction for phages
GNU General Public License v3.0
20 stars 8 forks source link

The question regarding the method of obtaining multiple host prediction results for a single viral isolate using CHERRY #8

Open actledge opened 10 months ago

actledge commented 10 months ago

I tryed to test CHERRY using my phage genomes, but encountered some issues. Firstly, in the paper of CHERRY, it is mentioned that the top [n] results with a score>0.9 can be used as predictions for "multiple hosts". However, when I specified --topk 5, I found that the cherry_prediction.tsv file in the output directory only contained the first result, but I found the specified number of outputs in under the [midfolder] directory. I am not sure if this is the correct file, so I would like to ask the author for the recommended method to obtain multiple host prediction results. cherry_prediction.csv

Additionally, Attached is midfolder/cherry_predcition.csv of my test sample, which is a phage isolated from Klebsiella pneumoniae. I'm not sure how to interpret these results because when I tried setting --topk 200, I found that the first 195 results had a score of 1 and the last 5 results had a score of 0.99. The last column showed TYPE=CRISPR. Does this mean that this phage has a perfect CRISPR match with the 195 listed hosts across different taxa, resulting in a score of 1? Does the last column, TYPE=CRISPR, indicate that all the predicted results are based on CRISPR?

Also, when the scores are the same, does it mean that the confidence level of these host prediction results is the same? Because I found that I could only find the host Klebsiella pneumoniae, which I determined through experiments, by setting a very large --topk value. Klebsiella pneumoniae also had a score = 1 but was ranked very low in this single result.

Therefore, my questions can be summarized as follows:

  1. Is top [n] results with score>0.9 in midfolder/cherry_prediction.csv the recommended way to view multiple host prediction results?
  2. How should I interpret the large number of results with score = 1 and the last column showing TYPE=CRISPR? Does it mean that all the predicted results are of the CRISPR type?
  3. If a score>0.9 is considered usable according to the paper, then in my sample, the lowest value among the top 200 results is 0.99. This suggests that there could be several hundred or even close to a thousand potential hosts with a score >0.9. Is this normal? If it is correct, how should I interpret it? Or is it a bug?
KennthShang commented 10 months ago

Hi there,

Thanks for using our tools.

However, first of all, please refer to the paper, we did not claim the program can predict the host range of the query phage, but the topk highest possible bacteria that the phage might infect. This two evaluations/statements are totally different. In all our experiments, the datasets only contains one-to-one relationship between virus and prokaryotes. There is no host range information so that the model do not know how to predict multiple host but predict the most likely prokaryote as host.

In the evaluation in section 3.5, we consider whether the topK predictions contain the host given by the dataset. And our conclusion is even if the scores of the real virus–host interactions are not the top-1 score, their scores are usually larger than 0.9. We did not claims user can use 0.9 as score to predict any host range of the given phages.

Back to your other questions:

  1. the TYPE only shows the top1 (the first score). So the TYPE=CRISPR means the first score is predicted by CRISPR.
  2. There are two situations here: 1) If the query phages is predicted by the CRISPR, the program will not go though the deep learning model part and directly output the results. If you select a large k, the reuslts of the k will be randomly generate to prevent any abnormal situation caused by the invalid parameters. 2) Also, because we only keep two decimal places, alghouth the score is slightly different but will show the same.

BTW, we are also trying to solve the host range prediction problem and made some progress. However, limited by the known host range dataset, we cannot perform large scale experiments. If you have any idea, feel free to let us know.

Best, Jiayu

actledge commented 10 months ago

Hi there,

Thanks for using our tools.

However, first of all, please refer to the paper, we did not claim the program can predict the host range of the query phage, but the topk highest possible bacteria that the phage might infect. This two evaluations/statements are totally different. In all our experiments, the datasets only contains one-to-one relationship between virus and prokaryotes. There is no host range information so that the model do not know how to predict multiple host but predict the most likely prokaryote as host.

In the evaluation in section 3.5, we consider whether the topK predictions contain the host given by the dataset. And our conclusion is even if the scores of the real virus–host interactions are not the top-1 score, their scores are usually larger than 0.9. We did not claims user can use 0.9 as score to predict any host range of the given phages.

Back to your other questions: 2. the TYPE only shows the top1 (the first score). So the TYPE=CRISPR means the first score is predicted by CRISPR. 3. There are two situations here: 1) If the query phages is predicted by the CRISPR, the program will not go though the deep learning model part and directly output the results. If you select a large k, the reuslts of the k will be randomly generate to prevent any abnormal situation caused by the invalid parameters. 2) Also, because we only keep two decimal places, alghouth the score is slightly different but will show the same.

BTW, we are also trying to solve the host range prediction problem and made some progress. However, limited by the known host range dataset, we cannot perform large scale experiments. If you have any idea, feel free to let us know.

Best, Jiayu

Hi,

Thank you for your response. Regarding the third point, I'm still a bit unclear. In fact, I initially tried --topk 5 and --topk 20 for this phage genome, but the top 5 and 20 results all had scores = 1 and did not include the host (Klebsiella pneumoniae) that I had verified. That's why I tried to increase the value of k to examine further. And concerning this, I didn't quite understand what you meant by "when k is large, the results will be randomly generated." Could you please provide a more specific explanation? What issues arise when k is relatively large compared to when it is small, in terms of the top k results?

Additionally, regarding the way host prediction works, my understanding of the principle might not be sufficient. I would like to consult with you because I recall some studies that determine whether a phage can infect a bacterium based on the matches of CRISPR. In other words, if the CRISPR of a phage matches with multiple hosts, it seems to indicate an interaction between this phage and multiple hosts. And theoretically, maybe I didn't get it right, is CRISPR a relatively reliable indication of such interactions? So, when we run CHERRY with a phage, is it possible for it to find the CRISPR of that phage in multiple prokaryote hosts? If CRISPR of a phage could exsits in multiple hosts, why can't it reliably represent the interactions between a phage and multiple hosts, even if the results are obtained from several one-to-one relationships?

More specifically, in the results of my tests, although the phage I isolated is from Klebsiella pneumoniae, CHERRY also found its CRISPR in E. coli. Should I consider that it is highly likely to infect E. coli and other hosts that have a matching CRISPR?

Thanks!

KennthShang commented 10 months ago

I mean, because CHERRY is designed for host prediction (the most likely host), If CHERRY finds one CRISPR prediction, it will stop running the model and output the CRISPR prediction. It will not care what other potential might be. So, in the logic of software design, because you ask CHERRY to output 100 predictions, it has to generate the reset 99 results. And this progress is like randomly choosing 99 prokaryotes (because it is stopped as mentioned above)

For your second question: sure, CHERRY could find multiple CRISPR results but we only output the best one in the current program. However, if you wish to output all the CRISPR results, we can set up a script for your need (as we already have several extension versions of CHERRY shown in the guidelines. We are glad to upgrade the program for users need.) If you are not in a hurry. I will get back to you on/before Friday (according to my schedule this week)

Best, Jiayu

actledge commented 10 months ago

I mean, because CHERRY is designed for host prediction (the most likely host), If CHERRY finds one CRISPR prediction, it will stop running the model and output the CRISPR prediction. It will not care what other potential might be. So, in the logic of software design, because you ask CHERRY to output 100 predictions, it has to generate the reset 99 results. And this progress is like randomly choosing 99 prokaryotes (because it is stopped as mentioned above)

For your second question: sure, CHERRY could find multiple CRISPR results but we only output the best one in the current program. However, if you wish to output all the CRISPR results, we can set up a script for your need (as we already have several extension versions of CHERRY shown in the guidelines. We are glad to upgrade the program for users need.) If you are not in a hurry. I will get back to you on/before Friday (according to my schedule this week)

Best, Jiayu

I understand. So, in other words, when CHERRY identifies a host based on the CRISPR prediction, there is a high probability that it has already found a correct host, and no further meaningful results will be generated after that.

Thank you again for your response, and while I am not in a hurry at the moment, I look forward to this update.

KennthShang commented 10 months ago

Hi there,

The host range prediction is available via https://github.com/KennthShang/CHERRY_crispr_multihost/tree/main

Feel free to have a try and hope it will help.

Best, Jiayu

actledge commented 10 months ago

Hi there,

The host range prediction is available via https://github.com/KennthShang/CHERRY_crispr_multihost/tree/main

Feel free to have a try and hope it will help.

Best, Jiayu

Hi,

I attempted to run cherry_crispr_multihost using the same test genome. Since the original version of CHERRY predicted the host as Escherichia coli with a CRISPR score of 1, I compared the results of CHERRY and CHERRY_crispr_multihost. However, I found that there were no CRISPR blast hits for E.coli in any of the CHERRY_crispr_multihost and original CHERRY results (I manually checked each Accession in the blast results and searched the Accession in the dataset/prokaryote.csv). The final result of cherry_crispr_multihost also did not contain the E.coli. Therefore, Is it a bug in CHERRY? as there are no CRISPR from a E.coli matches to this sample.

The attached files include the CRISPR blast results and the final prediction result cherry_prediction.csv from the original version of CHERRY. If you need any other files, please let me know.

crispr_out.tab.txt cherry_prediction.csv

Thanks!

KennthShang commented 10 months ago

Maybe you should provide your fasta file so that I can reproduce your problem

KennthShang commented 10 months ago

BTW, which script did you run when you said CHERRY. It seems that I did not find the script of this folder will output the 'TYPE' of the prediction. Not sure why.

actledge commented 10 months ago

Maybe you should provide your fasta file so that I can reproduce your problem

Attached file RCIP0001.fasta.txt is the genome fasta file.

I used Cherry_single.py script in the PhaBOX to output the above result.

RCIP0001.fasta.txt

actledge commented 10 months ago

BTW, which script did you run when you said CHERRY. It seems that I did not find the script of this folder will output the 'TYPE' of the prediction. Not sure why.

This is my command used: python3 Cherry_single.py --contigs reads_and_genomes/RCIP0001.fasta --threads 10 --rootpth test_CHERRY/RCIP0001 --out out/ --dbdir ~/software/PhaBOX/PhaBOX/database/ --parampth ~/software/PhaBOX/PhaBOX/parameters/

KennthShang commented 10 months ago

Thanks for your report! I found there is a low probability bug indeed. Since the program will generate a sorted score for potential need, there might be a low probability that the score of prediction by the model is equal to the score of prediction by CRISPR. But because the score is the same, the sort function will randomly choose one of them as the highest, but the program will still label it as CRISPRs (since the score is 1 (highest)).

I have updated the latest version to the PhaBOX folder (both the web server and the local version); it will only use the CRISPR prediction as the output ( if there is a CRISPR match).

And thanks again!

Best, Jiayu

actledge commented 9 months ago

Thanks for your report! I found there is a low probability bug indeed. Since the program will generate a sorted score for potential need, there might be a low probability that the score of prediction by the model is equal to the score of prediction by CRISPR. But because the score is the same, the sort function will randomly choose one of them as the highest, but the program will still label it as CRISPRs (since the score is 1 (highest)).

I have updated the latest version to the PhaBOX folder (both the web server and the local version); it will only use the CRISPR prediction as the output ( if there is a CRISPR match).

And thanks again!

Best, Jiayu

Uhh, there might be one more bug. The threshold setting for identity in lines around 270 and 313 of Cherry_single.py seems to be incorrect. The blast results show values like "100" or "90.5," but the condition for identity is set to ">0.95". This causes CRISPR matches that do not meet the criteria of 95% coverage and 95% identity to be included as results in my test genome RCIP0001 above.

Furthermore, I still have a question to inquire about based on this. After fixing these two lines of code and running it again using the same fasta file (uploaded above), the obtained prediction result is "E.coli score=1 Type=Predict". However, when I checked the accessions that align with the reference sequence in [rootdir]/midfolder/cherry_results.tab, although I haven't checked all of them, I manually searched several accessions and found them to be Klebsiella phages. In fact, the closest genome to this genome in NCBI is a Klebsiella phage with a similarity of approximately 82% (similarity = blast coverage x identity).

However, I understand that Cherry may rely on other features such as protein characteristics for prediction. So, my specific question is, based on this particular test genome fasta, what features and results does CHERRY use to predict the host as E.coli with a score of 1? Because, after all, this genome does not have annotated high-quality CRISPRs, and there are also related Klebsiella phage genomes found by [rootdir]/midfolder/cherry_results.tab.

Thanks for your responses!

YuanZeng

KennthShang commented 9 months ago

Good to know. I have revised them accordingly.

Based on the algorithm design, if the CRISPRs do not predict the sequence, CHERRY will use all the features (mentioned in the paper) for prediction. However, it is hard to say which one or two features lead the prediction depending on this model structure.

When you mention cherry_results.tab, yes, you might find some diamond blastp results showing the 'Klebsiella'. However, according to our previous experiments, phages from different families can share highly similar (even evalue=0) proteins. So, the blastp results cannot be directly used as a prediction. (Also, you do not need to check the file [rootdir]/midfolder/cherry_results.tab. There is the same file named blast_results.tab in your output folder.

In this situation, I think maybe the model just gave an incorrect prediction (as we report in the paper, only 78% accuracy on the independent test set and even lower on the real sequencing data depending on the input length). And the results of CHERRY might be affected by many things such as the imbalanced dataset (you know, E. coli always has the highest number of samples existing in the database).

BTW, we are working on upgrading CHERRY for better usage and higher accuracy, and we are also glad for your report. If you have further needs or suggestions, we will consider to add it in our CHERRY2.0 version.

Best, Jiayu