KennthShang / CHERRY_crispr_DATABASE

Database version of CHERRY CRISPR
1 stars 0 forks source link

Difference in the number of CHERRY_crispr_multihost and CHERRY prediction results #1

Closed trx296554555 closed 2 months ago

trx296554555 commented 5 months ago

Dear KennthShang,

I hope this email finds you well. I am reaching out to you regarding your open-source project and have a question regarding the difference in predicting the number of virus hosts using CHERRY_crispr_multihost.py and Cherry_single.py in PhaBOX.

I have noticed that when running CHERRY_crispr_multihost.py and Cherry_single.py with their default parameters, they provide different results. Specifically, CHERRY_crispr_multihost.py yields a lower number of host predictions whereas Cherry_single.py produces the opposite results. I only counted the number based on crispr.

Cherry_single.py Result
cut -f5 -d , host_prediction.csv |sort |uniq -c
    112 -
   2989 CRISPR
    170 Predict

CHERRY_crispr_multihost Result
$wc -l prediction.csv 
2566 prediction.csv

Additionally, I would like to inquire whether the values used in Cherry_multihost.py for "--ident" (the identity of the CRISPR alignments) with a default of 75, and "--coverage" (the coverage of the CRISPR alignments) with a default of 0.75, might be too low. Is it necessary for me to increase these threshold values, and if so, should I make adjustments in Cherry_single.py as well?

I truly appreciate your contributions to the open-source community, and I am very interested in your project. I would greatly appreciate any insights you can provide to help me better understand this discrepancy.

Thank you for your time and response.

Best regards, Robin

KennthShang commented 5 months ago

Hi there,

In the CHERRY main program, the thresholds are 95% coverage and identity. In the multihost version. Because user would like to predict more potential hosts for their phages. the threshold are smaller and more outputs will be remained.

Best, Jiayu

trx296554555 commented 5 months ago

Hi Jiayu,

Thank you for your response. If I understand correctly, the main CHERRY program uses thresholds of 95% coverage and identity. However, in the multihost version, the thresholds are smaller (such as 75%) to predict more potential hosts for phages, resulting in a higher number of outputs.

If my understanding is correct, does it mean that by using a stricter threshold of 95%, I obtained more results?

Best regards, Robin

KennthShang commented 5 months ago

First of all, there are two main differences between the two programs:

  1. The multi-host version will return multiple hosts (host range) for an input phage contig. However, the PhaBOX version will only return the one with the highest confidence.
  2. The multi-host version will only depend on a pre-built CRIPSRs database for prediction. However, the PhaBOX depends on a graph and will return 'evidence' based on the graph.

Then, back to your questions: If you are using the same program and adjusting thresholds and If a higher number of outputs refers to a higher prediction rate, I suppose a stricter threshold should return a lower predicted rate. However, comparing the two programs is hard to tell since they are of different designs.

BTW, the multi-host version currently is a beta test version and is not a formal part of the original paper. It is only designed for users wanting to predict the host range. If you only care about the expected host with the highest confidence, please use the PhaBOX version as the golden standard.

Best, Jiayu

trx296554555 commented 5 months ago

Hi, Thank you for your response. I appreciate the information provided. However, I still feel that my original question has not been fully addressed.

While I admit that I am not well-versed in the Graph Convolutional Encoder part, when it comes to the aspect of using CRISPR to establish the relationship between phages and hosts, I believe both CHERRY_crispr_multihost.py and Cherry_single.py (PhaBOX) utilize the same strategy. Specifically, they employ the 'blastn-short' to compare the pre-built CRISPR database against the potential virus sequence. This approach is widely accepted and commonly used for host prediction. Additionally, I have cross-checked the MD5 of the CRISPR database file and confirmed that they are indeed the same database.

Based on these factors, I would expect CHERRY_crispr_multihost.py and Cherry_single.py (PhaBOX) to yield similar results in this specific aspect, rather than the significant discrepancies I have observed. In this case, I have examined the alignment results of both programs, namely the 'alignment_result.tab' and 'tmp/cherry/crispr_out.tab' files, and they are completely consistent.

> alignment_result.tab 
C_Abbottina_obtusirostris=1 34.3097640|GCF_002866885.2| 4.43e-07    94.737  38  43
C_Abbottina_obtusirostris=1 30.3097386|GCF_002866885.2| 4.43e-07    100.000 34  43
C_Abbottina_obtusirostris=1 6.424270|GCF_008362895.1|   1.33e-06    100.000 33  33
C_Abbottina_obtusirostris=1 64.3944489|GCF_006243415.1| 3.98e-06    100.000 32  32
C_Abbottina_obtusirostris=1 43.1302212|GCF_002803925.1| 3.98e-06    100.000 32  32
C_Abbottina_obtusirostris=1 29.1301357|GCF_002803925.1| 3.98e-06    100.000 32  32
C_Abbottina_obtusirostris=1 91.320081|GCF_000819725.1|  3.98e-06    100.000 32  32
C_Abbottina_obtusirostris=1 24.1645998|GCF_000801015.1| 3.98e-06    97.059  34  34
C_Abbottina_obtusirostris=1 70.3944849|GCF_006243415.1| 3.98e-06    100.000 32  32

> tmp/cherry/crispr_out.tab
PhaGCN_0    34.3097640|GCF_002866885.2| 4.43e-07    94.737  38  43
PhaGCN_0    30.3097386|GCF_002866885.2| 4.43e-07    100.000 34  43
PhaGCN_0    6.424270|GCF_008362895.1|   1.33e-06    100.000 33  33
PhaGCN_0    64.3944489|GCF_006243415.1| 3.98e-06    100.000 32  32
PhaGCN_0    43.1302212|GCF_002803925.1| 3.98e-06    100.000 32  32
PhaGCN_0    29.1301357|GCF_002803925.1| 3.98e-06    100.000 32  32
PhaGCN_0    91.320081|GCF_000819725.1|  3.98e-06    100.000 32  32
PhaGCN_0    24.1645998|GCF_000801015.1| 3.98e-06    97.059  34  34
PhaGCN_0    70.3944849|GCF_006243415.1| 3.98e-06    100.000 32  32

Upon further investigation of the code, I have identified the true cause of the discrepancy. It appears that during the filtering of the alignment results:

CHERRY_crispr_multihost.py considers both coverage and identity (using an 'AND' condition)https://github.com/KennthShang/CHERRY_crispr_multihost/blob/f6316140ad4fbf09cd5650bbaad5931f8ff3df87/Cherry_multihost.py#L76C9-L76C59, whereas Cherry_single.py only requires either one to be satisfied (using an 'OR' condition)https://github.com/KennthShang/PhaBOX/blob/852454bbc5e9abebe6f451310087bb4ed4c311b2/Cherry_single.py#L314C12-L314C49.

This discrepancy explains why Cherry_single.py predicts more hosts of the CRISPR type. After modifying the condition in Cherry_single.py (OR to AND) and re-running the program, I obtained consistent results with CHERRY_crispr_multihost.py in the CRISPR section, though Cherry_single.py chose the one with the highest confidence. (Both use 0.95 coverage and 95 identity)

Cherry_single.py RES
$cut -f 5 -d , testout2/cherry_prediction.csv|sort |uniq -c
    833 -
    603 CRISPR
   1835 Predict
      1 Type

CHERRY_crispr_multihost.py RES
$wc -l prediction.csv 
604 prediction.csv

I am unsure whether changing the 'OR' condition to an 'AND' condition is necessary, but it is worth noting that other filtering criteria in Cherry_single.py also employ the 'AND' condition. https://github.com/KennthShang/PhaBOX/blob/852454bbc5e9abebe6f451310087bb4ed4c311b2/Cherry_single.py#L270

Thank you for your attention to this matter.

Best regards, Robin

KennthShang commented 5 months ago

As I mentioned, all the experiments in the paper are based on the same criteria in the main program. Thus, it is not recommended to modify the main program of CHERRY. But you can do it at your own risk.

if you want to use the multi-host version, you can also modify the program independently. The logic of this extension version is provided by a colleague.

Line 270 in the main program is used to find the seen phages from the database, which is patched after the paper is released. Thus, we choose a strict threshold.

Best, Jiayu