Open PlantHealth-Analytics opened 1 week ago
Dear Jose @PlantHealth-Analytics,
Thank you for your interest and your question. For a quick answer, please jump directly to 2) under 2. Differentiating genomes.
I will answer your question in two parts with more detail. First, I will describe the thresholding mechanism used to determine the best mapping position for a read. Second, I will explain our efforts related to differentiating/classifying closely related genomes.
weighted_sum
is a weighted summation of several metrics. Currently, we calculate:
1) The ratio of the mean score of all chains to the score of the best chain: https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L485
2) The ratio of the mean MAPQ of all chains to the MAPQ of the best chain: https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L484
3) The ratio of the MAPQ of the best chain to 30 (half of the maximum MAPQ, which is 60): https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L483
These ratios (metrics) are summed in a weighted manner to calculate weighted_sum
:
https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L487
For each metric, we assign a corresponding weight to make the thresholding decision (similar to a simple perceptron). The default values of these weights are not shown in the help message (since this is an advanced feature), but you can still modify them using the arguments in the comments here (default values are also provided here):
Please note that the weights should sum to 1 for the thresholding mechanism to work effectively.
1) The mapping procedure is still at the base level. It is computationally expensive as it requires identifying chains (a costly operation) based on seed matches. To differentiate genomes, it should be possible to perform cheaper computations (e.g., seed voting). I tried this but did not succeed on the first attempt. My approach was to differentiate distant genomes by assigning a read to the genome with the highest seed match count. However, more sophisticated filters/selection strategies are likely needed.
2) I focused on differentiating distant genomes, not closely related ones. I am not entirely sure how well RawHash2 performs in such cases. What we do know is that performing DTW alignment after mapping significantly improves the confidence of relative abundance estimation. Although DTW from our earlier RawAlign work has been integrated into RawHash, it needs further optimization for RawHash2. You may want to integrate alignment and evaluate its performance.
I am happy to help if you would like to explore this further.
Thanks, Can
Thank you for providing this tool! I am curious if there is an established cutoff or threshold for determining similarity during analysis. Specifically, have you tested how well the tool can discriminate between closely related genomes in real-time? For example, have there been evaluations using raw signals to differentiate such genomes effectively? Thanks. Jose C. Huguet-Tapia