cutoff or threshold for determining similarity

Dear Jose @PlantHealth-Analytics,

Thank you for your interest and your question. For a quick answer, please jump directly to 2) under 2. Differentiating genomes.

I will answer your question in two parts with more detail. First, I will describe the thresholding mechanism used to determine the best mapping position for a read. Second, I will explain our efforts related to differentiating/classifying closely related genomes.

Thresholding mechanism. RawHash2 identifies the best mapping position for a read by generating a weighted score based on several metrics from the best chain (i.e., the chain with the highest score). If the weighted score exceeds a certain threshold, the read is considered mapped to the position covered by the best chain:

https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L492-L497

weighted_sum is a weighted summation of several metrics. Currently, we calculate:

1) The ratio of the mean score of all chains to the score of the best chain: https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L485

2) The ratio of the mean MAPQ of all chains to the MAPQ of the best chain: https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L484

3) The ratio of the MAPQ of the best chain to 30 (half of the maximum MAPQ, which is 60): https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L483

These ratios (metrics) are summed in a weighted manner to calculate weighted_sum: https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/rmap.cpp#L487

For each metric, we assign a corresponding weight to make the thresholding decision (similar to a simple perceptron). The default values of these weights are not shown in the help message (since this is an advanced feature), but you can still modify them using the arguments in the comments here (default values are also provided here):

https://github.com/CMU-SAFARI/RawHash/blob/d3956eaf261588555fa884c5c294fd46df591851/src/roptions.c#L80-L82

Please note that the weights should sum to 1 for the thresholding mechanism to work effectively.

Differentiating genomes. We evaluated relative abundance estimation based on the genomes that reads map to. The Euclidean distance between RawHash2's estimations and the ground truth is the smallest compared to previous tools. However, there are two potential caveats:

1) The mapping procedure is still at the base level. It is computationally expensive as it requires identifying chains (a costly operation) based on seed matches. To differentiate genomes, it should be possible to perform cheaper computations (e.g., seed voting). I tried this but did not succeed on the first attempt. My approach was to differentiate distant genomes by assigning a read to the genome with the highest seed match count. However, more sophisticated filters/selection strategies are likely needed.

2) I focused on differentiating distant genomes, not closely related ones. I am not entirely sure how well RawHash2 performs in such cases. What we do know is that performing DTW alignment after mapping significantly improves the confidence of relative abundance estimation. Although DTW from our earlier RawAlign work has been integrated into RawHash, it needs further optimization for RawHash2. You may want to integrate alignment and evaluate its performance.

I am happy to help if you would like to explore this further.

Thanks, Can

CMU-SAFARI / RawHash

cutoff or threshold for determining similarity #13