CMU-SAFARI / RawHash

RawHash is the first mechanism that can accurately and efficiently map raw nanopore signals to large reference genomes (e.g., a human reference genome) in real time without using powerful computational resources (e.g., GPUs). Described by Firtina et al. (published at https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440).
GNU General Public License v3.0

Is RawHash fast enough for cDNA enrichment? #2

Open andreaswallberg opened 8 months ago

andreaswallberg commented 8 months ago

Dear developers,

This tool looks super interesting! I wonder if you have tried it coupled with Read Until functionality for cDNA or other "short" long reads (e.g. 1-2 kbp).

If not, do you think it has the potential to be able to tell whether a read is on or off target against a relatively small database of sequences (e.g. transcriptome or a panel of selected genes) already in the first 100-200 bases?

canfirtina commented 6 months ago

Dear @andreaswallberg,

Thank you for your interest. In recent weeks, we have been working hard to add new features, and we would be glad to discuss what we can improve to better support cDNA. However, we have not specifically tested RawHash or RawHash2 (a newer version) with cDNA data.

Regarding 'short' long reads, we have used a dataset, D1, which consists of SARS-CoV-2 sequences. The average read length in this dataset is about 430 bases, so these could be considered 'short' long reads (even shorter than the range you mentioned). In our paper, we describe using the D1 dataset alongside the D5 dataset (human genome, average read length 6k bases) for on-/off-target analysis, focusing on contamination analysis. The results show that RawHash2 achieves about 94% precision and 85% recall in this context. From this, we believe RawHash is capable of effectively identifying on-target and off-target reads in scenarios where on-target reads are very short.
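
For readers less familiar with these metrics, here is a minimal sketch of how precision and recall are computed for an on-/off-target decision. This is not RawHash code: the function name `precision_recall` is hypothetical, and the read counts are made up purely so that the output lands near the figures quoted above. They are not taken from the paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) for on-target read classification.

    tp: on-target reads correctly called on-target
    fp: off-target reads wrongly called on-target
    fn: on-target reads that were missed (called off-target or unmapped)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (assumed, not measured):
# 1000 truly on-target reads, 850 of them recovered, 54 false calls.
p, r = precision_recall(tp=850, fp=54, fn=150)
print(f"precision={p:.2f} recall={r:.2f}")
```

In an enrichment setting, precision tells you how many of the reads you keep are really on target, while recall tells you how many on-target reads you would lose by ejecting too eagerly.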

We currently do not have a cDNA dataset in our evaluation set. We would be interested in evaluating such a dataset to better tailor RawHash for cDNA applications. If you have any suggestions or feedback, like recommending a dataset that includes signal files and basecalled reads for accurate ground truth mapping, and specifics on the analysis you wish to conduct with RawHash, we would welcome your input.

Best, Can

andreaswallberg commented 6 months ago

Hi @canfirtina!

Sounds good. I can provide such data. I will contact you later this week.

Best regards, Andreas