CMU-SAFARI / RawAlign

RawAlign is a real-time raw nanopore read mapper based on the Seed-Filter-Align paradigm as described by Lindegger et al. (https://arxiv.org/abs/2310.05037)
GNU General Public License v3.0

Difficulty aligning raw signals with Kit14 chemistry #3

Open Cuypers-Wim opened 5 days ago

Cuypers-Wim commented 5 days ago

Dear RawAlign Team,

I am facing challenges in aligning raw signals generated using Kit14 chemistry. I ensured that the correct pore model was specified during the indexing process with the following command:

rawalign \
-d PlasmoDB-58_Pfalciparum3D7_Genome.ind \
-p /extern/local_kmer_models/r10_180mv_450bps_9mer/template_r10_9mer.model -t 32 PlasmoDB-58_Pfalciparum3D7_Genome.fasta

Then, I executed this command:

rawalign --dtw-evaluate-chains \
-t 32 -x sensitive PlasmoDB-58_Pfalciparum3D7_Genome.ind *.fast5 > mapping_plasmo.paf

However, I am encountering very low mapping rates: only 30% of reads from my Plasmodium dataset and just 1% from a virus dataset (both in-house datasets) align.

In contrast, when aligning subsets of R9 data from your pre-print included in the repository, at least 80% of reads map, which seems satisfactory (considering not all reads will match the reference genome).

I consider a read ‘unaligned’ if the line for that read in the PAF file contains only '*'.
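To make that criterion reproducible, the mapping rate can be computed directly from the PAF output. The following is a minimal sketch, not part of RawAlign: it assumes that an unmapped read is written with '*' in the sixth tab-separated column (the target name field of the PAF format), and that a read may appear on more than one line, in which case it counts as mapped if any of its lines names a target. The function name `mapping_rate` is hypothetical.

```python
from typing import Iterable, Tuple


def mapping_rate(paf_lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (mapped_reads, total_reads, fraction_mapped) for PAF lines.

    Assumption: unmapped reads carry '*' in column 6 (target name).
    """
    mapped = {}  # read id -> True if at least one line names a target
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        read_id = fields[0]
        # Treat lines that are too short to hold a target name as unmapped.
        target = fields[5] if len(fields) > 5 else "*"
        mapped[read_id] = mapped.get(read_id, False) or target != "*"
    total = len(mapped)
    n_mapped = sum(mapped.values())
    rate = n_mapped / total if total else 0.0
    return n_mapped, total, rate
```

For the output file above, this could be called as `mapping_rate(open("mapping_plasmo.paf"))`. Counting per read rather than per line avoids inflating the rate when a read has several candidate mappings.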

Is it possible that I am doing something wrong with the commands above? In case you are able to examine my data (https://drive.google.com/drive/folders/1bRj_gOfOACqkQADOoJ6y5tAY-wJDAtqW?usp=sharing), I have shared a subset of our in-house Plasmodium dataset (from our publication: https://journals.asm.org/doi/full/10.1128/mbio.01967-23). The folder contains the RawAlign index file of the Plasmodium reference genome, the output PAF files, and some reads in FAST5 format. The reads were originally POD5 files that I converted to FAST5 using ONT’s POD5 toolkit (https://github.com/nanoporetech/pod5-file-format):

pod5 convert to_fast5 "$pod5_files" --output pod5_to_fast5/

It would be extremely helpful if you could help me determine if the issue lies with the commands, or rather the datasets. Additionally, do you have access to any reference dataset known to work well with the latest nanopore chemistry that I could use for comparison?

Thank you for your assistance!

Best regards,

Wim

joellindegger commented 2 days ago

Hi Wim,

We developed RawAlign and optimized its default parameters on R9.4 data. Other pore models require parameter sweeps; in particular, the match bonus and minimum score thresholds likely need to be chosen differently. Copying from the documentation:

--dtw-match-bonus FLOAT     | DTW bonus score per aligned read event (default: 0.4)
--dtw-min-score FLOAT       | DTW minimum alignment score for a candidate to be considered mapped (default: 20.0)
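A sweep over these two options could be scripted along the following lines. This is only a sketch: it builds the command lines as strings rather than running them, it reuses the mapping command from the original post, and it assumes the two options can simply be added to that command. The grid values, the helper name `sweep_commands`, and the output file naming are illustrative choices, not recommended settings.

```python
import itertools
import shlex


def sweep_commands(index, reads_glob, bonuses, min_scores, threads=32):
    """Build one rawalign command string per (match bonus, min score) pair."""
    cmds = []
    for bonus, min_score in itertools.product(bonuses, min_scores):
        # Hypothetical naming scheme so each run writes its own PAF file.
        out = f"sweep_bonus{bonus}_min{min_score}.paf"
        cmds.append(
            "rawalign --dtw-evaluate-chains "
            f"--dtw-match-bonus {bonus} --dtw-min-score {min_score} "
            f"-t {threads} -x sensitive {shlex.quote(index)} {reads_glob} "
            f"> {shlex.quote(out)}"
        )
    return cmds


# Example grid around the documented defaults (0.4 and 20.0); values are arbitrary.
for cmd in sweep_commands(
    "PlasmoDB-58_Pfalciparum3D7_Genome.ind",
    "*.fast5",
    bonuses=[0.2, 0.4, 0.6],
    min_scores=[10.0, 20.0, 30.0],
):
    print(cmd)
```

The printed commands can be run in a shell, and the resulting PAF files compared by the fraction of reads that map.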

RawHash2 is better optimized for R10 data, and since it now includes most of RawAlign's options, in addition to several improvements to the seeding and chaining stages, we recommend using it for R10 data.

Best, Joel