LooseLab / Icarust

A fully featured MinKNOW simulator for testing read until experiments.
Mozilla Public License 2.0

basecalling simulated reads #26

Open capoony opened 2 months ago

capoony commented 2 months ago

Dear all,

fantastic software, many thanks!

I have a very naive question. After simulating reads for AmpliconSeq data, I would like to basecall the reads with guppy and then use these FASTQ reads to benchmark my downstream pipeline. Unfortunately, I do not get a single read back that passed the quality filtering. What are the correct configs for guppy? I fear I am using the wrong settings here:

######## load dependencies #######

module load ONT/guppy_6.2.1_gpu

######## run analyses #######

guppy_basecaller \
--input_path data_${name}/fast5_pass \
--config dna_r10.4.1_e8.2_400bps_sup.cfg \
--compress_fastq \
--save_path SUP \
-x "cuda:0"

Please find below also the first few lines of the summary file.

filename    read_id run_id  batch_id    channel mux start_time  duration    num_events  passes_filtering    template_start  num_events_template template_duration   sequence_length_template    mean_qscore_template    strand_score_template   median_template mad_template    scaling_median_template scaling_mad_template
Flow1_pass_c97995_1.fast5   002206d6-3aae-48ed-ad5a-2bf5528ba046    c979953ae6ed4445ad3de5a23d1f2a4c    0   2845    1   8.000000    1.643500    1314    FALSE   8.000000    1314    1.643500    659 5.047206    2.888203    101.321495  21.638645   101.321495  21.638645
Flow1_pass_c97995_1.fast5   004c33d0-d1d2-48f9-b0f8-0358f7d187cc    c979953ae6ed4445ad3de5a23d1f2a4c    0   1940    1   4.000000    1.365250    1092    FALSE   4.000000    1092    1.365250    484 7.027078    2.824660    100.736664  21.931059   100.736664  21.931059
Flow1_pass_c97995_1.fast5   007fb85d-538c-4f4d-8f2a-37c3e44fcfb8    c979953ae6ed4445ad3de5a23d1f2a4c    0   1855    1   8.000000    1.842250    1473    FALSE   8.000000    1473    1.842250    781 4.801814    2.935820    100.151840  22.369680   100.151840  22.369680

Thanks a lot,

Martin

mattloose commented 2 months ago

Hi,

I'm fairly certain that @Adoni5 will chime in with some comments here, but Icarust is not going to give you high-quality basecalled data. The simulations are not precise; they are merely good enough to get back mappable data. As a consequence, the quality filter is best ignored. I certainly wouldn't use the sup model unless you really want to. It's not going to improve your data, as the signals are not simulated in a way that we would expect them to perform well.

Icarust is designed to test adaptive sampling workflows and real-time analysis, but you should not trust the qualities of the base calls.

I hope that helps.

Matt
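
As a rough illustration of the point above, the summary file can be used to check how far the simulated reads fall below the quality filter instead of relying on the pass/fail split. The sketch below is not part of Icarust or guppy: `summarise_qscores` is a hypothetical helper name, and it assumes the summary is tab-separated with a header row containing the `passes_filtering` and `mean_qscore_template` columns shown in the excerpt earlier in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of guppy or Icarust): summarise a
# sequencing summary file. Assumes a tab-separated file whose header row
# names the columns passes_filtering and mean_qscore_template.
summarise_qscores() {
  awk -F'\t' '
    NR == 1 {
      # Map header names to column indices so column order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      n++
      if ($(col["passes_filtering"]) == "TRUE") pass++
      sum += $(col["mean_qscore_template"])
    }
    END {
      if (n > 0) printf "reads=%d passed=%d mean_q=%.2f\n", n, pass, sum / n
    }
  ' "$1"
}
```

It could be called as `summarise_qscores SUP/sequencing_summary.txt`, assuming the summary was written to the `SUP` save path used in the command above. If every read reports `FALSE` with a low mean Q-score, as in the three rows posted, that is consistent with the simulated signal rather than with a wrong guppy config, and the reads that failed the filter are the ones to carry into the downstream benchmark.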

capoony commented 2 months ago

Dear Matt, that indeed helps a lot, many thanks!

best, Martin