Psy-Fer / buttery-eel

The buttery eel - a slow5 guppy/dorado basecaller wrapper
MIT License

Add model version to output #41

Closed by Psy-Fer 1 week ago

Psy-Fer commented 1 month ago

When --config dna_r10.4.1_e8.2_400bps_5khz_modbases_5mc_sup_prom.cfg is given as the model, the output doesn't say which model version was used, and that is needed for matching with clair3 variant calling models.

To fix this, I should pull the data out of the cfg file used and put it into the fastq header or the PG tags in the sam.

In the cfg file it is under dorado_model_path.

Probably need to do the modbase one too, just in case:

# Basecalling.
model_file                          = template_r10.4.1_e8.2_400bps_5khz_sup.jsn
dorado_model_path                   = dna_r10.4.1_e8.2_400bps_sup@v4.2.0
remora_models                       = dna_r10.4.1_e8.2_400bps_5khz_modbases_5mc_cg_sup.jsn
dorado_modbase_models               = dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2

Only mod config files have remora_models and dorado_modbase_models; a plain basecalling config looks like this:

# Basecalling.
model_file                          = template_r10.4.1_e8.2_400bps_5khz_sup.jsn
dorado_model_path                   = dna_r10.4.1_e8.2_400bps_sup@v4.2.0
chunk_size                          = 2000
gpu_runners_per_device              = 12
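
Pulling those two keys out of the cfg text could look roughly like the sketch below. It assumes the simple `key = value` layout shown above with `#` comment lines; `read_model_versions` is a hypothetical helper name, not something that exists in buttery-eel.

```python
def read_model_versions(cfg_text):
    """Return the basecall and modbase model names found in a cfg file's text."""
    wanted = ("dorado_model_path", "dorado_modbase_models")
    found = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in wanted:
            found[key] = value
    return found


example_cfg = """\
# Basecalling.
model_file                          = template_r10.4.1_e8.2_400bps_5khz_sup.jsn
dorado_model_path                   = dna_r10.4.1_e8.2_400bps_sup@v4.2.0
dorado_modbase_models               = dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2
"""

print(read_model_versions(example_cfg))
```

For a config without modbase entries, the returned dict would simply lack the `dorado_modbase_models` key.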

Psy-Fer commented 1 month ago

The dorado-server config file structure has changed from 7.3.9 onwards.

It now looks more like this:

dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_hac.cfg

# Basic configuration file for ONT basecaller software.

# Basecalling.
dorado_model_path                   = dna_r10.4.1_e8.2_400bps_hac@v4.1.0
dorado_modbase_models               = dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2

# Calibration strand detection
calib_reference                     = lambda_3.6kb.fasta
calib_min_sequence_length           = 3000
calib_max_sequence_length           = 3800
calib_min_coverage                  = 0.6

# Output.
min_qscore                          = 9.0

dna_r10.4.1_e8.2_400bps_5khz_hac.cfg

# Basic configuration file for ONT basecaller software.

# Compatibility
compatible_flowcells                = FLO-MIN114,FLO-FLG114,FLO-PRO114,FLO-PRO114M
compatible_kits                     = SQK-LSK114,SQK-LSK114-XL,SQK-ULK114,SQK-RAD114,SQK-PCS114
compatible_kits_with_barcoding      = SQK-NBD114-24,SQK-NBD114-96,SQK-RBK114-24,SQK-RBK114-96,SQK-RPB114-24,SQK-MLK114-96-XL,SQK-16S114-24,SQK-PCB114-24

# Basecalling.
dorado_model_path                   = dna_r10.4.1_e8.2_400bps_hac@v4.3.0

# Calibration strand detection
calib_reference                     = lambda_3.6kb.fasta
calib_min_sequence_length           = 3000
calib_max_sequence_length           = 3800
calib_min_coverage                  = 0.6

# Output.
min_qscore                          = 9.0

Psy-Fer commented 1 month ago

Looks like the way they expose the model to the API now includes the model version correctly, so no need to read the config file.

For example, in the fastq file output:

@74c57b9f-6ec7-4cc6-8384-c0df0d5e7f82 parent_read_id=74c57b9f-6ec7-4cc6-8384-c0df0d5e7f82 model_version_id=dna_r10.4.1_e8.2_400bps_fast@v4.3.0 mean_qscore=13
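
Anything downstream that needs the model (for example when choosing a clair3 model) can read it straight from that header line. A minimal sketch, assuming the header is the read ID followed by space-separated `key=value` fields as in the example above; `parse_fastq_header` is a hypothetical name:

```python
def parse_fastq_header(header):
    """Split a FASTQ header line into the read ID and a dict of key=value tags."""
    if header.startswith("@"):
        header = header[1:]
    fields = header.split()
    read_id = fields[0]
    tags = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # keep only fields that really are key=value
            tags[key] = value
    return read_id, tags


header = (
    "@74c57b9f-6ec7-4cc6-8384-c0df0d5e7f82 "
    "parent_read_id=74c57b9f-6ec7-4cc6-8384-c0df0d5e7f82 "
    "model_version_id=dna_r10.4.1_e8.2_400bps_fast@v4.3.0 mean_qscore=13"
)
read_id, tags = parse_fastq_header(header)
print(read_id, tags["model_version_id"])
```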

So now I need to do this for sam... RG tags?

Psy-Fer commented 4 weeks ago

I can get the modbase ones using this tag in the basecaller output:

modbase_model_version_id

Psy-Fer commented 3 weeks ago

Modbase model seems to only be exposed at the read level.

I'm going to add this to the TODO pile because that will require getting the first read, and triggering the header writes before writing the first read, rather than when the writer is spawned.
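
The deferred header write described above might be sketched like this. Everything here is an assumption for illustration: `LazyHeaderSamWriter` is a hypothetical class, not buttery-eel's actual writer, and recording the modbase model in an `@CO` comment line is just one possible place to put it.

```python
import io


class LazyHeaderSamWriter:
    """Hold the SAM header back until the first read arrives."""

    def __init__(self, handle, base_header_lines):
        self.handle = handle
        self.base_header_lines = list(base_header_lines)
        self.header_written = False

    def _write_header(self, modbase_model=None):
        for line in self.base_header_lines:
            self.handle.write(line + "\n")
        if modbase_model is not None:
            # Assumption: store the read-level modbase model as a header comment.
            self.handle.write("@CO\tmodbase_model=" + modbase_model + "\n")
        self.header_written = True

    def write_read(self, sam_line, modbase_model=None):
        # The first read triggers the header write, using that read's model info.
        if not self.header_written:
            self._write_header(modbase_model)
        self.handle.write(sam_line + "\n")

    def finish(self):
        # If no reads were ever written, still emit the base header.
        if not self.header_written:
            self._write_header()


out = io.StringIO()
writer = LazyHeaderSamWriter(out, ["@HD\tVN:1.5\tSO:unknown"])
writer.write_read(
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
    modbase_model="dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2",
)
writer.finish()
print(out.getvalue())
```

The trade-off is that the output file stays empty until the first read is basecalled, and the writer needs a final `finish()` call so a run with zero reads still produces a valid header.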

The new header looks like this, where the DS tag has basecall_model set to the model version dna_r10.4.1_e8.2_400bps_fast@v4.3.0:

@HD VN:1.5 SO:unknown
@PG ID:basecaller PN:ont basecaller VN:7.4.12
@PG ID:wrapper PN:buttery-eel VN:0.4.3 CL:buttery-eel --guppy_bin /home/jamfer/Downloads/ont-dorado-server-7.4.12/bin/ --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg -x cuda:0 -i small.blow5 -o test-7413.sam --port auto --use_tcp DS:ont basecaller wrapper basecall_model=dna_r10.4.1_e8.2_400bps_fast@v4.3.0
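
Reading the model back out of a header like this could be as simple as the following sketch. It assumes the `basecall_model=` token appears in the header text exactly as shown and contains no whitespace; `basecall_model_from_header` is a hypothetical helper.

```python
import re


def basecall_model_from_header(header_text):
    """Return the basecall_model value from SAM header text, or None if absent."""
    match = re.search(r"basecall_model=(\S+)", header_text)
    return match.group(1) if match else None


pg_line = (
    "@PG\tID:wrapper\tPN:buttery-eel\tVN:0.4.3\t"
    "DS:ont basecaller wrapper basecall_model=dna_r10.4.1_e8.2_400bps_fast@v4.3.0"
)
print(basecall_model_from_header(pg_line))
```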