WGLab / DeepMod2

DNA 5mC methylation detection from Dorado or Guppy basecalled Oxford Nanopore reads
MIT License
32 stars 2 forks source link

KeyError: 'segmentation' after running detect-guppy #5

Closed HippoYI closed 7 months ago

HippoYI commented 1 year ago

Hi, I run deepmod2 with the following command: python ~/softwares/DeepMod2/deepmod2 detect-guppy --bam ./sample.bam --fast5 ./workspac e/ --ref ./genome.FINAL.fasta --output ./deepmod2_output/ --threads 40

But I meet some errors as showed below, and I have no idea what cause this.

2022-10-31 09:44:33.727690: Processing BAM File. 2022-10-31 11:43:03.338479: Finished Processing BAM File. 2022-10-31 11:43:03.338672: Starting Per Read Methylation Detection. 2022-10-31 11:43:43.405921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Netwo rk Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2022-10-31 11:43:48.938800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 w ith 22291 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:4b:00.0, compute capability: 8.6 2022-10-31 11:43:48.976388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 w ith 22268 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b1:00.0, compute capability: 8.6 multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/data/user/softwares/miniconda3/envs/deepmod2/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/data/user/softwares/DeepMod2/src/guppy.py", line 151, in detect base_level_data, seq_len = get_read_signal(read, params['guppy_group']) File "/data/user/softwares/DeepMod2/src/guppy.py", line 96, in get_read_signal segment=read.get_analysis_attributes(guppy_group)['segmentation'] KeyError: 'segmentation' """

Can you help me to solve this problem? Thank you!

umahsn commented 1 year ago

Hi,

You might need to rebasecall the FAST5 files using Guppy or resquiggle them using Tombo. It seems like you are using FAST5 files directly from a MinKNOW run which uses fast basecalling and doesn't store move table or segmentation data in FAST5 files. You can confirm whether segmentation or move tables are present using this command:

h5dump -n file_name.fast5

If you decide to run Guppy again make sure to use --fast5_out parameter. You can see an example of running Guppy here: https://github.com/WGLab/DeepMod2/blob/main/docs/Example.md#basecall-fast5-files-and-extract-bam-file

Let me know if this fixes the issue. Also, we uploaded new models so it might be good to pull changes again.

Thank you for using DeepMod2,

Umair

HippoYI commented 1 year ago

Hi, thanks for the reply. Actually, the fast5 files, and the bam file were created from the guppy_basecaller, just like the command you gave in the link. Below was the command I used:

~/softwares/guppy/ont-guppy/bin/guppy_basecaller -i assembly_related_rawData/UL_fast5/fast5_pass -s ./ -c ~/softwares/guppy/ont-guppy/data/dna_r9.4.1_450bps_sup_plant.cfg --device auto --fast5_out --bam_out --align_type full --align_ref genome.FINAL.fasta --num_alignment_threads 30 --num_read_splitting_threads 30

umahsn commented 1 year ago

I see. Can you show me what the result of h5dump -n file_name.fast5 looks like so I can take a look at what the datasets/groups are present? You can show only the first few lines if it is a multi-fast5 file. Also, what version of guppy is this?

umahsn commented 1 year ago

Another potential issue could be that if you rebasecalled without cleaning old basecalling data from MinKNOW, then you will have two basecall groups "Basecall_1D_000" and "Basecall_1D_001". The first basecall group "Basecall_1D_000" from MinKNOW will not have segmentation data, but the second basecall group "Basecall_1D_001" from rebasecalling with Guppy would have segmentation data. In that case, you would need to specify the correct basecall group to DeepMod2 using --guppy_group parameter. You might need to double check which basecall group has what name.

HippoYI commented 1 year ago

Sorry for the delay. I have multi-fast5 files, so belowed were the command and contents: $ h5dump -n 20220513-SUL0079-PAI98022_pass21.fast5|head HDF5 "20220513-SUL0079-PAI98022_pass21.fast5" { FILE_CONTENTS { group / group /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f group /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses group /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000 group /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000/BaseCalled_template dataset /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000/BaseCalled_template/Fastq dataset /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000/BaseCalled_template/Move group /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000/Summary

The guppy version is 6.1.2

Thanks!

umahsn commented 1 year ago

Hi,

Can you try h5dump -g /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000 20220513-SUL0079-PAI98022_pass21.fast5|grep segmentation -A 10 to check if there is any segmentation attribute in the read you mentioned?

Ideally this should show an attribute names segmentation which indicates which would point to the name of segmentation group like this. image

This segmentation group also be located in /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Segmentation_000. Can you also show the output of h5dump -g /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Segmentation_000 20220513-SUL0079-PAI98022_pass21.fast5?

HippoYI commented 1 year ago

Hi, below were the resutls: $ ~/software/hdf5-1.8.17-linux-centos7-x86_64-gcc485-static/bin/h5dump -g /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Basecall_1D_000 20220513-SUL0079-PAI98022_pass21.fast5|grep segmentation -A 10 ATTRIBUTE "segmentation" { DATATYPE H5T_STRING { STRSIZE 17; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "Segmentation_000" }

$ ~/software/hdf5-1.8.17-linux-centos7-x86_64-gcc485-static/bin/h5dump -g /read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Segmentation_000 20220513-SUL0079-PAI98022_pass21.fast5 HDF5 "20220513-SUL0079-PAI98022_pass21.fast5" { GROUP "/read_00060bdb-1a1a-4f65-9b02-6150a59c3f0f/Analyses/Segmentation_000" { ATTRIBUTE "component" { DATATYPE H5T_STRING { STRSIZE 13; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "segmentation" } } ATTRIBUTE "name" { DATATYPE H5T_STRING { STRSIZE 32; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "ONT Guppy basecalling software." } } ATTRIBUTE "time_stamp" { DATATYPE H5T_STRING { STRSIZE 21; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "2022-11-08T03:09:04Z" } } ATTRIBUTE "version" { DATATYPE H5T_STRING { STRSIZE 14; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "6.1.2+e0556ff" } } GROUP "Summary" { ATTRIBUTE "return_status" { DATATYPE H5T_STRING { STRSIZE 20; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "Workflow Successful" } } GROUP "segmentation" { ATTRIBUTE "adapter_max" { DATATYPE H5T_IEEE_F32LE DATASPACE SCALAR DATA { (0): 0 } } ATTRIBUTE "duration_template" { DATATYPE H5T_STD_U64LE DATASPACE SCALAR DATA { (0): 17640 } } ATTRIBUTE "first_sample_template" { DATATYPE H5T_STD_U64LE DATASPACE SCALAR DATA { (0): 105 } } ATTRIBUTE "has_complement" { DATATYPE H5T_STD_I32LE DATASPACE SCALAR DATA { (0): 0 } } ATTRIBUTE "has_template" { DATATYPE H5T_STD_I32LE DATASPACE SCALAR DATA { (0): 1 } } ATTRIBUTE "med_abs_dev_template" { DATATYPE H5T_IEEE_F32LE DATASPACE SCALAR DATA { (0): 9.50346 } } ATTRIBUTE "median_template" { DATATYPE H5T_IEEE_F32LE DATASPACE SCALAR DATA { (0): 72.738 } } ATTRIBUTE "num_events_template" { DATATYPE H5T_STD_U64LE DATASPACE SCALAR DATA { (0): 3528 } } ATTRIBUTE "pt_detect_success" { DATATYPE H5T_STD_I32LE DATASPACE SCALAR DATA { (0): 0 } } ATTRIBUTE "pt_median" { DATATYPE H5T_IEEE_F32LE DATASPACE SCALAR DATA { (0): 0 } } ATTRIBUTE "pt_sd" { DATATYPE H5T_IEEE_F32LE DATASPACE SCALAR DATA { (0): 0 } } } } } }

umahsn commented 1 year ago

Hi,

Sorry for the delay in reaching back. It seems like the read you have shown has segmentation data so most likely what has happened is that there are some reads that do not have it and those particular reads are giving this error. If thats the case then you might see some methylation calls before the error. Can you confirm that? I am going to fix the code so that it can skip the reads that do not have this data instead of crashing, and optionally allow users to see which reads failed due to lack of segmentation data.

umahsn commented 1 year ago

Also, can you tell me which commit of the repository you are using? The code lines from the error output you posted don't seem to match code lines in current commit. It seems like there already is a KeyError exception in the current code.