PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
62 stars 5 forks source link

Read sequence is too short for MM tag offsets in bam record #60

Closed hasindu2008 closed 4 months ago

hasindu2008 commented 8 months ago

Hi there,

I am trying to use pb-CpG-tools-v2.3.2 on a revio dataset. I am pretty new to pacbio data and it is very likely I am making a silly mistake., so apologies in advance.

I used the following Minimap2 command to map the hifi reads and generate a BAM file:

samtools fastq -TMM,ML r84088_20230609_025659/1_A01/hifi_reads/m84088_230609_030819_s1.hifi_reads.default.bam  | minimap2 -ax map-hifi -y /genome/hg38noAlt_hifi.idx - -t12 --secondary=no | samtools sort - -o r84088_20230609_025659_1_A01_meth.bam
samtools index r84088_20230609_025659_1_A01_meth.bam

then I ran:

aligned_bam_to_cpg_scores --model  models/pileup_calling_model.v1.tflite --ref /genome/hg38noAlt.fa --bam r84088_20230609_025659_1_A01_meth.bam

This gives the following error:

[2023-10-26][18:40:21][aligned_bam_to_cpg_scores][INFO] Starting aligned_bam_to_cpg_scores
[2023-10-26][18:40:21][aligned_bam_to_cpg_scores][INFO] cmdline: /install/pb-CpG-tools-v2.3.2/bin/aligned_bam_to_cpg_scores --model /install/pb-CpG-tools-v2.3.2/models/pileup_calling_model.v1.tflite --ref /genome/hg38noAlt.fa --bam r84088_20230609_025659_1_A01_meth.bam
[2023-10-26][18:40:21][aligned_bam_to_cpg_scores][INFO] Running on 32 threads
[2023-10-26][18:40:21][aligned_bam_to_cpg_scores][INFO] Processing alignment file 'r84088_20230609_025659_1_A01_meth.bam'
[00:00:00] [                                        ] Processed         0 of 3,099,922 ref genome kb (0%)
thread '<unnamed>' panicked at '

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!   unwrap! called on Option::None                                             !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/var/lib/bamboo/.cargo/git/checkouts/rust-vc-utils-44210fd7ec9d1030/f42e948/src/bam_utils/basemod.rs:203,39 in rust_vc_utils::bam_utils::basemod
Read sequence is too short for MM tag offsets in bam record: m84088_230609_030819_s1/251859735/ccs

Should I be using pbmm instead of minimap2 for alignment?

ctsa commented 8 months ago

pbmm2 is recommended, but you can also use minimap2 if you add the -Y option. Without this option, minimap2 will create hard-clipped reads which can no longer be matched to the MM/ML methylation tags. The issue is described here:

https://github.com/PacificBiosciences/pb-CpG-tools#input-alignment-file

hasindu2008 commented 8 months ago

Thanks. It worked after the -Y flag. I should have read the documentation more carefully.

I plotted the correlation with bisulphite data and they look like below:

pileup model option:

image

pileup count option:

image

The Pacbio sample is ~20X NA12878 sample we generated on our Revio and the bisulfite data comes from https://www.encodeproject.org/files/ENCFF835NTC/. Are these correlations what you would typically expect?