PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
62 stars 5 forks source link

Error: Read sequence is too short for MM tag in record: {qname} #56

Closed glogsdon1 closed 10 months ago

glogsdon1 commented 10 months ago

Hi,

I aligned PacBio HiFi data to an assembly made from the same data using the following parameters: minimap2 -t 8 -I 15G -Y --MD --eqx -ax map-pb. I, then, linked the resulting bam to the methylation data using methylink and the following command: methylink --threads {N} --aln {aligned.bam} --sample {sample_name} --methyl_bams "$(echo {path_to_methyl_bams)" --output {output.bam}. I then used this tool to determine the CpG status with the following command: aligned_bam_to_cpg_scores --bam {output.bam} --output-prefix {prefix} --model $PB_MODEL/pileup_calling_model.v1.tflite --threads {N}.

This worked fine for 2 samples, but I got this error for 3 others:

[2023-08-19][12:17:36][aligned_bam_to_cpg_scores][INFO] Starting aligned_bam_to_cpg_scores [2023-08-19][12:17:36][aligned_bam_to_cpg_scores][INFO] cmdline: aligned_bam_to_cpg_scores --bam AG18354_PTR_hifimethyl2v1.1.bam --output-prefix AG18354_PTR_hifimethyl2v1.1 --model /net/eichler/vol26/7200/software/modules-sw/pb-CpG-tools/2.3.1/Linux/CentOS7/x86_64/models/pileup_calling_model.v1.tflite --threads 20 [2023-08-19][12:17:36][aligned_bam_to_cpg_scores][INFO] Running on 20 threads [2023-08-19][12:17:36][aligned_bam_to_cpg_scores][INFO] Processing alignment file 'AG18354_PTR_hifimethyl2v1.1.bam' [00:00:00] [================================> ] Processed 5,022,534 of 6,150,246 ref genome kb (82%) thread '' panicked at 'Read sequence is too short for MM tag in record: {qname}', /var/lib/bamboo/.cargo/git/checkouts/rust-vc-utils-44210fd7ec9d1030/0db443b/src/bam_meth_info.rs:191:18 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace Aborted

Since I included the -Y parameter in the minimap2 command, I was surprised to get this error. Can you give me some advice on how to prevent this?

Thanks, Glennis

ctsa commented 10 months ago

Hi @glogsdon1, Sorry it looks like there's a problem formatting the error message here, we can get that fixed on our end. Without the correct error it is hard to investigate the problem because we can't pull up a problematic read by qname.

I'm not familiar with methylink so it is difficult to speculate about the problem, you are right that minimap2 with the -Y option should take care of the typical problem that would trigger this error message.

Let me get you an updated version with a more helpful error message.

projectoriented commented 10 months ago

Hi @ctsa,

Thanks for developing this! methylink maps the methylation tags from the unmapped bams (that contain such info) to the aligned/target bam through the qname. It does not truncate/elongate/alter the information that comes with the tags. And for the qname that doesn't have methylation meta data, it remains as is.

Looking forward to the updated version so we can speculate the specific read that is causing the problem. Thanks a bunch!

ctsa commented 10 months ago

Thanks, I updated a v2.3.2 release with the error message fix here:

https://github.com/PacificBiosciences/pb-CpG-tools/releases/tag/v2.3.2

This won't do anything to fix the issue above but should help describe an example read qname where the issue is occurring. I've marked this as pre-release for now, depending on what you find we'll either finish this as a release or make further updates.

projectoriented commented 10 months ago

Hi @ctsa, thank you for the changes. Do you think it's worth while to have this as an exception versus an exit on failure? Otherwise, do you have suggestions on how to effectively filter out these erroneous reads? These are the tags associated to the problematic read:

MM:Z:C+m,46,20,26,20,2,12,8,49,4,78,13,66,11,11,8,30,101,26,14,35,61,182,8,99,87,29,0,52,0,37,22,13,9,15,4,104,7,97,109,141,33,22,31,23,44,10,20,24,3,70,7,59,67,81,20,5,39,10,136,108,95,24,81,0,12,6,11,6,32,52,78,27,18,162,33,3,2,27,10,7,43,0,5,22,17,5;

ML:B:C,34,179,48,139,45,3,29,2,56,224,63,6,127,149,214,6,26,59,42,68,115,201,248,3,250,255,254,187,96,69,4,5,44,6,117,41,46,20,58,177,42,155,199,17,67,123,235,43,9,141,78,38,178,32,150,62,32,61,80,164,17,157,214,16,93,17,6,74,5,19,5,1,162,87,10,11,49,5,113,18,2,13,44,100,162,106

Number of nucleotides in the sequence:
14663
ctsa commented 10 months ago

I'd hesitate to downgrade this to some kind of warning -- the reason is that once you have an indication that the methylation data isn't correctly registered to the reads, then you don't know if any of the data is registered correctly -- there will be cases where the MM tag could appear to legally match the read string but would just be randomly assigning methylation data from a different read or strand or offset from the real positions.

For your case, if you haven't already done so, you could go back and check the qname in the unmapped bam, and see that (1) the MM/ML tag matches the one added to the mapped qname record (2) The read in the mapped bam either matches the unmapped record or is an exact revcomp with the reverse mapped flag set.

If the above sanity checks, then it suggests the MM tags in your unmapped bam could be corrupted by something like adaptor trimming or other pre-processing steps.

projectoriented commented 10 months ago

We checked both 1 & 2 and neither are causing the problem. We have a strong suspicion it's the input file we used with corrupted methylation tag because there are other reads with an empty tag (from the unmapped bam) which should not happen at all. Therefore, we will generate our own methylation probabilities versus using pre-shipped ones.

Thanks a bunch for helping 😄 & here's the error code to finalize this issue.

glogsdon@liger:methylation_calls:09:32 AM => aligned_bam_to_cpg_scores \
>   --bam AG18354_PTR_hifimethyl2v1.1.cens.chr3.bam \
>   --output-prefix AG18354_PTR_hifimethyl2v1.1.cens.chr3 \
>   --model $PB_MODEL/pileup_calling_model.v1.tflite \
>   --threads 20
[2023-08-22][09:32:17][aligned_bam_to_cpg_scores][INFO] Starting aligned_bam_to_cpg_scores
[2023-08-22][09:32:17][aligned_bam_to_cpg_scores][INFO] cmdline: aligned_bam_to_cpg_scores --bam AG18354_PTR_hifimethyl2v1.1.cens.chr3.bam --output-prefix AG18354_PTR_hifimethyl2v1.1.cens.chr3 --model models/pileup_calling_model.v1.tflite --threads 20
[2023-08-22][09:32:17][aligned_bam_to_cpg_scores][INFO] Running on 20 threads
[2023-08-22][09:32:17][aligned_bam_to_cpg_scores][INFO] Processing alignment file 'AG18354_PTR_hifimethyl2v1.1.cens.chr3.bam'
[00:00:00] [=================================>      ] Processed 5,081,750 of 6,150,246 ref genome kb (83%)                                                      thread '<unnamed>' panicked at '

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!   unwrap! called on Option::None                                             !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/var/lib/bamboo/.cargo/git/checkouts/rust-vc-utils-44210fd7ec9d1030/f42e948/src/bam_utils/basemod.rs:203,39 in rust_vc_utils::bam_utils::basemod
Read sequence is too short for MM tag offsets in bam record: m64076_210810_005444/31916101/ccs

', /var/lib/bamboo/.cargo/registry/src/github.com-1ecc6299db9ec823/unwrap-1.2.1/src/lib.rs:93:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted
ctsa commented 10 months ago

Thanks for passing on those details. Please share if the issue traces back to another part of our software. I'll go ahead and set 2.3.2 as an official release since there isn't an obvious follow-on improvement in MM tag parsing we know of at this time.