Closed jameson-orvis closed 13 hours ago
Hi thanks for letting us know this, the ML tag is not meant to be malformed we will investigate and get back to you
Hi! I am also interested in extracting the methylation information and was wondering if there is anything new about this topic? I realised that you are using the ML:Z: tag but couldnt find any description for this in the SAM format specification. Could you maybe commend why you use this tag and not the ML:B:C?
Just looking at the code, I think you've hit on the error @Henrikkoe. The code was adapted to use pysam from code that directly wrote SAM text. The value of the tag is being created as a string and passed to pysam which gladly writes the string as a Z format tag rather than a B format tag. We will need to change this.
I just quickly changed the tags from ML:Z: to ML:B:C for a test set and got this error for "modkit pileup":
failed to get modbase info for record 55d81b4f-3394-4430-b554-1ada22653566:0477:0802, Bad Input: MN tag length 4874 and seq length 325 don't match
So I think for each alignment within a concatemer the MN tag should be updated according the alignment length.
Thanks @Henrikkoe,
That sounds like a secondary issue, after chopping up the read we need to invalidate the MN
tag. This is an additional tag that was added to catch exactly the situation where a process might have trimmed a read but not taken care to trim MM and ML!
You should be able to test this further for us. The pore_c_py digest
command has an option --remove_tags
which could be added to the nextflow code to drop the tag from the trimmed reads: https://github.com/epi2me-labs/wf-pore-c/blob/master/modules/local/pore-c.nf#L65
We should be able to recalculate and amend MN (worst case we can discard it) for the monomer reads.
I did a small check and the --remove_tags
option seems to remove as well the ML
and MM
tags. Also when specifying the MN
tag.
I may have confused things: yesterday I made a release of pore-c-py that ensures MN is updated correctly. In the process I changed the behaviour to consistently remove all the mod tags if one is removed (because it makes no sense to leave them in a partial state).
I am still a bit confused about the tags that are removed and added at which time during processing. If I understood it correctly then is the input for pore-c-py digest an unaligned bam file with the called base modifications. At this moment in the pipeline there was no alignment and only the digest of the concatemer into monomers. These monomers can be alignments later in the processing and basically representing a partial state/alignment of the concatemer. Then the sequence/alignment length (MN) should be updated according to the new length. This is what I dont see in the bam file entries but I see (except of the ML:Z: to ML:B:C tag label) a correct representation of ML and MM tags. So the ML and MM tags are trimmed correctly. Only all alignments of a concatemer then have a wrong MN tag.
Here a truncated representation of a bam file entry:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:2198
But I think it should be:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:387
You are correct: in the current release the MN tag is not being updated correctly (and the ML tag is a string tag not and array tag). We need to make a new release to correct these two issues.
We've corrected the code in the pore-c-py package but not updated wf-pore-c to use that new code.
Ah okay I see, thank you! Do you already have an estimate when the wf-pore-c will be updated?
Hi, this is updated now in the latest release v1.2.0
Thank you so much!!!
Awesome! Thanks
It seems that there is still an issue with a not correctly trimmed ML and MM tag for some reads/alignments.
5c0b5e95-4016-4c3e-9d36-227502762cd4:0099:0462 MN:i:363
correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:0600:0658 MN:i:6040
not trimmed ML and MM, wrong MN tag
5c0b5e95-4016-4c3e-9d36-227502762cd4:2568:2748 MN:i:180
correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:5400:5789 MN:i:389
correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:4169:4457 MN:i:288
correct
etc.
Oh no sorry, I think we missed a capitalisation. Will amend again and let you know when ready.
Hi, this has now been released in version v1.2.2
Ask away!
Hi! Quick question on extracting methylation data from aligned reads. Our raw bams have been methylation called already, so it seems like the easiest way to extract methylation information would be to use modkit on the aligned bams at {alias}.ns.bam. However, using "modkit pileup" gives errors of the form
The code for pore-c-py seems like it is designed to pass modified bam information to the digested aligned output bams, so I am confused as to why this is happening. It may be an issue with the way I am calling modkit, but before I attempt troubleshooting further I want to ask if it should be possible to use modkit to extract methylation data from BAMs aligned from the pipeline, or if there is another preferred I should be processing methylation calls. Thank you!