Extracting methylation data from aligned BAMs

jameson-orvis commented 5 months ago

Ask away!

Hi! Quick question on extracting methylation data from aligned reads. Our raw bams have been methylation called already, so it seems like the easiest way to extract methylation information would be to use modkit on the aligned bams at {alias}.ns.bam. However, using "modkit pileup" gives errors of the form

failed to get modbase info for record 1977e594-f368-4232-a0de-cb0dea59e5b6:0079:0393, Bad Input: ML tag malformed

The code for pore-c-py seems like it is designed to pass modified bam information to the digested aligned output bams, so I am confused as to why this is happening. It may be an issue with the way I am calling modkit, but before I attempt troubleshooting further I want to ask if it should be possible to use modkit to extract methylation data from BAMs aligned from the pipeline, or if there is another preferred I should be processing methylation calls. Thank you!

sarahjeeeze commented 5 months ago

Hi thanks for letting us know this, the ML tag is not meant to be malformed we will investigate and get back to you

Henrikkoe commented 4 months ago

Hi! I am also interested in extracting the methylation information and was wondering if there is anything new about this topic? I realised that you are using the ML:Z: tag but couldnt find any description for this in the SAM format specification. Could you maybe commend why you use this tag and not the ML:B:C?

cjw85 commented 4 months ago

Just looking at the code, I think you've hit on the error @Henrikkoe. The code was adapted to use pysam from code that directly wrote SAM text. The value of the tag is being created as a string and passed to pysam which gladly writes the string as a Z format tag rather than a B format tag. We will need to change this.

Henrikkoe commented 4 months ago

I just quickly changed the tags from ML:Z: to ML:B:C for a test set and got this error for "modkit pileup":

failed to get modbase info for record 55d81b4f-3394-4430-b554-1ada22653566:0477:0802, Bad Input: MN tag length 4874 and seq length 325 don't match

So I think for each alignment within a concatemer the MN tag should be updated according the alignment length.

cjw85 commented 4 months ago

Thanks @Henrikkoe,

That sounds like a secondary issue, after chopping up the read we need to invalidate the MN tag. This is an additional tag that was added to catch exactly the situation where a process might have trimmed a read but not taken care to trim MM and ML!

You should be able to test this further for us. The pore_c_py digest command has an option --remove_tags which could be added to the nextflow code to drop the tag from the trimmed reads: https://github.com/epi2me-labs/wf-pore-c/blob/master/modules/local/pore-c.nf#L65

We should be able to recalculate and amend MN (worst case we can discard it) for the monomer reads.

Henrikkoe commented 4 months ago

I did a small check and the --remove_tags option seems to remove as well the ML and MM tags. Also when specifying the MN tag.

cjw85 commented 4 months ago

I may have confused things: yesterday I made a release of pore-c-py that ensures MN is updated correctly. In the process I changed the behaviour to consistently remove all the mod tags if one is removed (because it makes no sense to leave them in a partial state).

Henrikkoe commented 4 months ago

I am still a bit confused about the tags that are removed and added at which time during processing. If I understood it correctly then is the input for pore-c-py digest an unaligned bam file with the called base modifications. At this moment in the pipeline there was no alignment and only the digest of the concatemer into monomers. These monomers can be alignments later in the processing and basically representing a partial state/alignment of the concatemer. Then the sequence/alignment length (MN) should be updated according to the new length. This is what I dont see in the bam file entries but I see (except of the ML:Z: to ML:B:C tag label) a correct representation of ML and MM tags. So the ML and MM tags are trimmed correctly. Only all alignments of a concatemer then have a wrong MN tag.

Here a truncated representation of a bam file entry: 653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:2198 But I think it should be: 653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:387

cjw85 commented 4 months ago

You are correct: in the current release the MN tag is not being updated correctly (and the ML tag is a string tag not and array tag). We need to make a new release to correct these two issues.

We've corrected the code in the pore-c-py package but not updated wf-pore-c to use that new code.

Henrikkoe commented 3 months ago

Ah okay I see, thank you! Do you already have an estimate when the wf-pore-c will be updated?

sarahjeeeze commented 3 months ago

Hi, this is updated now in the latest release v1.2.0

jameson-orvis commented 3 months ago

Thank you so much!!!

Henrikkoe commented 3 months ago

Awesome! Thanks

Henrikkoe commented 3 months ago

It seems that there is still an issue with a not correctly trimmed ML and MM tag for some reads/alignments.

5c0b5e95-4016-4c3e-9d36-227502762cd4:0099:0462 MN:i:363 correct 5c0b5e95-4016-4c3e-9d36-227502762cd4:0600:0658 MN:i:6040 not trimmed ML and MM, wrong MN tag 5c0b5e95-4016-4c3e-9d36-227502762cd4:2568:2748 MN:i:180 correct 5c0b5e95-4016-4c3e-9d36-227502762cd4:5400:5789 MN:i:389 correct 5c0b5e95-4016-4c3e-9d36-227502762cd4:4169:4457 MN:i:288 correct etc.

sarahjeeeze commented 3 months ago

Oh no sorry, I think we missed a capitalisation. Will amend again and let you know when ready.

sarahjeeeze commented 3 months ago

Hi, this has now been released in version v1.2.2

epi2me-labs / wf-pore-c

Extracting methylation data from aligned BAMs #63

Ask away!