question about format of future modBAM outputs

sathish-t commented 2 months ago

Hello,

I saw recently in a post that future versions of DNAscent might output modBAM files instead of .detect. I wanted to ask a few questions so that I can anticipate what I need to do for some tools I use downstream of DNAscent:

1. Are you going to add modification information to the BAM file that contains the alignment information, or are you going to produce a separate BAM file, in addition to the alignment BAM, that contains only modification information?
2. Modification codes (going by https://samtools.github.io/hts-specs/SAMtags.pdf): there is no one-letter representation for BrdU or EdU in the standard table of modification codes, and I've found the field is not very consistent on this point. For example, I've seen a study use B for BrdU in the modBAM, i.e. T+B to indicate a BrdU modification. This is not desirable, as B means 'not A' in the standard one-letter nucleotide codes. However, there is a standard numeric representation for BrdU in the ChEBI codes (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:472552). I could not find such an identifier for EdU. So are you going to use a representation like T+472552 for BrdU and the generic T+T for EdU?
3. Again, the field is not very consistent about marking missing bases as "missing" rather than "unmodified" in modBAM files. For example, T+T? means missing bases are to be regarded as missing, whereas T+T. or just plain T+T means missing bases are to be treated as unmodified. I've noticed DNAscent detect does not output probabilities for a small fraction of thymidines per read. So are you going to use the ? notation to show that missing bases are to be regarded as missing?
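To make the ? versus . distinction concrete, here is a minimal sketch of splitting an MM tag value into its entries and reporting how skipped bases should be read. The function name `parse_mm` and the dictionary keys are hypothetical, chosen only for illustration; the header layout (base, strand, code, optional flag) follows the SAMtags specification linked above.

```python
import re

# One MM entry header: canonical base, strand, modification code(s)
# (letters or a numeric ChEBI id), then an optional '.' or '?' flag.
MM_HEADER = re.compile(r"^([ACGTUN])([+-])([a-zA-Z]+|\d+)([.?]?)$")


def parse_mm(mm_value):
    """Split an MM tag value such as 'T+472552?,0,2;' into entries.

    Hypothetical helper for illustration; not part of DNAscent or htslib.
    """
    entries = []
    for chunk in mm_value.split(";"):
        if not chunk:
            continue  # trailing ';' leaves an empty final chunk
        header, *deltas = chunk.split(",")
        match = MM_HEADER.match(header)
        if match is None:
            raise ValueError(f"malformed MM entry: {chunk!r}")
        base, strand, code, flag = match.groups()
        entries.append({
            "base": base,
            "strand": strand,
            "code": code,
            # '?' means skipped bases carry no information;
            # '.' or no flag means skipped bases are unmodified.
            "skipped_means": "unknown" if flag == "?" else "unmodified",
            "deltas": [int(d) for d in deltas],
        })
    return entries
```

A downstream tool could branch on `skipped_means` to decide whether thymidines without a call count towards a denominator or are dropped.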

Thanks!

Regards Sathish

MBoemo commented 2 months ago

It will be a new bam file: each record in it will have passed DNAscent detect's QCs and will have the full alignment with appended MM and ML tags. It's fine if the records in the input bam file already have MM and ML tags (e.g., from Dorado).

Yes, it's unfortunate this isn't more standardised. DNAscent calls analogues at T positions on the subsequence of the reference that the read maps to, so it's going to be N+e? and N+b? (or - if the strand is reverse). IGV only has one colour for "Other", so this keeps the expected letters "e" and "b" while letting IGV parse them as 5fU and 5caU, respectively, so users can pick what colours they want.
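For anyone adapting downstream tools, the following is a rough sketch of how per-position probabilities could be packed into an N+b? style MM entry and its matching ML values. It is not DNAscent's implementation; `encode_mm_ml` and its arguments are hypothetical. It assumes positions are 0-based along the read as basecalled, and uses the ML convention from the SAMtags specification, where integer n stands for the probability interval [n/256, (n+1)/256).

```python
def encode_mm_ml(code, calls):
    """Build an MM entry and ML values for canonical base N.

    code:  single-letter modification code, e.g. "b" or "e".
    calls: {0-based read position: probability in [0, 1]}.
    Hypothetical helper; positions without a call are simply skipped,
    which the '?' flag marks as "no information".
    """
    deltas, ml = [], []
    prev = -1
    for pos in sorted(calls):
        # With base N every base counts, so the delta is the number
        # of bases skipped since the previous call.
        deltas.append(pos - prev - 1)
        # Scale to 0..255; a probability of exactly 1.0 is clipped to 255.
        ml.append(min(255, int(calls[pos] * 256)))
        prev = pos
    mm = f"N+{code}?" + "".join(f",{d}" for d in deltas) + ";"
    return mm, ml
```

When a record carries both an e and a b entry, the ML array is the concatenation of each entry's values in the same order as the entries appear in MM.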

sathish-t commented 2 months ago

Thanks for your reply! Will the software make both .detect and .modBAM files moving forward? If not, I'd like to request such a feature if possible!

MBoemo commented 2 months ago

Yup. The file extension of the output file that you pass to DNAscent detect will become syntactically meaningful. If you pass output.detect you'll get the human-readable output as before and if you pass output.bam you'll get a bam file with MM and ML tags.
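The extension-based behaviour described here could be pictured roughly as below. This is only an illustrative sketch: the function name `output_format`, the returned labels, the case-insensitive matching and the error on other extensions are all assumptions, not DNAscent's actual behaviour.

```python
from pathlib import Path


def output_format(path):
    """Pick an output format from the file extension (hypothetical sketch)."""
    suffix = Path(path).suffix.lower()  # assumption: case-insensitive match
    if suffix == ".detect":
        return "detect"   # human-readable output, as before
    if suffix == ".bam":
        return "modbam"   # bam file with MM and ML tags
    # Assumption: any other extension is rejected rather than guessed at.
    raise ValueError(f"unsupported output extension: {suffix!r}")
```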

jts commented 2 months ago

Just wanted to chime in with potential clarifications:

(or - if the strand is reverse)

The - indicator in modBAM is for when the modification is on the complementary strand to the basecalled read (e.g. when you have a duplex read), not when the read is reverse w.r.t. the reference genome. Not sure if this is what you meant or not, but wanted to point it out anyway.
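A related point from the SAMtags specification is that MM deltas are counted along the read as it was basecalled, so for a record aligned to the reverse strand the positions have to be flipped to index into the stored SEQ. The sketch below shows that bookkeeping for a + strand entry; `mm_positions_in_seq` is a hypothetical helper, not an htslib or pysam function.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def mm_positions_in_seq(seq, is_reverse, base, deltas):
    """Return 0-based indices into SEQ (as stored in the BAM record).

    seq:        SEQ field of the record.
    is_reverse: True if the record has the reverse-strand flag (0x10) set.
    base:       canonical base of the MM entry, e.g. "T" or "N".
    deltas:     skip counts from the MM entry.
    """
    # Recover the read as basecalled: stored SEQ is reverse-complemented
    # when the alignment is on the reverse strand.
    original = seq.translate(COMPLEMENT)[::-1] if is_reverse else seq
    # Positions in the original read that the deltas count over.
    candidates = [i for i, b in enumerate(original) if base == "N" or b == base]
    positions, cursor = [], -1
    for d in deltas:
        cursor += d + 1  # skip d candidates, then land on the next one
        idx = candidates[cursor]
        # Flip back to SEQ coordinates for reverse-strand records.
        positions.append(len(seq) - 1 - idx if is_reverse else idx)
    return positions
```

For a reverse-strand record, a T+ call therefore lands on an A in the stored SEQ, which is why the reverse flag alone does not call for a - in the MM tag.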

I think unless a single letter code is defined in the specification you should use ChEBI identifiers (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:472552)

MBoemo commented 2 months ago

Yes, that is what I meant, sorry! Using ChEBI identifiers would be ideal, but I don't think viewers handle them very well, unless I'm mistaken.

sathish-t commented 2 months ago

Hi Mike, modkit can convert between tags. I've used it a few times myself but have not tested it exhaustively; for example, I don't know whether it can convert numeric codes to letter codes (https://nanoporetech.github.io/modkit/intro_adjust.html). So if you are concerned just about viewers, then tags can be converted just before a viewing step.
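At the tag level, a code conversion amounts to rewriting the header of the matching MM entry while leaving the deltas and the ML array untouched. The sketch below illustrates that idea on the raw tag string only; it is not how modkit works internally, and `rename_mod_code` is a hypothetical helper.

```python
def rename_mod_code(mm_value, old, new):
    """Rename one modification code in an MM tag value.

    Example: old="472552" (ChEBI id for BrdU), new="b".
    Deltas are copied unchanged, so the ML array stays valid as-is.
    """
    out = []
    for chunk in mm_value.split(";"):
        if not chunk:
            continue  # trailing ';' leaves an empty final chunk
        header, sep, rest = chunk.partition(",")
        # Peel off the optional '.' or '?' flag at the end of the header.
        flag = header[-1] if header[-1] in ".?" else ""
        body = header[: len(header) - len(flag)]
        # First two characters are the canonical base and the strand.
        base_strand, code = body[:2], body[2:]
        if code == old:
            code = new
        out.append(base_strand + code + flag + sep + rest)
    return ";".join(out) + ";" if out else ""
```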

MBoemo / DNAscent

question about format of future modBAM outputs #58