Methylation call for ambig_bam

Hi Felix,

This is follow-up of my previous question about absence of methylation call in ambig_bam output. So I was so interested in analyzing methylation status of repeat sequence region and the majority of multimapper reads were assumed to be mapped to the repetitive regions. I wanted to somehow generate methylation call information from the bismark ambig_bam output in which multimapper reads are already mapped to one of their mappable loci in a random manner (which makes sense to me). But I tried several ways I could try but failed to finally make it.

Here are two questions.

Was there a specific intention that you didn't add methylation call in the ambig_bam? I guess that if there was any, it might be because it may finally end up in too less meaningful results due to too low complexity coming from sequence repetitiveness plus C>T bisulfite conversion of the reference sequence? (Even if that is true, I wanted to give it a shot.)
Could there be any hint, trick or comment that I may find a way to derive methylation call from the bismark ambig_bam output?

Thank you so much!

Jinkil

Hi Jinkil,

The reason we didn't want to add methylation calling to ambiguous reads is that people would simply want to have it, because there seems to be a widespread mindset of "the more, the better". We personally never wanted to go down this route and rather be a bit more conservative it by saying: if we don't know where a read really comes from, we don't want to infer its methylation state. If you have a specific question about e.g. a class of repeats you could still take the approach of using non self-repetitive sequences, or even using consensus sequences. But I am afraid we won't be starting to enable methylation calls from ambiguous reads, simply because everyone would be using them even though they probably shouldn't.

If you don't want to go down the route of enabling the methylation call procedure yourself, I think you could try MethylDackel which I believe works on a pileup based principle. Sorry if I can't offer more help right now..

Thank you for your response, Felix!

I understand what you were concerned if you allowed methylation calling to the ambiguous reads. My goal for analyzing methylation status of repetitive sequences was kind of a bit loose but not totally cutoff-free. I would not care how many loci a certain ambiguous read is mappable to but I would care if the mappable loci of a certain ambiguous read has a common feature, for example, if a given ambiguous read is mappable to 5 different loci AND if all of the 5 different mappable loci is, let say, ERV type sequences, then I will take its methylation information and use it to claim methylation status of the "bulk" ERV type repetitive sequences. It is kind of a similar idea to how repeat sequence transcripts are quantified as bulk from RNA-seq data, although there is some critical difference between RNA-seq and BS-seq. Mapping ambiguous RNA-seq reads (with a sufficiently long length) intrinsically has some level of confidence that (most of) multiple mappable loci of a certain ambiguous read are likely to be the same type of repetitive sequence. However, bisulfite-converted ambiguous reads may often have too low complexity so that I would need some reasonable filtering to ensure that a certain ambiguous bisulfite-seq read is mappable to only a single type of repetitive sequences. Do you think this idea is reasonable? I'm not sure if such filtering might drastically reduce the number of useful ambiguous reads though.

The main reasons I hesitated using consensus reference of repeat sequence families were that (i) I wanted to analyze all existing repeat sequences rather than focusing on a few specific repeat families and (ii) I was concerned if the result is biased (likely as a form of subsampling) due to sequence deviation existing in most of repeat sequence families in diverse patterns.

And thank you for introducing MethylDackel! I have not tried it yet but it seems like very appropriate for my trial analysis.

We have in the past tried several different strategies for repeat. This went on over several years and different organisms, and I am probably not the best person to say what actually works best (I could let you know who worked on this mainly though). I can still share a few thoughts on this, possible options are:

Aligning data to the genome as usual, and analyse portions of the alignments that overlap with repeats. We would typically use the RepeatMasker tracks for this, and simply import different repeat classes into SeqMonk, which could then handle all the overlaps and do the quantitations. A huge proportion of the genome is classified as 'repetitive element' of some description, but probably the majority of these annotations are in fact mappable, or at least not so repetitive that they will get kicked out from genomic alignments

This has in the past lead to the observation that the methylation levels of the mappable portion of certain repeat classes and repeat consensus sequence (see below), which are at least in part repetitive, appear different to each other. The reason is probably that border or edge regions of repeats are probably indeed behaving somewhat differently to the conserved repetitive portions, and you will have to accept that these approaches may not perfectly agree with each other, but they also address somewhat different questions...

For some projects we use indeed only a few consensus sequences, and map them either separately or against several consensus sequences together, e.g.:

This script runs Bismark against the following repeat consensus sequences as one genome: IAP, L1A, L1Tf, SINEB1, Major satellites, Minor satellites, MuERV-L, MuLV LTR and EtnI LTR

For some projects this has worked really well, and if you compare say one condition versus another one, an approach like this can show whether there are general effects.

Very generally, the problem is that if you work with very generalised categories, e.g: LTRs, LINE, SINE, major satellites... you may get a general idea about their methylation state, but you might miss subtle changes that occur in families, or even within only certain elements within families. On the other hand, if you start off with dozens or even thousands of elements, things can get very complicated and messy very quickly.

As a concrete example, when we used simply a consensus of LTRs together we missed differences in MuERV-L, MuLV LTR and EtnI LTR elements, but these families were vitally changed in some biological situations.

And lastly, for some human studies led by Ferdinand von Meyenn in the past, we used a set of all all consensus sequences for certain repeat classes (ERV_L1_SINE_Satellites_LTR) in the human genome, and used these sequences as repeat genomes. In the case of ERVs, a file would look like this:

>ERV3-16A3_I    ERV3    Homo sapiens
catgaatggtgccgggagtggtctgtagaaarcaggcgrtaagatggggttttgggactggctcactcac
cgcttagtggacagaggatacyatcctgatacaaggtggcggcccagttgrtaaaaatttcactggtggt
...
>ERV3-16A3_LTR  ERV3    Homo sapiens
tgtggcagccacggaggcgcgccgctcggatctcccttcaagaaagaacttgccgttcagccgcaaggag
tgcggttagctgacagcctccagctgctagcaccttcaggatccgcctcagctttcgagccgaggccacg
...
>ERVL   ERV3    Homo sapiens
aattttggtactaacagtgtatagaggaacaaaattttaaggatgggttttctgaattggtttcggggat
ttggtaattggctgccaaatmtggttagawttaaggatgctaatkaccctatttccagtagtaaagagag
...
>HARLEQUIN  ERV1    Homo sapiens
tttcttggttccctgaccaggaagcgaggtgattaacggacggtcgaggcagccccttaggcggcttagg
cctgccctgtggagcatccctgcrggggactccagccagcytgagtgacgcggatcctgagagcgctccc
...

These files could then be imported into SeqMonk where each repeat element of a family was regarded as a individual chromosome. From this it is very easy to see which of these elements i) were present, ii) had uniquely mappable regions and iii) whether the methylation levels were different across different repeat elements. Similar to genomic alignments, perfectly repetitive regions would be kicked out in this such cases. We found that a combination of this, together with more focused single-consensus mappings was able to answer lots of questions.

If you have any questions about this, feel free to pester me further...

Hi Felix,

Thank you so much for the detailed explanations for each different ways of aligning the reads to repeat sequences.

What I first tried was essentially identical to the first option you listed. And as you said, it works pretty well for some of the repeat sequence families which have a longer and/or less repetitive pattern of repeat unit. And I already got a kind of satisfying differential methylation results for those repeat families in contexts of my biological conditions. So it was great.

But what I want to further do for now is somehow mapping to the other unfortunate repeat families which may have shorter and too regularly repetitive sequences. The following is the Dfam consensi of some of the repeat families that I want to try to analyze.

HSATI_consensus ttgggggtgccctatttcccatctcataacttattttaagaagcacagcataataatgtgtgggcttgggattcagtttttgaaacaaaacactgagccttcgatgaccttcctgtacntgtaaaagcacacctgtctgcatggcagcagttggacctcacagtgtggattgtgccttcaccctggaatgtttatgccctatcgccatggtgatgggattagggatctcctgcccttggtcctaagtgccactatctgtgctgagtttttcaaaggtcagagcagattgaaccnttgtggtttcattttccctgattttgatttttcttatggggaacctgtgtggctgcattcaaggtatgttcatactggcctgtcaaatgcgatcttttcaaattactagttaatgctttcaaaatatgttatttaaaaaattagcctctgtattttccatatgcagttataaatatgtttcatggttatgttttattcctcaatttatatatttgattattgtaccaagcagagtacctttgaaatttttcttcatttaaaaaatatgtatctt HSATII_consensus ccattcgattccattcgatgattccattcgattccattcgatgatgattccattcgattccattcgatgattccattcgattccattcgatgatgattccattcgattccattcgatgattccattcgattccattcgatgatgattccattcgattccattcgatgatt SAR_consensus acagtatataatatgtatttggtgtactttgatattttatgtacagtatataatatgtatttggtgtactttgatattttatgt SATa_consensus cattctcagaaacttctttgtgatgtgtgcattcaactcacagagttgaaccttccttttgatagagcagttttgaaacactctttttgtagaatctgcaagtggatatttggagcgctttgaggcctncggtggaaaaggaaatatcttcacataaaaactagacagaag

Among them, HSATI and SATa even have 2 Ns and 1N, respectively and I have no idea how that adversely affects bismark aligning. But anyway, I tried the second option you said using a bismark index generated with the above consensi. The result was like, SATa and HSATI had 5-10 reads uniquely mapped on them out of ~50 million reads data. And HSATII had ~2,000 reads ambiguously mapped from the same data. SAR none.

Low number of alignment might be because this bisulfite-seq data is RRBS which is dependent on MspI cleavage and enrichment. But the ambiguous mapping to HSATII seems to be kind of so hopeful that I may further try to quantify them. Do you think I can use methylDackel to quantify them by getting bismark's ambig.bam as its input?

Thank you!

Jinkil

Hi Junkil,

I agree that the RRBS might by itself either be informative for some sequences, or it might not. Prerequisite here would be that you have CCGG...CCGG fragments within the correct selected (short) fragment size, so either within the repeat sequence itself of directly surrounding it.

The consensus mapping only really works as long as the consensus are not self-repetitive, otherwise you will again end up with situations where the reads may not be placed uniquely. Here is an example of two of the sequences you linked:

HSATII_consensus
ccattcgattccattcgatgatt
ccattcgattccattcgatgatgattccattcgattccattcgatgatt
ccattcgattccattcgatgatgattccattcgattccattcgatgatt
ccattcgattccattcgatgatgattccattcgattccattcgatgatt

SAR_consensus
acagtatataatatgtatttggtgtactttgatattttatgt
acagtatataatatgtatttggtgtactttgatattttatgt

If you were to reduce the tandem repeat to a single repetitive region, it should work:

SAR_consensus
acagtatataatatgtatttggtgtactttgatattttatgt

The only other thing you need to think about here is the read length: End-to-end alignments require the reference sequence to be at least equally as long as read, otherwise the alignment will be invalid. Options here are to

either shorten the read sequence to the length of the consensus (best 'pad' the Consensus sequence with 2 Ns on either side so that the methylation can take place, like so:

Consensus:
NNacagtatataatatgtatttggtgtactttgatattttatgtNN

Read:
  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

with R being the Read (single-end alignments are fine for this)

alternatively, you could try to pad the consensus sequences with further Ns so that the reads can match:

 Consensus:
NNNNNNNNNNNNNNNNNNNNNNacagtatataatatgtatttggtgtactttgatattttatgtNNNNNNNNNNNNNNNNNNNNNN

Read:
  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

Either of these approaches should work, but you need to keep in mind that mismatches to Ns count as a penalty of -1 (not -6 like for a full mismatch), so you might have to relax --score_min L,0,-0.2 to e.g. `-0.4 or so.

I'd be happy to give it a try over here for a certain favourite repeat class, if you wanted to send me a subsample of your data... (I could provide an FTP server for this if required).

To be honest I have never used Methyldackel myself so I don't exactly know what it does to be honest...

Cheers, Felix

I'll just close this issue for the time being, let me know if you'd like to follow this up in some way.

FelixKrueger / Bismark

Methylation call for ambig_bam #367