aedera / m6anormalization

Calculate k-mer constants to normalize m6a levels inferred from DNA Nanopore reads

How to determine whether an enzyme is sequence-biased #1

Closed ZFF00 closed 1 month ago

ZFF00 commented 1 month ago

Dear author,

I have a small question. How can UMAP be used to analyze differences in methylation levels among k-mer sequences, as in Figure 1C of the article? I have calculated the methylation level of each k-mer. How can I determine whether my data show a sequence preference? Could you provide the relevant code?

Thank you very much!

aedera commented 1 month ago

Hello,

As shown in Fig. 1C, UMAP can be used to project k-mer sequences together with their methylation levels. To do this, first transform each k-mer into a vector by encoding each nucleotide as an integer (A/0, C/1, G/2, T/3). This gives a matrix, X, in which each row corresponds to one k-mer: the first k columns hold the integer-encoded nucleotides and the last column holds that k-mer's methylation level. This matrix can then be projected to a 2D space using UMAP. Clusters in the projection may indicate potential sequence preferences.
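For readers who need to build X themselves, here is a minimal sketch of the encoding described above. It is not the code used for the paper: the names `NUC_TO_INT`, `build_matrix`, and `example_levels` are hypothetical, the methylation values are made up for illustration, and it assumes all k-mers have the same length and contain only A, C, G, and T.

```python
import numpy as np

# Integer code for each nucleotide, following the mapping A/0, C/1, G/2, T/3.
NUC_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}

def build_matrix(kmer_levels):
    """Return X with one row per k-mer: k integer codes, then the methylation level.

    kmer_levels is a hypothetical dict mapping each k-mer string to its
    methylation level; all k-mers are assumed to have the same length.
    """
    rows = []
    for kmer, level in kmer_levels.items():
        codes = [NUC_TO_INT[nuc] for nuc in kmer.upper()]
        rows.append(codes + [level])
    return np.asarray(rows, dtype=float)

# Made-up example values, for illustration only.
example_levels = {"AACGT": 0.12, "GATCA": 0.85, "TTAGC": 0.40}
X = build_matrix(example_levels)
print(X.shape)  # (3, 6): five nucleotide columns plus one methylation column
```

A k-mer containing any other character (for example N) raises a `KeyError` here, so such k-mers would need to be filtered out or handled separately beforehand.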

Below is the Python code used to generate Fig. 1C, with X being the matrix described above:

import numpy as np
import matplotlib.pyplot as plt
import umap

# Data normalization: standardize each column to zero mean and unit variance.
Xcen  = X - np.mean(X, axis=0)
Xnorm = Xcen / np.std(Xcen, axis=0)
meth_levels = X[:,-1] # The last column contains the methylation level of each k-mer sequence

# Project to a 2D space
reducer  =  umap.UMAP()
projections = reducer.fit_transform(Xnorm)

# Visualization
fig, axs = plt.subplots(figsize=(3,3))
axs.scatter(
  projections[:,0], 
  projections[:,1], 
  s=.1, 
  edgecolor='none', 
  c=meth_levels, 
  cmap=plt.get_cmap('magma'), 
  rasterized=True
)

Best,