edwardslab-wustl / dxm

Deconvolution of Methylation Sequencing Data
GNU General Public License v3.0

Some questions about using DXM #4

Open powerhorse1986 opened 2 years ago

powerhorse1986 commented 2 years ago

Hi,

Our group is trying to use DXM to analyze some BS datasets. However, a few points need to be clarified before we can run the software successfully. Here are our questions:

  1. The command "dxm_solveMethylation" seems to require two files, Bins.txt and Freqs.txt. May I ask how you generated these two files? And are they necessary for running the command successfully?
  2. We are trying to run DXM on the DLBCL datasets. Two DLBCL datasets are mentioned in the paper: GSE130556, which contains 31 plain text files, and GSE66329, which contains 4 BED files. May I ask which dataset was used as input?
  3. If GSE130556 was used as input, how can we convert the files into the required BED files with proper position1 and position2 values?

Those are all our questions. We look forward to your response.

Thanks. Li

jredwards417 commented 2 years ago

Hi Li,

Thanks for your interest in DXM!

(1) These files relate to the transition probabilities and distance bins used in the modified HMM that models the short-range distance correlations in DNA methylation data. See the sections "Modified hidden markov model" and "HMM transition probabilities" in our paper for a deeper description of the HMM and how we trained the model. In practice we have found that the provided models work well across a range of samples as long as they show regional correlations like the samples we plotted in the paper, which is true for most human and mammalian samples. I suppose if you look at something extremely different, like methylation in E. coli or a cell line with severely depleted methylation, these models might not work and would need to be retrained.

(2) I believe GSE66329 is the one with 31 samples from Pan et al. The other 4 in GSE130556 are from our group. We used both data sets in the paper, but for different analyses. It should be pretty clearly labelled when we use each in the text and figure captions, but if you have questions about particular analyses, let me know.

(3) For GSE66329 (the one with 31 samples) you have to massage the input a bit. You should be able to use a simple script to extract the chromosome, coordinates, methylation, and coverage into a bed-like file format and then convert it using the instructions here. We use position1 as the C in the CpG, and then set position2 = position1 + 1. I think the other trick is that you might consider collapsing the methylation values across strands since I think they report values for each strand.
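The coordinate and strand-collapsing steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used for the paper: the input tuple layout `(chrom, pos, strand, meth, cov)`, the function name `collapse_strands`, and the output column order are all hypothetical, and `meth` is assumed to be a fraction between 0 and 1. The one fixed idea is that the minus-strand cytosine of a CpG sits one base after the plus-strand C, so it is shifted back by one before merging, and the merged methylation is weighted by coverage.

```python
from collections import defaultdict


def collapse_strands(records):
    """Merge per-strand CpG calls into one row per CpG.

    records: iterable of (chrom, pos, strand, meth, cov) tuples, where
    pos is the cytosine position on the reported strand, meth is a
    fraction in [0, 1], and cov is the read coverage. This layout is an
    assumption for illustration, not the actual GEO file format.
    """
    # (chrom, position of the plus-strand C) -> [methylated reads, coverage]
    merged = defaultdict(lambda: [0.0, 0])
    for chrom, pos, strand, meth, cov in records:
        if cov <= 0:
            continue  # skip sites with no reads
        # The minus-strand C pairs with the G of the CpG, one base downstream.
        c_pos = pos if strand == "+" else pos - 1
        entry = merged[(chrom, c_pos)]
        entry[0] += meth * cov
        entry[1] += cov

    rows = []
    # Note: chromosomes are sorted as strings here (chr10 before chr2).
    for (chrom, c_pos), (meth_reads, cov) in sorted(merged.items()):
        # position1 is the C of the CpG; position2 = position1 + 1.
        rows.append((chrom, c_pos, c_pos + 1, meth_reads / cov, cov))
    return rows


example = [
    ("chr1", 100, "+", 1.0, 10),
    ("chr1", 101, "-", 0.5, 10),  # same CpG as the line above
    ("chr2", 50, "+", 0.0, 4),
]
collapsed = collapse_strands(example)
for row in collapsed:
    print("\t".join(str(field) for field in row))
```

The function keeps whatever coordinate system the input uses, so any 0-based versus 1-based adjustment would still need to be checked against the source files and the DXM input instructions.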

John