al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
214 stars 96 forks source link

If there is a parameter which can be used to determine how many CpG site (not Coverage on one CpG site) will be tiled together ? #291

Closed DengEr-1993 closed 1 year ago

DengEr-1993 commented 1 year ago

Hi @alexg9010 ,

I have a small question.

I did almost the whole DMR calling workflow and it works well.

Here is my partial code:

myobj = methRead(file.list,
                     sample.id = as.list(vec_names_selected),
                     assembly  = "hg38",
                     treatment = c(rep(1, vec_n_Group1), rep(0, vec_n_Group2)),
                     context   = "CpG",
                     mincov    = 1
    )

 tiles = tileMethylCounts(myobj,
                               win.size = 500, 
                               step.size = 500, 
                               mc.cores = 30,
                               cov.bases = 5)

 filtered_tiles = filterByCoverage(tiles, lo.count = 5)

meth_all_tiles = methylKit::unite(filtered_tiles, mc.cores = 30, min.per.group = 2L)

For example, if I used only 5 in tileMethylCounts() and filterByCoverage(),

what's the different meaning between the two functions ?

If the parameter in the second function means remove CpG site counts lower than 5 ?

But it may be wrong according the help in R.

lo.count
An integer for read counts.Bases/regions having lower coverage than this count is discarded

Here it mostly means lo.count still calculate the Coverage reads.

If so, there must be difference between the two functions.

So I guess, in tileMethylCounts(), from CpG to tile, here I used 500bp,

Does it means in 500bp, all the CpG reads do a sum as the total region reads, or mean reads of all CpG reads in 500bp ?

I am really confused.

Because if it is right (the last sentence above),

lo.count in tileMethylCounts() must be a standard which is used to remove some tiles which is transfered by CpGs in low quality.

For exmaple , in a specific 500bp region, there are 5 CpG sites, contains 5, 6, 9, 7, 8 reads coveraged, respectively.

So in this 500bp region , how many CpG Coverage is there ? sum() 30 reads or mean() 7 reads ?

I hope you can give me a clear explaination when you are free.

Thank in advance.

alexg9010 commented 1 year ago

Hi @DengEr-1993

In tileMethylCounts, after tiles are constructed, the counts are first summed up over the tiles and then tiles with low coverage are discarded, please see here for what cov.bases in tile@ MethylCounts does internally.
FilterbyCoverage, as you already noticed, removes sites that have lower coverage than requested.

Essentially, doing filterbyCoverage after the tiling with the same value for cov.bases/lo.counts should not change the result. But doing the filtering beforehand to remove CpGs with coverage below 5 will give you a different result and should be preferred.

Concerning your example, the coverage is summed up. You should easily see what the coverage is by inspecting your resulting objects, e.g. compare coverage of tiles[[1]][1] and selectByOverlap(myobj[[1]], granges(tiles[[1]][1]).

Best, Alex

DengEr-1993 commented 1 year ago

Thank you very much. You mean here I did it in a wrong workflow ? So I should do filterbyCoverage before the tiling ?

You are right if I do it after tiling, it changed nothing.

So the result is quite different in counts or in quality if I do filterbyCoverage then ?

If not, can I ignore this problem ?

Thank you.

alexg9010 commented 1 year ago

Since you used methRead with min.cov=1 this is will include bases with very low quality/ not reliable methylation calls. This is why I would recommend including the filtering before the tiling.

If you are dealing with low coverage data ( generate from low-input material) I would suggest to require at least a coverage of 3 reads per base.

DengEr-1993 commented 1 year ago

Hi @alexg9010 ,

I updated my script according to your advice,

Firstly, input with a coverage of 3 or 1 or 5 reads per base;

myobj = methRead(file.list,
                     sample.id = as.list(vec_names_selected),
                     assembly  = "hg38",
                     treatment = c(rep(1, vec_n_Group1), rep(0, vec_n_Group2)),
                     context   = "CpG",
                     mincov    = 3     # 1  or 5
    )

mincov means the least coverage reads in every sample is 3 , not in all samples simutaneously, is that right ? You mean if I use mincov = 1 here, it will include bases with very low quality/ not reliable methylation calls. So the bigger the mincov value, the better result. So you suggested to require at least a coverage of 3 reads per base.
By the way, if I use mincov = 5 here, or 3 as you said, then if it is not necessary for me to do filterByCoverage (myobj, lo.count = 5) then ?

Then, do the filterbyCoverage, if necessary, but this step must be put before tileMethyCounts() not after it. If I set mincov=5 at the first step, here filterByCoverage(lo.count = 5) has the same function, right ?


filtered_tiles = filterByCoverage(myobj, lo.count = 5)        ##  5 or bigger value

And then, do tileMethylCounts()

tiles = tileMethylCounts(filtered_tiles,
                               win.size = 500, 
                               step.size = 500, 
                               mc.cores = 30,
                               cov.bases = 5)

A small question: here I set win.size = 500, step.size = 500, is that ok? Because I want a 500bp window. But I don’t know the difference between win.size and step.size.

### In the end
meth_all_tiles = methylKit::unite(tiles, mc.core=30)

Finally, I understand the cov.bases(), it means the number of CpG sites in a specific window (tile region). So there must be more than 5 CpG sites but not how many reads each Coverage contains and not the summed Coverage in a tile region.

I would appreciate it if you give me a response for my questions above. Thanks in advance.

Best, yi

alexg9010 commented 1 year ago

Hi @DengEr-1993

mincov means the least coverage reads in every sample is 3 , not in all samples simutaneously, is that right ?

it is the minimum read coverage to be included in the methylKit objects. Any methylated base/region in the text files below the mincov value will be ignored. this holds true for every sample.

By the way, if I use mincov = 5 here, or 3 as you said, then if it is not necessary for me to do filterByCoverage (myobj, lo.count = 5) then ?

not necessarily. You might need to include this step when reading tabix files though, since they are not filtered but directly loaded.

If I set mincov=5 at the first step, here filterByCoverage(lo.count = 5) has the same function, right ?

yes.

A small question: here I set win.size = 500, step.size = 500, is that ok? Because I want a 500bp window. But I don’t know the difference between win.size and step.size.

yes, this should work. win.size sets the size of each window, step.size sets how far the next window will be shifted relative to previous one. if step.size is smaller than win.size you will get overlapping tiles .

Finally, I understand the cov.bases(), it means the number of CpG sites in a specific window (tile region). So there must be more than 5 CpG sites but not how many reads each Coverage contains and not the summed Coverage in a tile region.

correct.

DengEr-1993 commented 1 year ago

Hi @DengEr-1993

mincov means the least coverage reads in every sample is 3 , not in all samples simutaneously, is that right ?

it is the minimum read coverage to be included in the methylKit objects. Any methylated base/region in the text files below the mincov value will be ignored. this holds true for every sample.

By the way, if I use mincov = 5 here, or 3 as you said, then if it is not necessary for me to do filterByCoverage (myobj, lo.count = 5) then ?

not necessarily. You might need to include this step when reading tabix files though, since they are not filtered but directly loaded.

If I set mincov=5 at the first step, here filterByCoverage(lo.count = 5) has the same function, right ?

yes.

A small question: here I set win.size = 500, step.size = 500, is that ok? Because I want a 500bp window. But I don’t know the difference between win.size and step.size.

yes, this should work. win.size sets the size of each window, step.size sets how far the next window will be shifted relative to previous one. if step.size is smaller than win.size you will get overlapping tiles .

Finally, I understand the cov.bases(), it means the number of CpG sites in a specific window (tile region). So there must be more than 5 CpG sites but not how many reads each Coverage contains and not the summed Coverage in a tile region.

correct.

@alexg9010 , thank you very much. I have understood most key points.

You helped me a lot.

Thank you for you quick response.

Have a nice day.

Best, xiangyi