al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
203 stars 96 forks source link

min.per.group in unite() --> NA values appear --> row counts = NULL #297

Closed mirurb closed 1 year ago

mirurb commented 1 year ago

Hello,

Sorry to bother again :). I have another question regarding NA values that appear after unite(). When setting min.per.group in unite(), many NA values appear. And when taking the row counts of this methylBase object (i.e. to see the number of unique CpG sites after destranding), the answer is NULL. Please see the screenshot below.

Is this because the NA values interfere with the row counts? Should these NAs be filtered out from the methylBase object? If yes, how would you do that? Finally, will having NAs affect further analysis, and therefore, be filtered out for all further analysis?

Here is my script:

library(methylKit)

file.list=list("UI-3063-Ml-LH-F14_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-F17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-F18_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-F19_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-F21_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-M16_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-M17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-M18_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-M21_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LH-M22_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-F14_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-F15_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-F16_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-F17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-F21_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-M14_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-M15_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-M16_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-M17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-LL-M21_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-F10_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-F17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-F18_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-F20_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-F23_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-M11_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-M15_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-M17_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-M21_bismark_CX_report_FINAL.txt", "UI-3063-Ml-SH-M23_bismark_CX_report_FINAL.txt")

myobj <- methRead(file.list, sample.id = list("LH_F14", "LH_F17", "LH_F18", "LH_F19", "LH_F21", "LH_M16", "LH_M17", "LH_M18", "LH_M21", "LH_M22", "LL_F14", "LL_F15", "LL_F16", "LL_F17", "LL_F21", "LL_M14", "LL_M15", "LL_M16", "LL_M17", "LL_M21", "SH_F10", "SH_F17", "SH_F18", "SH_F20", "SH_F23", "SH_M11", "SH_M15", "SH_M17", "SH_M21", "SH_M23"), assembly = "M_longipes_genome", dbtype = "tabix", pipeline = "bismarkCytosineReport", header = TRUE, sep = "\t", context = "CpG", resolution = "base", treatment = c(2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0), dbdir = "CpG_methylDB", mincov = 1) saveRDS(myobj,file="CpGobject.RDS") myobj <- readRDS("CpGobject.RDS")

meth=unite(myobj, destrand=TRUE, min.per.group = 1L) saveRDS(meth,file="meth_CpGobject.RDS") meth <- readRDS("meth_CpGobject.RDS") head(meth) nrow(meth)

Screenshot 2023-08-28 at 14 31 40

Thank you for your help!

Best wishes, Mirjam

alexg9010 commented 1 year ago

Hi @mirurb,

First related to your nrow issue: With our example data I can reproduce your error when running nrow on a methylBaseDB object (with Tabix file), but for methylBase objects (no Tabix file) this works.

Is this because the NA values interfere with the row counts?

The reason it works for the non-tabix objects is that they are saved in memory as data.frame, so nrow simply works, while the tabix-based object is not a data.frame and only loaded on demand. That might cause nrow to fail with NULL.

Concerning the NA's in your data, they occur because you used min.per.group=1L during uniting, which retains all regions covered by at least on sample per group, but any sample not covering a region will have NA values for this region.

Should these NAs be filtered out from the methylBase object? If yes, how would you do that?

Generally, I would suggest keeping only regions where all samples are covered and only playing with min.per.group if required. If you are using methylBase objects (no tabix file), you can simply use na.omit() on the object to filter NA's.
For a methylBaseDB object (with tabix file) to keep only regions where all samples are covered, you may run the initial unite command with min.per.group = NULL.

Finally, will having NAs affect further analysis, and therefore, be filtered out for all further analysis?

Most of our downstream functions deal with NA values by removing any row containing NA. This could potentially lead to downstream issues if your data contains a sample with NA in any row or if there are too few rows left after removing NA's.

Best Alex

mirurb commented 1 year ago

Hello @alexg9010, Thank you for the detailed explanation! Best, Mirjam