al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
203 stars 96 forks source link

Data.frame error during differential methylation with covariate. #304

Closed desmodus1984 closed 6 months ago

desmodus1984 commented 11 months ago

Hi,

I am analyzing 44 samples that have been filtered and normalized, and the unite file was saved in RDS format in R. I have been previously capable of running the differential methylation after saving and reading the RDS file in R, but now, I have added covariates as a cvs file, I got the message that the analysis is running:

covariates <- read.csv("ELE-State.csv")
covariates$State<- as.character(covariates$State)

I got the message that the analysis is running: WCov.ELE-State <- calculateDiffMeth(meth_State, covariates = covariates,overdispersion = "MN", test ="Chisq", adjust="BH", mc.cores= 4) two groups detected: will calculate methylation difference as the difference of treatment (group: 1) - control (group: 0)

But then it crashes: "Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, : arguments imply differing number of rows: 26052521, 0, 1 Calls: calculateDiffMeth ... calculateDiffMeth -> .calculateDiffMeth -> data.frame In addition: Warning message: In mclapply(cntlist, logReg, vars, treatment = treatment, overdispersion = overdispersion, : all scheduled cores encountered errors in user code Execution halted"

I set the analysis with 4 cores, I am the only one using the node, and while checking only 2 cores are used. It works fine for a couple of hours and then it crashes. I was able to do the analysis with no covariates, but then adding them it failed. Any reason for this weird behavior?

I am coping the code which previously worked well.

Thanks;

alexg9010 commented 11 months ago

HI @desmodus1984,

It seems there is an error during the parallel execution of the logReg() function since the returned dataframe seems to be empty. tmp should contain the test results per region, but seems to be empty:

"Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, : arguments imply differing number of rows: 26052521, 0, 1

https://github.com/al2na/methylKit/blob/388770da40bde7d563224cc5d2259c2a7a6a2e08/R/diffMeth.R#L647-L660

If there is a problem during modelling, I suggest running calculateDiffMeth with only one core, to see what is causing the error.

Best, Alex

desmodus1984 commented 10 months ago

Hi,

I ran it again using DiffMeth with 2 cores and I got different error messages:

will calculate methylation difference as the difference of treatment (group: 1) - control (group: 0) Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, : arguments imply differing number of rows: 26052521, 0, 1 Calls: calculateDiffMeth ... calculateDiffMeth -> .calculateDiffMeth -> data.frame In addition: Warning message: In mclapply(cntlist, logReg, vars, treatment = treatment, overdispersion = overdispersion, : all scheduled cores encountered errors in user code Execution halted

Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, : arguments imply differing number of rows: 26052521, 0, 1 Calls: calculateDiffMeth ... calculateDiffMeth -> .calculateDiffMeth -> data.frame In addition: Warning message: In mclapply(cntlist, logReg, vars, treatment = treatment, overdispersion = overdispersion, : all scheduled cores encountered errors in user code Execution halted

Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") : missing value where TRUE/FALSE needed Calls: calculateDiffMeth ... mclapply -> lapply -> FUN -> glm.fit -> eval -> eval Execution halted

I am using Reorganize to subset samples from a unite object previously saved as rds object, and then setting it properly and then the DiffMeth doesn't work.

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Candidate

Department of Biological Sciences

Ohio University, Athens OH


From: Alexander Blume @.> Sent: Wednesday, October 18, 2023 2:16 AM To: al2na/methylKit @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Mention @.> Subject: [External] Re: [al2na/methylKit] Data.frame error during differential methylation with covariate. (Issue #304)

Use caution with links and attachments.

HI @desmodus1984https://github.com/desmodus1984,

It seems there is an error during the parallel execution of the logReg() function since the returned dataframe seems to be empty. tmp should contain the test results per region, but seems to be empty:

"Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, : arguments imply differing number of rows: 26052521, 0, 1

https://github.com/al2na/methylKit/blob/388770da40bde7d563224cc5d2259c2a7a6a2e08/R/diffMeth.R#L647-L660

If there is a problem during modelling, I suggest running calculateDiffMeth with only one core, to see what is causing the error.

Best, Alex

— Reply to this email directly, view it on GitHubhttps://github.com/al2na/methylKit/issues/304#issuecomment-1767816118, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VPT7PBBE4R5MC7G6V3X7563JAVCNFSM6AAAAAA6C3FGYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONRXHAYTMMJRHA. You are receiving this because you were mentioned.Message ID: @.***>

alexg9010 commented 10 months ago

HI @desmodus1984,

Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") : missing value where TRUE/FALSE needed

This error has already been reported before (see https://github.com/al2na/methylKit/issues/194). The problem seems to occur because of missing values (NA) in the data.

For now, the proposed solutions would be to remove rows containing missing values in your reorganized object or switching to tiling window based analysis. You could also start fresh and decrease the minimum coverage threshold for methRead to 5 or 3.

If you are able to narrow down your dataset to a small subset which reproduces the error, I could also have a look myself and try to debug this further.

Best, Alex

desmodus1984 commented 10 months ago

Hi,

Thanks for the email. How can I remove the NAs from the data matrix/frame?

Which coverage do you recommend? I was strict when building the matrix and I was using a coverage threshold of 10.

I can save the subset data as RDS file so that you can try it and test it yourself.

Thanks;

Juan Pablo Aguilar Cabezas


From: Alexander Blume @.> Sent: Monday, October 23, 2023 9:05 AM To: al2na/methylKit @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Mention @.> Subject: [External] Re: [al2na/methylKit] Data.frame error during differential methylation with covariate. (Issue #304)

Use caution with links and attachments.

HI @desmodus1984https://github.com/desmodus1984,

Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") : missing value where TRUE/FALSE needed

This error has already been reported before (see #194https://github.com/al2na/methylKit/issues/194). The problem seems to occur because of missing values (NA) in the data.

For now, the proposed solutions would be to remove rows containing missing values in your reorganized object or switching to tiling window based analysishttps://bioconductor.org/packages/devel/bioc/vignettes/methylKit/inst/doc/methylKit.html#35_Tiling_windows_analysis. You could also start fresh and decrease the minimum coverage threshold for methRead to 5 or 3.

If you are able to narrow down your dataset to a small subset which reproduces the error, I could also have a look myself and try to debug this further.

Best, Alex

— Reply to this email directly, view it on GitHubhttps://github.com/al2na/methylKit/issues/304#issuecomment-1775283268, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VNQGRPSDQ47YMMHOG3YAZ2TLAVCNFSM6AAAAAA6C3FGYCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZVGI4DGMRWHA. You are receiving this because you were mentioned.Message ID: @.***>

alexg9010 commented 10 months ago

HI @desmodus1984,

How can I remove the NAs from the data matrix/frame?

If you are using in-memory objects, you should be able to run na.omit() on that object, which should remove all rows with NA. You can check the number of NA containing rows using YOUR_OBJECT |> apply(anyNA,MARGIN = 1) |> table().

Which coverage do you recommend? I was strict when building the matrix and I was using a coverage threshold of 10.

In general, I would not recommend going below coverage of 10. Only when you have very sparse data (RRBS, cfDNA), you could lower the coverage threshold and combine this with the regional analysis.

Best, Alex

desmodus1984 commented 10 months ago

Hi,

I have WG-ez-meth-seq, and theoretically my samples have high coverage. For methread I used cov=10, and during percMethylation I get plenty of sample/sites with NAs, which makes me wonder whether using this matrix will work for downstream analysis. I am checking the code of a friend, and he used regular unite, and 4 cores, and ended up with several M sites. I processed the samples similarly, but used unitewith min.per.group=3L. Awkwardly, when I ran my analysis with methread with mincov=10, followed by filterByCoverage with lo.count=10, lo.perc=NULL, hi.count=NULL, and hi.perc=99.9, and normalizeCoveragebased on median, I end up with just 292148, instead of 25924803.

Is there any way to track how many sites are being lost during the filtering? My friend used hi.perc=99.9, which is more strict, right?

I trimmed the reads with TrimGalore as suggested by Felix Krueger https://felixkrueger.github.io/Bismark/bismark/library_types/#em-seq-neb only trimming 5 bp based on mBias plots, then mapped with BWA-meth using standard parameters, prepped the bam file with samtools, and got the methylation metrics with Methyldackel. Ex. MethylDackel extract -@ 8 --OT 0,0,0,0 --OB 0,0,0,0 --methylKit ../Bvos.fasta V00809.mrkdup.bam -o V00809_methyldackel

Did I do something wrong? Did I miss any parameters to retain methylated sites? The only thing I see I did was trimming the reads at the 5' and 3' ends; I don't know why with "regular" unite, I lose so many sites. I must say that with normalization the depth (coverage) reduces significantly. cov-change I am interpreting things well?

Thanks;

alexg9010 commented 10 months ago

Hi @desmodus1984 ,

I have WG-ez-meth-seq, and theoretically my samples have high coverage.

Is the experiment whole genome or targeted? If it is whole genome you would expect overall quite even coverage, targeted should be rather sparse but with higher coverage.

For methread I used cov=10, and during percMethylation I get plenty of sample/sites with NAs, which makes me wonder whether using this matrix will work for downstream analysis.

For differential methylation testing downstream of unite, NA rows will be removed, thus it should just work. The percMethylation matrix itself is not used downstream, so if you are doing some custom analysis/plots with it you will have to deal with NA's accordingly.

I am checking the code of a friend, and he used regular unite, and 4 cores, and ended up with several M sites. I processed the samples similarly, but used unite with min.per.group=3L.

Is there a specific reason you are changing the min.per.group? I would generally recommend using the default /"regular" setting. Are you comparing different subgroups? If yes, you may rather use the reorganize function to prepare the appropriate subset and proceed with that for every comparison.

Awkwardly, when I ran my analysis with methread with mincov=10, followed by filterByCoverage with lo.count=10, lo.perc=NULL, hi.count=NULL, and hi.perc=99.9, and normalizeCoverage based on median, I end up with just 292148, instead of 25924803.

The filtering by coverage is expected to remove spurious sites suffering from PCR duplication. Also, especially when using local alignment with bwa-meth it is expected that many reads map to repeat regions in the genome ( see https://web.archive.org/web/20230531171233/https://sequencing.qcfail.com/articles/soft-clipping-of-reads-may-add-potentially-unwanted-alignments-to-repetitive-regions/).

Is there any way to track how many sites are being lost during the filtering?

Just comparing the number of sites per object should do it, right? Filtering via lo.count=10 should not change anything, since you already set mincov=10, thus the effect will mostly be due to hi.perc=99.9.

My friend used hi.perc=99.9, which is more strict, right?

This will remove the regions having higher coverage than the 99.9th percentile of the whole distribution. This is rather conservative and according to your plots worked very well to remove outlier sites with unreasonably high coverage.

Did I do something wrong? Did I miss any parameters to retain methylated sites?

The processing steps look good and your data seems fine to me. The filtering removed spurious sites and left you with a reasonable coverage distribution. The final goal should be to select the most reliable sites for differential analysis and not forcefully retain as many sites as possible for testing.

Best Alex