al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
202 stars 96 forks source link

Issue with Unite function and large dataset #63

Open AJEinstein opened 7 years ago

AJEinstein commented 7 years ago

Hi Akalin,

I'm getting the following error when uniting my sample with the following code:

meth<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)

## Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 216308346 rows; more than 60136491 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation.  If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please
search for this error message in the FAQ, Wiki, Stack Overflow and
datatable-help for advice.

I've checked to make sure that none of my samples have duplicate chr and coordinates using the following 1 liner $cut -f 2,3 ${dir}BSTag-Sample-${SampleName}-reprep_CpG.methylKit| sort | uniq -cd

I've also tried replicating the problem with only using 2-3 samples, but I received no error. I then tried to subset my 17 samples to just the first 1,000,000 lines; this received no error. I tried again when I used the first 5 million, and also received no error. I tried 10 million, and I got the same error (albeit less duplicates mentioned). I wanted to shrink the file sizes down so I subset the file to only rows>5,000,000 and less then 10,000,000; this returned the error:

## Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 11620518 rows; more than 7736026 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation.  If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please...

I tried to subset the files even further, such as 5,000,000 < rows < 7,500,000 and 7,500,000 < rows < 10,000,000, but this did not replicate the error. So unfortunately, I'm making available some large files which replicate the error.

The following is the code that I used (using the most recent github install of methylKit):

suppressPackageStartupMessages({ library(methylKit)
})
library(RColorBrewer)
library(gplots)
library(knitr)

file.list=list("BS-TAG-Sample-77_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-78_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-79_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-80_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-81_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-82_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-83_CpG_percentFix_5000000.methylKit",
               "BSTag-Sample-84-reprep_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-85_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-86_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-87_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-88_CpG_percentFix_5000000.methylKit",
               "BSTag-Sample-89-reprep_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-90_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-91_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-92_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-93_CpG_percentFix_5000000.methylKit")

familyObj=methRead(file.list, sample.id=list("gm77", "gm78", "gm79", "gm80",
"gm81", "gm82", "gm83", "gm84", "gm85", "gm86", "gm87", "gm88",
"gm89", "gm90", "gm91","gm92", "gm93"), assembly="hg38_decoy_EBV",
treatment=rep(0,17), context="CpG", mincov = 3)

meth_debug<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)
# I've also tried
meth_debug<-unite(object = familyObj, destrand=TRUE)

I've placed the files in the following google drive (sorry about the size of the files): https://drive.google.com/drive/folders/0B1bAk0VjZmbVMURzOVJrWXVraUE?usp=sharing

Thank you for your help, Andrew

al2na commented 7 years ago

do you get the same error with destrand=TRUE?

On Wed, Mar 15, 2017 at 8:34 PM, Andrew Johnston notifications@github.com wrote:

Hi Akalin,

I'm getting the following error when uniting my sample with the following code:

meth<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)

Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 216308346 rows; more than 60136491 = nrow(x)+nrow(i).

Check for duplicate key values in i each of which join to the same group in x over and over again. If that’s ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I've checked to make sure that none of my samples have duplicate chr and coordinates using the following 1 liner $cut -f 2,3 ${dir}BSTag-Sample-${SampleName}-reprep_CpG.methylKit| sort | uniq -cd

I've also tried replicating the problem with only using 2-3 samples, but I received no error. I then tried to subset my 17 samples to just the first 1,000,000 lines; this received no error. I tried again when I used the first 5 million, and also received no error. I tried 10 million, and I got the same error (albeit less duplicates mentioned). I wanted to shrink the file sizes down so I subset the file to only rows>5,000,000 and less then 10,000,000; this returned the error:

Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 11620518 rows; more than 7736026 = nrow(x)+nrow(i).

Check for duplicate key values in i each of which join to the same group in x over and over again. If that’s ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please...

I tried to subset the files even further, such as 5,000,000 < rows < 7,500,000 and 7,500,000 < rows < 10,000,000, but this did not replicate the error. So unfortunately, I'm making available some large files which replicate the error.

The following is the code that I used (using the most recent github install of methylKit):

suppressPackageStartupMessages({ library(methylKit) }) library(RColorBrewer) library(gplots) library(knitr)

file.list=list("BS-TAG-Sample-77_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-78_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-79_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-80_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-81_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-82_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-83_CpG_percentFix_5000000.methylKit", "BSTag-Sample-84-reprep_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-85_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-86_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-87_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-88_CpG_percentFix_5000000.methylKit", "BSTag-Sample-89-reprep_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-90_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-91_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-92_CpG_percentFix_5000000.methylKit", "BS-TAG-Sample-93_CpG_percentFix_5000000.methylKit")

familyObj=methRead(file.list, sample.id=list("gm77", "gm78", "gm79", "gm80", "gm81", "gm82", "gm83", "gm84", "gm85", "gm86", "gm87", "gm88", "gm89", "gm90", "gm91","gm92", "gm93"), assembly="hg38_decoy_EBV", treatment=rep(0,17), context="CpG", mincov = 3)

meth_debug<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)

I've also tried

meth_debug<-unite(object = familyObj, destrand=TRUE)

I've placed the files in the following google drive (sorry about the size of the files): https://drive.google.com/drive/folders/0B1bAk0VjZmbVMURzOVJrWXVraUE? usp=sharing

Thank you for your help, Andrew

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/63, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm9EbqAsKN-ctsib9cUPLCH9S_JdgHvks5rmD00gaJpZM4MebJy .

AJEinstein commented 7 years ago

I ran every unite command with destrand=TRUE. Would you like me to try destrand=FALSE? or am I missing something here?

al2na commented 7 years ago

yes, could you do that? Also make sure the files are tab separated, otherwise the text you are running for uniqueness will not work. If you don't get an error with destrand=FALSE, the duplication might be introduced internally and could be a bug. If you still get the error with destrand=FALSE, setting by=.EACHI might help but that has to be set by us in the function. you can't pass it as an argument probably

On Wed, Mar 15, 2017 at 11:29 PM, Andrew Johnston notifications@github.com wrote:

I ran every unite command with destrand=TRUE. Would you like me to try destrand=FALSE? or am I missing something here?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/63#issuecomment-286899951, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm9EST6iXqcZbMoUHv2hoRh10hIGv-Zks5rmGY_gaJpZM4MebJy .

AJEinstein commented 7 years ago

Sorry for the delay, HPC was shutdown for maintenance for a few days. The files are all tab separated. I am still getting the error when destrand=F. I haven't tried setting by=.EACHI, since you said that it will have to be set by you in the function. If you think it'll help, of course I'll try it with destrand=F. I tried with destrand = T, and it did not work.

al2na commented 7 years ago

trying with destrand=F will help us understand where the bug might be coming from, or if it is on our side or data.table. If you could try it and let me know that would be great

Best, Altuna

On Wed, Mar 22, 2017 at 3:17 PM, Andrew Johnston notifications@github.com wrote:

Sorry for the delay, HPC was shutdown for maintenance for a few days. The files are all tab separated. I am still getting the error when destrand=F. I haven't tried setting by=.EACHI, since you said that it will have to be set by you in the function. If you think it'll help, of course I'll try it with destrand=F. I tried with destrand = T, and it did not work.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/63#issuecomment-288412205, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm9EX0KO3XTm7QsM345gthUdbq0mGQ6ks5roS1dgaJpZM4MebJy .

AJEinstein commented 7 years ago

Hey Altuna,

Sorry for being lazy haha. I did run it as:

meth_debug<-unite(object = familyObj, destrand=FALSE, by=.EACHI) I get the same error unfortunately.

AJEinstein commented 7 years ago

Hey Altuna,

Any other suggestions for how I can help to debug?

Thanks, Andrew

al2na commented 7 years ago

If this is a data.table error and subsetting the data to smaller chunks works, you can try to use methylDB objects, these are tabix files essentially. during the unite function, the objects are processed chromosome by chromosome. If you get the error there, then there might be really a duplicate, but this is unlikely. Using methylDB objects might just work for you

AJEinstein commented 7 years ago

Thanks so much for the help. Of course it was something seemingly innocuous.

Turns out there were commented out lines that preceded each chromosome, removed them and it worked!

KristenKrolick commented 1 year ago

@AJEinstein , I am having the same problem! Can you please describe how you removed each of the commented out lines? Will the .bgz and .bgz.tbi files correspond correctly with the .txt file anymore if I remove these lines? And were you referring to these lines, that begin with a #?(I pasted my example below).

Date:2022-10-12 22:46:33

Class:methylRawDB

DT:tabix

SI:MRS03

AS:hg18

CT:CpG

RS:base

NR:9945587

DT:tabix

1 16244 16244 - 15 12 3 1 17378 17378 + 21 21 0 1 17406 17406 + 18 17 1 1 17452 17452 + 27 27 0 1 17453 17453 - 18 18 0 1 17478 17478 + 27 27 0 1 17479 17479 - 24 24 0 1 17483 17483 + 30 30 0 1 17484 17484 - 25 25 0 1 17492 17492 + 31 31 0 1 17493 17493 - 27 27 0

alexg9010 commented 1 year ago

Hi @krof7d,

the tabix header (your example) you are referring to was introduced in a later version of methylKit. Could you please provide more details and a reproducible example?

KristenKrolick commented 1 year ago

@alexg9010

Thank you, I see. Well I did try to delete the first 9 commented-out lines from every .txt file anyways, and I still received the same error message. The error message being:

"Reading file. Reading file. Reading file. uniting... Error in vecseq(f, len, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 19220976 rows; more than 18504185 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. Calls: unite ... merge -> merge.data.table -> [ -> [.data.table -> vecseq Execution halted"

I have tried the unite function every which way (with min.per and without it, and with destrand and without it as well).

See below for all of my details:

I am using bam sorted and indexed files generated from bismark aligner. The data comes from whole-genome enzymatic rxn methylation sequencing. I have the 3 basic scripts I have performed on all 170 of my files, the 3rd script being the problem where I can not get the unite function to run.

  1. First I used processbismarkaln:

    !/bin/env Rscript

file.list <- list( "/data/methylseq_bam.sort.bai/OPD01-033_run499.sort.bam", "/data/methylseq_bam.sort.bai/OPD01-033.sort.bam", "/data/methylseq_bam.sort.bai/OPD01-037_run499.sort.bam", "/data/methylseq_bam.sort.bai/OPD01-037.sort.bam", # etc ... all files listed out here...

library(methylKit) objs = processBismarkAln( location = file.list, sample.id = list( "OPD01-033_run499", "OPD01-033", "OPD01-037_run499", "OPD01-037",

etc... all sample ids listed out here),

assembly="hg38", save.folder="/data/methylseq_bam.sort.bai/processbismarkaln", save.context="CpG", nolap=FALSE, mincov=10, minqual=20, treatment=c( 0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1))

  1. Next I read in these files in order to save them as flat file databases (and originally because I was doing about 50 files at a time as I was awaiting for bam files to be preprocessed):

!/bin/env Rscript

library(methylKit)

file.list <- list( "/data/methylseq_bam.sort.bai/methylRaw_objects/MRS03_CpG.txt", "/data/methylseq_bam.sort.bai/methylRaw_objects/MRS04_CpG.txt", "/data/methylseq_bam.sort.bai/methylRaw_objects/MS121_CpG.txt", "/data/methylseq_bam.sort.bai/methylRaw_objects/MS125_CpG.txt",

etc...)

myobjDB = methRead( file.list, sample.id = list( "MRS03", "MRS04", "MS121", "MS125", "MS129",

etc...

), assembly = "hg18", treatment = c( 1, 0, 1,

etc... ),

context = "CpG", dbtype = "tabix", dbdir = "methylDB" )

  1. So lastly, when I finally had the .txt, .txt.bgz, and .txt.bgz.tbi files for all of my samples needed, I read-them all into memory again, and tried uniting them: (w/ unite never working! [that is unless I test run it with 4 of my samples bc it works just fine then]).

!/bin/env Rscript

library(methylKit)

file.list <- list( "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS04.txt", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS121.txt", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS125.txt", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS129.txt",

etc....

")

myobj = methRead( file.list, sample.id = list( "MRS03", "MRS04", "MS121", "MS125", "MS129", "MS130", "OPD01-033", "OPD01-033_run499", "OPD01-036", "OPD01-037",

etc...

), assembly = "hg18", treatment = c( 1, 0, 1, 0,

etc....),

context = "CpG")

length(myobj)

meth=unite(myobj, allow.cartesian=TRUE, destrand=FALSE)

myDiff=calculateDiffMeth(meth, mc.cores=2)

myDiff10p.hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")

myDiff10p.hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")

myDiff10p=getMethylDiff(myDiff,difference=10,qvalue=0.05)

----------------------------------------------------Again, here is the error message I get : "Reading file. Reading file. Reading file. uniting... Error in vecseq(f, len, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 19220976 rows; more than 18504185 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. Calls: unite ... merge -> merge.data.table -> [ -> [.data.table -> vecseq Execution halted"

alexg9010 commented 1 year ago

Hi @krof7d,

In your third script where you want to unite the samples, you are supposed to read in the files with the .txt.bgz extension as these are the bgizp compressed tabix files. The commented-out lines are the tabix header where we capture object details.

I quickly checked with some samples and unfortunately, we do not break on loading the uncompressed tabix files. Even uniting the samples worked in my case. However, when printing the objects I saw that the content did not make any sense compared to the contents of the tabix file.

file.list <-
    list.files( [..],
        pattern = ".txt$"
        full.names = TRUE
    )
sample.list <- gsub(".deduped.*", "", basename(file.list))

methlist <- methRead(
    location = as.list(file.list),
    sample.id = as.list(sample.list),
    assembly = "hg38",
    context = "CpG",
    mincov = 5,
    treatment = rep(0, length(file.list))
)

> methlist[[2]]  # reading txt files -- > wrong
methylRaw object with 5993 rows
--------------
  chr start end strand coverage numCs numTs
1  33    33  33      +      370    22  1347
2  34    34  34      +      270    73   656
3  61    61  61      +      457    14  2075
4  62    62  62      +      369    77  1284
5  78    78  78      +      493    25  2406
6  79    79  79      +      420    80  1684
--------------
sample.id: FDLM220140374-1a_HJ7TTDSX3_L4 
assembly: hg38
context: CpG 
resolution: base 

methSingle <- methRead(
    location = "[...]/FDLM220140374-1a_HJ7TTDSX3_L4.deduped_CpG.txt.bgz",
    sample.id = "samp1",
    assembly = "AnoSag2.1",
    context = "CpG",
    mincov = 5,
    dbtype = "tabix",     # this is important! 
    treatment = rep(0, length(file.list))
)

> methSingle # reading txt.bgz files -- > right
methylRawDB object with 5994 rows
--------------
   chr  start    end strand coverage numCs numTs
1 chr1 629545 629545      +       10     1     9
2 chr1 629596 629596      +       24     4    20
3 chr1 629609 629609      +       30     2    28
4 chr1 629633 629633      +       35     3    32
5 chr1 629660 629660      +       52     2    50
6 chr1 629701 629701      +       83     4    79
--------------
sample.id: FDLM220140374-1a_HJ7TTDSX3_L4.deduped_CpG 
assembly: hg38 
context: CpG 
resolution: base 
dbtype: tabix 

Thus to fix your unite script, please use the files with the ".txt.bgz" extension.

KristenKrolick commented 1 year ago

@alexg9010 Okay, I tried using the .txt.bgz files and this error message was returned:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘unite’ for signature ‘"NULL"’ Calls: unite -> Execution halted

Plus, when I just look at the length of my object, it says zero:

Received list of locations.

length(myobj) [1] 0

So there is definitely something wrong with the .txt.bgz files? Should I just try rerunning everything from the beginning? But processbismarkaln takes a really long time, so I do not have the time to rerun that one from the beginning again.

alexg9010 commented 1 year ago

Hi @krof7d,

This is weird, could you please show the code that you used? Also, could you please check that the tabix files actually still exist?

Best, Alex

KristenKrolick commented 1 year ago

@alexg9010

Okay!

So when I go to unzip, for example, my MRS03.txt.bgz with 7-zip and view it next to my MRS03.txt in notepad, they both look the exact same to me: MRS03

And how should I check if my tabix still exist, please?

And here is the code I used:

!/bin/env Rscript

setwd("/data/methylseq_bam.sort.bai/processbismarkaln/methylDB")

library(methylKit)

file.list <- list( "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS04.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS121.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS125.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS129.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS130.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/OPD01-033.txt.bgz", etc....)

myobj = methRead( file.list, sample.id = list( "MRS03", "MRS04", "MS121", "MS125", etc.... ), assembly = "hg18", treatment = c( 1, 0, 1, 1, 1, etc....), context = "CpG")

length(myobj)

meth=unite(myobj, allow.cartesian=TRUE, destrand=FALSE)

myDiff=calculateDiffMeth(meth, mc.cores=2)

myDiff10p.hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")

myDiff10p.hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")

myDiff10p=getMethylDiff(myDiff,difference=10,qvalue=0.05)

alexg9010 commented 1 year ago

In your call to methRead, did you include the argument dbtype = "tabix" ?

KristenKrolick commented 1 year ago

@alexg9010

Okay, sorry about that, so I have now included dbtype = "tabix" into the same exact code from above, and this was my error message: Received list of locations. uniting...

checking wether tabix file already exists: /data/methylseq_bam.sort.bai/processbismarkaln/methylDB/./methylBase_7b627cade745.txt.bgz tabix file is new. continuing now ...

merging tabix files ... compressing the file with bgzip... making tabix index... Error in value[3L] : no base were united. try adjusting 'min.per.group'. Calls: unite ... tryCatch -> tryCatchList -> tryCatchOne -> Execution halted

So then I tried the same exact code from above (w/ the min.per group this time, plus I also added destrand=TRUE because I did not realize I still had it set to false). So here is what the code looked like:

!/bin/env Rscript

setwd("/data/methylseq_bam.sort.bai/processbismarkaln/methylDB")

library(methylKit)

file.list <- list( "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS04.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS121.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS125.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS129.txt.bgz", etc....)

myobj = methRead( file.list, sample.id = list( "MRS03", "MRS04", "MS121", "MS125", "MS129", etc... ), assembly = "hg18", treatment = c( 1, 0, 1, 1, 1, 0, 0, etc....), context = "CpG", dbtype = "tabix")

length(myobj)

meth=unite(myobj, allow.cartesian=TRUE, destrand=FALSE)

meth.min=unite(myobj,min.per.group=1L, destrand=TRUE)

myDiff=calculateDiffMeth(meth, mc.cores=2)

myDiff10p.hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")

myDiff10p.hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")

myDiff10p=getMethylDiff(myDiff,difference=10,qvalue=0.05)

And here was the error message I received: Received list of locations. destranding... uniting...

checking wether tabix file already exists: /data/VidyaCLab/methylseq_bam.sort.bai/processbismarkaln/methylDB/./methylBase_38441b977501.txt.bgz tabix file is new. continuing now ...

merging tabix files ... compressing the file with bgzip... making tabix index... filter mincov per group ... Error in getTabixByChunk(tbxFile, chunk.size = NULL, return.type = "data.frame") : the tabix file seems to be empty. stopping here. Calls: unite ... applyTbxByChunk -> lapply -> FUN -> getTabixByChunk Execution halted

Here was the output of length(myobj) [1] 165

(so the myobj seemed to be successful).

And it also looked like a methylbase file was starting to be made-- see pic pasted below. methylbase

Here is what the methylbase file looked like in notepad after I unzipped it: methylbaseopened

I also tried running the same exact code from above in this same comment with destrand = false. That job quickly did not work, and here was my error message: Received list of locations. uniting...

checking wether tabix file already exists: /data/VidyaCLab/methylseq_bam.sort.bai/processbismarkaln/methylDB/./methylBase_3da4463c424c.txt.bgz tabix file is new. continuing now ...

merging tabix files ... compressing the file with bgzip... making tabix index... filter mincov per group ... Error in getTabixByChunk(tbxFile, chunk.size = NULL, return.type = "data.frame") : the tabix file seems to be empty. stopping here. Calls: unite ... applyTbxByChunk -> lapply -> FUN -> getTabixByChunk Execution halted

alexg9010 commented 1 year ago

The error messages seem to indicate that there are no CpGs overlapping between any two samples. But this is quite unexpected with this large number of samples.

In general, this could be due low coverage, because in the methRead function the default minimum coverage ( min.cov ) is 10 reads per base, thus all bases with lower coverage are omitted. Setting this to a lower value should increase the number of bases.

How many CpGs are roughly covered per Sample and how does the coverage distribution of your samples look like using the function getCoveragestats ?

KristenKrolick commented 1 year ago

@alexg9010 Sorry for the short delay. I went to read in my .txt.bgz files just as we had it working before (I'll paste the code below) so that I could next use the getstats function as you suggested, but it won't read in the files saying some sort of error that they are not bgzipped now? Did something happen with the methylKit code?

Here was the error message that was returned: Received list of locations. Error in value[3L] : file does not appear to be bgzip'd file: /data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz

Once I get this part working again. I will definitely check on the methylation stats and get back to you. Thank you so much for your help.

And here was my code (same as we had finally working from last comments)

Here was the code I used: library(methylKit)

file.list <- list( "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS04.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS121.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS125.txt.bgz", "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MS129.txt.bgz", etc...)

myobj = methRead( file.list, sample.id = list( "MRS03", "MRS04", "MS121", "MS125", "MS129", "MS130", etc... ), assembly = "hg18", treatment = c( 1, 0, 1, 1, 1, etc...), context = "CpG", dbtype = "tabix", mincov = 10)

alexg9010 commented 1 year ago

Hi @krof7d,

Sorry for the belated reply. The error occurs when trying to index the tabix file and indicates that the corresponding file is not compressed using bgzip. I quickly checked and you should be able to fix the problem by compressing the file again using Rsamtools::bgzip(). You may want to uncompress the file before and then run above command on the uncompressed file to maintain the correct extensions, since the function will add ".bgz" as extension.

KristenKrolick commented 1 year ago

@alexg9010

My files are bgzipped. I don't get if you were saying 1) you knew my files were bgzipped already, and that I had to uncompress then compress them again anyways to get it to work OR 2) you thought my files were not bgzipped but instead compressed a different way??

Remember, the same exact methread command/code that I had run before and had worked on the same exact .txt.bgz files, is not working anymore. Nothing changed with my .txt.bgz files. Are you sure there is not something wrong with the methylkit code now? For example, I noticed that you raised a ticket: https://github.com/al2na/methylKit/issues/272 to verify that the file is tabix-- are you sure something was not messed up with the methylkit code in doing so?

I did perform re-bgzipping from the .txt files anyways (if this is what you meant-- I couldn't find an uncompress option using bgzip). Here was how I did it:

_library(Rsamtools) all.files <- list.files("/data/methylseqbam.sort.bai/processbismarkaln/methylDB", pattern = "*.txt$", full.names = TRUE) str(all.files) for (i in all.files) {bgzip(i, overwrite=TRUE)}

Then, when I use the newly bgzipped files, I am still getting the same exact error/having the same exact problem as my last comment. (the code would be the exact same as my last comment).

So therefore, I cannot even read in files anymore to get a myobj to be checking the coverage stats for figuring out why the unite function doesn't work for me (Re: your comment from 7 days ago).

Thanks, Kristen

alexg9010 commented 1 year ago

Hi @krof7d,

The error message you posted before indicates something wrong with the compression of the file/data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz.

Here was the error message that was returned: Received list of locations. Error in value[[3L]]: file does not appear to be bgzip'd file: /data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz

My first assumption was that it got only gzipped or not compressed at all, as a side effect of the unzipping and checking with notepad. That is why I asked you to re-compress, however it might be a different issue.

Okay, Could you please paste here the output of the following command:

sapply(list.files("/data/methylseq_bam.sort.bai/",pattern = "MRS03", recursive = TRUE), function(x) print(file(x)))
KristenKrolick commented 1 year ago

@alexg9010 Nevermind, the methread command is working again for me! I think it was due to me switching between R studio on my laptop and my work's cluster network when I need to actually submit the job but then not changing the file location correctly in my file.list command:

I will continue to work on checking the methylation stats now, and let you know what they are so we can figure out why the unite function is not working for me.

I'll leave the comment I had previously, before I realized what was wrong, under the dotted line below, if for some reason you still care to read output of sapply:


  1. Here is the returned output using: sapply(list.files("/data/methylseq_bam.sort.bai/",pattern = "MRS03", recursive = TRUE), function(x) print(file(x))) Routput3 Routput4 Routput5

  2. I am not sure you wanted all of the returned output from other folders I had containing the same sample names, so here is me doing it specifying all the way to the specific methylDB folder containing the specific files I would use in the methread command: Here is the returned output using: sapply(list.files("/data/methylseq_bam.sort.bai/processbismarkaln/methylDB",pattern = "MRS03", recursive = TRUE), function(x) print(file(x))) Routput2

Not sure why the 'class' column changes from "file" to "gzfile" for the same file, /data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz, between 1 & 2.

alexg9010 commented 1 year ago

Thanks for providing this update!

KristenKrolick commented 1 year ago

@alexg9010 Alex, An update on this: I finally got the unite function to work! It seems half of the bam files I was given were misaligned this whole time. It is weird, because the coverage seemed to have nothing to do with it (like they could have said 80% coverage, but when I actually viewed the bam file with samtools it was aligned non-directionally when it should have been aligned directionally). So, I re-aligned ~80 of my bam files so far, and the ones aligned directionally all worked in methylkit unite, while the ones which were aligned non-directionally did not unite in methylkit. Just something to watch out for, I am guessing that even though they said 80% coverage, there must have been a lot of false alignment. Thank you for your help! Kristen

alexg9010 commented 1 year ago

Hi @KristenKrolick, Great that it works now. Thanks for bringing up the point with the directionality, I guess this is something we should follow up on in the future, given that the default 'directional' assumption may be altered with newer sequencing protocols (e.g. pbat). Until this is resolved, the best option is possibly to use alternative methylation callers (e.g. bismark_methylation_extractor or methylDackel ) and operate on the resulting output files. Best, Alex