angieyen / ChromDiff

ChromDiff program as described in Yen and Kellis, Nature Communications 2015.
http://compbio.mit.edu/ChromDiff
GNU General Public License v3.0
11 stars 2 forks source link

get_residuals error: Total counts of E018 does not agree with previous counts #2

Closed dyndna closed 8 years ago

dyndna commented 8 years ago

Hi Angie,

I am trying to run chromdiff and follow your steps using roadmap data. When I run command, ./notes_v2.sh, it outputs error with regards to get_residual function. I believe I've all software requirements as you specified

[1] "Reading data file for E024..."
[1] "Loading rdata file for E020..."
[1] "Reading data file for E020..."
[1] "Loading rdata file for E019..."
[1] "Reading data file for E019..."
[1] "Loading rdata file for E018..."
Error in get_residuals(model, metric, property, no_covariate_correction) : 
  Total counts of E018 does not agree with previous counts
Calls: comp.groups -> get_residuals
Execution halted

Thanks

angieyen commented 8 years ago

Hi,

Can you check that the counts files were generated correctly? If you run the command wc -l perc/core_gencode_v10/numsonly/*_counts.txt from the home directory, you can check that all the files have the same number of lines. (In my folder, they are all 299,025 lines long.) If you run the command head -n 1 perc/core_gencode_v10/numsonly/*_counts.txt you can see the first line of each file, which should be the number "11,322". Specifically, if the pre-processing was done correctly, these files should all be identical. In other words, running diff -q perc/core_gencode_v10/numsonly/E024_counts.txt perc/core_gencode_v10/numsonly/E018_counts.txt should give you no output. If it instead prints "Files perc/core_gencode_v10/numsonly/E024_counts.txt and perc/core_gencode_v10/numsonly/E018_counts.txt differ" something went wrong at an earlier pre-processing step. (Perhaps the script was killed before it finished?)

Thanks, Angela

dyndna commented 8 years ago

Hi Angela,

Thanks and I checked using wc -l, head -n1. Both commands output same as what you wrote but diff command says files are different. This is true for E018 and remaining of 127 coreMarks files.

Here is debug info for notes_v2.sh Does a command looks ok?

+ Rscript test_args_v2.R sex Female Male wilcox fdr core_gencode_v10 data/final_celltype_metadata.txt data/gencode_genes_full.txt data/cov.mat.txt data/map_vars_covariates.txt data/57epigenomes.RPKM.pc data/core_annotation.txt gencode_v10
Read 19935 items
[1] "Getting residuals..."
Read 299025 items
[1] "Loading celltype data from each file..."
[1] "Loading rdata file for E017..."
[1] "Reading data file for E017..."
[1] "Loading rdata file for E002..."
[1] "Reading data file for E002..."
[1] "Loading rdata file for E008..."
[1] "Reading data file for E008..."
[1] "Loading rdata file for E001..."
[1] "Reading data file for E001..."
[1] "Loading rdata file for E015..."
[1] "Reading data file for E015..."
[1] "Loading rdata file for E014..."
[1] "Reading data file for E014..."
[1] "Loading rdata file for E016..."
[1] "Reading data file for E016..."
[1] "Loading rdata file for E003..."
[1] "Reading data file for E003..."
[1] "Loading rdata file for E024..."
[1] "Reading data file for E024..."
[1] "Loading rdata file for E020..."
[1] "Reading data file for E020..."
[1] "Loading rdata file for E019..."
[1] "Reading data file for E019..."
[1] "Loading rdata file for E018..."
Error in get_residuals(model, metric, property, no_covariate_correction) : 
  Total counts of E018 does not agree with previous counts
Calls: comp.groups -> get_residuals
Execution halted
angieyen commented 8 years ago

It seems unusual that it is loading the E018 rdata file when it gives an error, even though for the other samples, it loads the data (unsuccessfully) and then reads the data from the original data file.

Perhaps you could delete the corresponding rdata files to start from a "clean" slate? Perhaps an rdata file got saved from an earlier (problematic) run of the code, but is being loaded in and causing problems now...

rm -r rdata/core_gencodev10/perc/*

Additionally, is it only the E018 file differs from the rest? If so, it seems like some error must have happened while processing that data in Step 2: ## Step 2: Calculate feature values ./calculate_all_raw_features.sh $metadatafile $statecalls_label $generegions_label $states_info

Does everything look okay with the E018 data you downloaded? ls -l statecalls/core/E018_15_coreMarks_mnemonics.bed.gz

Did you get any errors while running the earlier steps? If E018 is the only problem, perhaps it would be worth it to delete those files and re-run Step 2. ./calculate_raw_features.sh E018 statecalls/core/E018_15_coreMarks_mnemonics.bed.gz core gencode_v10 data/core_annotation.txt

Unfortunately, this is not an error I got, and I can't seem to be able to reproduce it with the data, so it seems like it was likely either a problem with the data that was downloaded or a problem with an earlier processing step getting killed...

angieyen commented 8 years ago

If you want to compare what the E018 files should look like, you can see them here: https://www.dropbox.com/sh/2tgp6h3gljqsje5/AADajREk2sOxCJSdkWRlo2bwa?dl=0 They should match the files in perc/core_gencodev10 and perc/core_gencodev10/numsonly

dyndna commented 8 years ago

Hi Angela,

Thanks for taking time to debug this. Looks like I am unable to run example data despite following your replies line by line. I will start from scratch again in debug mode to see if I can run chormdiff.

angieyen commented 8 years ago

Hi Samir,  A good way to test would be to first try with a couple of the samples rather than all 127. My instinct is this would work, since at least 12 of the samples seem to be fine, before E008 is giving an error. This would just require updating the cell type metadata file for a subset of the samples and then running as before. If you want to try that, I can help you set it up.Did you compare your E008 files to the ones I posted?  Best,Angela

On Sun, Apr 17, 2016 at 12:59 PM -0700, "Samir B. Amin" notifications@github.com wrote:

Hi Angela,

Thanks for taking time to debug this. Looks like I am unable to run example data despite following your replies line by line. I will start from scratch again in debug mode to see if I can run chormdiff.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub

dyndna commented 8 years ago

Hi Angela,

Looks like something went wrong with partial run and when I rerun from scratch, it worked unti very end and returned this error. Looks like error with GSEA function.

Generating clustered heatmap...
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
null device 
          1 
[1] "Making Dominant chromatin state plot..."
[1] "Writing to plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sigfeats_feats_clust_vert_ints_domstate.rdata"
[1] "Making Gene expression differences plot..."
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
Error in gzfile(file, "wb") : cannot open the connection
Calls: get_msigdata -> save -> gzfile
In addition: Warning message:
In gzfile(file, "wb") :
  cannot open compressed file '~/myscratch/cdiff/ChromDiff/msigdb/rdata/c2_c5.symbols.rdata', probable reason 'No such file or directory'
Execution halted

Also, I see those warnings and will debug R function for more. See if you can spot missing file error.

Thanks Samir

dyndna commented 8 years ago

Here is some debug on get_msigdata R function. Looks like there is missing msigdata object.

debug: matched.genes = convert(gencode.genes, idtype)
debug: final.genes = remove.extra(matched.genes)
debug: print("Getting enrichments...")
[1] "Getting enrichments..."
debug: gene.universe = get_gene.universe(idtype)
debug: results = get_enrichments(msigdata, final.genes, gene.universe)
Error in apply(msigdata, 1, function(msigdbgroup) { :
  object 'msigdata' not found
Calls: get_msigdata -> get_enrichments -> apply
debug: if (file.exists(rdatafile)) {
    load(rdatafile)
} else {
    msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset,
        ".all.v4.0.", idtype, ".gmt")
    sprintf("msigfile is at %s, rdatafile is at %s", msigfile,
        rdatafile)
    numcols = count.fields(msigfile)
    msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols))
    save(msigdata, file = rdatafile)
}
debug: msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset, ".all.v4.0.",
    idtype, ".gmt")
debug: sprintf("msigfile is at %s, rdatafile is at %s", msigfile, rdatafile)
debug: numcols = count.fields(msigfile)
debug: msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols))
debug: save(msigdata, file = rdatafile)
Error in gzfile(file, "wb") : cannot open the connection
Calls: get_msigdata -> save -> gzfile
In addition: Warning message:
In gzfile(file, "wb") :
  cannot open compressed file '~/myscratch/cdiff/ChromDiff/msigdb/rdata/c2_c5.symbols.rdata', probable reason 'No such file or directory'
Execution halted
angieyen commented 8 years ago

Hi Samir,

ChromDiff looks for a saved rdata file, and if its not available, it should load it from the original MSigDB information.

Can you tell me if the folder ~/myscratch/cdiff/ChromDiff/msigdb/rdata exists? If so, what files are in it? Similarly, does the folder ~/myscratch/cdiff/ChromDiff/msigdb/genesets exists? If so, does it contain a file named "c2_c5.all.v4.0.symbols.gmt", as shown here: https://github.com/angieyen/ChromDiff/tree/master/msigdb/genesets

My guess is the problem is that the "rdata" folder does not exist, so when it tries to save the rdata file, R gives an error. Creating an rdata folder should fix the problem, and if that's the case, I can definitely quickly deploy a fix in ChromDiff to first check if the folder exists, and create it if it does not.

Thanks! Angela

On Tue, Apr 19, 2016 at 2:24 PM, Samir B. Amin notifications@github.com wrote:

Here is some debug on get_msigdata R function. Looks like there is missing msigdata object.

debug: matched.genes = convert(gencode.genes, idtype) debug: final.genes = remove.extra(matched.genes) debug: print("Getting enrichments...") [1] "Getting enrichments..." debug: gene.universe = get_gene.universe(idtype) debug: results = get_enrichments(msigdata, final.genes, gene.universe) Error in apply(msigdata, 1, function(msigdbgroup) { : object 'msigdata' not found Calls: get_msigdata -> get_enrichments -> apply debug: if (file.exists(rdatafile)) { load(rdatafile) } else { msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset, ".all.v4.0.", idtype, ".gmt") sprintf("msigfile is at %s, rdatafile is at %s", msigfile, rdatafile) numcols = count.fields(msigfile) msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols)) save(msigdata, file = rdatafile) } debug: msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset, ".all.v4.0.", idtype, ".gmt") debug: sprintf("msigfile is at %s, rdatafile is at %s", msigfile, rdatafile) debug: numcols = count.fields(msigfile) debug: msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols)) debug: save(msigdata, file = rdatafile) Error in gzfile(file, "wb") : cannot open the connection Calls: get_msigdata -> save -> gzfile In addition: Warning message: In gzfile(file, "wb") : cannot open compressed file '~/myscratch/cdiff/ChromDiff/msigdb/rdata/c2_c5.symbols.rdata', probable reason 'No such file or directory' Execution halted

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/angieyen/ChromDiff/issues/2#issuecomment-212055624

dyndna commented 8 years ago

Hi Angela,

rdata directory was not present. I created it and code ran until following error:

...
debug: matched.genes = convert(gencode.genes, idtype)
debug: final.genes = remove.extra(matched.genes)
debug: print("Getting enrichments...")
[1] "Getting enrichments..."
debug: gene.universe = get_gene.universe(idtype)
debug: results = get_enrichments(msigdata, final.genes, gene.universe)
Error in apply(msigdata, 1, function(msigdbgroup) { :
  object 'msigdata' not found
Calls: get_msigdata -> get_enrichments -> apply

msigdb/genesets exists and contains,

c2.all.v4.0.entrez.gmt c2.all.v4.0.symbols.gmt c2_c5.all.v4.0.entrez.gmt c2_c5.all.v4.0.symbols.gmt c5.all.v4.0.entrez.gmt c5.all.v4.0.symbols.gmt

Debug:

...
[1] "Calculating enrichments for ../plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/..."
debug: pval.table = read.table(allpvalfile)  
debug: names(pval.table) = c("feat", "pval") 
debug: if (nrow(pval.table) != length(get_featnames(model, metric, reldir))) {
    stop(paste("Problem calculating enrichments for ", plotdir,
        ": wrong number of entries", sep = ""))
}
Read 299025 items
debug: gencode.genes = get_sig_genes(pval.table)
debug: if (length(gencode.genes) > 0) {
    matched.genes = convert(gencode.genes, idtype)
    final.genes = remove.extra(matched.genes)
    print("Getting enrichments...")
    gene.universe = get_gene.universe(idtype)
    results = get_enrichments(msigdata, final.genes, gene.universe)
    if (!is.matrix(results) || nrow(results) > 0) {
        dir.create(dirname(outfile), recursive = TRUE, showWarnings = FALSE)
        write.table(results, quote = FALSE, file = outfile, row.names = FALSE,
            sep = "\t")
    }
    rm(results)
    if (clusts) {
        vert.ints.file = get_vert.ints.file(plotdir, plottypesuffix)
        if (!file.exists(vert.ints.file)) {  
            stop(paste0("File ", vert.ints.file, " does not exist."))
        }
        load(vert.ints.file)
        if (!(exists("vert.ints.noreps") && exists("minclustsize"))) {
            stop(paste0(vert.ints.file, " did not load vert.ints.noreps and minclustsize."))
        }
        if (!(is.list(vert.ints.noreps) && length(vert.ints.noreps) >
            0)) {
            stop(paste0("vert.ints.noreps was not a list or was length 0"))
        }
        vert.ints.noreps = order.list.by.first(vert.ints.noreps)
        geneorderfile = get_geneorder_file(plotdir, suffix, plottypesuffix)
        if (!file.exists(geneorderfile)) {   
            stop(paste0(geneorderfile, " does not exist, so can not run clustering analysis."))
        }
        gencode.genes = as.vector(t(read.table(geneorderfile,
            skip = 1)))
        final.genes = convert(gencode.genes, idtype)
        clustcount = 1
        for (clustind in 1:length(vert.ints.noreps)) {
            currclust.indices = vert.ints.noreps[[clustind]]
            if ((currclust.indices[2] - currclust.indices[1]) >
                minclustsize) {
                outfile = get_msigdb_clustfile(reldir, msigdbset,
                  idtype, model, metric, test, correction, property,
                  a.label, b.label, clustcount)
                if (!overwrite && file.exists(outfile) && file.info(outfile)$size >
                  0) {
                  next
                }
                currclust = unique(final.genes[(currclust.indices[1] -
                  0.5):(currclust.indices[2] + 0.5)])
                finalclust = remove.extra(currclust)
                results = get_enrichments(msigdata, finalclust,
                  gene.universe)
                if (!is.matrix(results) || nrow(results) > 0) {
                  dir.create(dirname(outfile), recursive = TRUE,
                    showWarnings = FALSE)
                  write.table(results, quote = FALSE, file = outfile,
                    row.names = FALSE, sep = "\t")
                }
                rm(results)
                clustcount = clustcount + 1  
            }
        }
    }
}
debug: matched.genes = convert(gencode.genes, idtype)
debug: final.genes = remove.extra(matched.genes)
debug: print("Getting enrichments...")
[1] "Getting enrichments..."
debug: gene.universe = get_gene.universe(idtype)
debug: results = get_enrichments(msigdata, final.genes, gene.universe)
Error in apply(msigdata, 1, function(msigdbgroup) { :
  object 'msigdata' not found
Calls: get_msigdata -> get_enrichments -> apply
debug: if (file.exists(rdatafile)) {
    load(rdatafile)
} else {
    msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset,
        ".all.v4.0.", idtype, ".gmt")
    sprintf("msigfile is at %s, rdatafile is at %s", msigfile,
        rdatafile)
    numcols = count.fields(msigfile)
    msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols))
    save(msigdata, file = rdatafile)
}
debug: msigfile = paste0(HOMEDIR, "msigdb/genesets/", msigdbset, ".all.v4.0.",
    idtype, ".gmt")
debug: sprintf("msigfile is at %s, rdatafile is at %s", msigfile, rdatafile)
debug: numcols = count.fields(msigfile)
debug: msigdata = read.table(msigfile, fill = TRUE, col.names = 1:max(numcols))
debug: save(msigdata, file = rdatafile)
debug: if (idtype == "symbols") {

    filt.msigdatafile = paste(HOMEDIR, "msigdb/rdata/", msigdbset,
        ".", idtype, ".filtered.rdata", sep = "")
    if (file.exists(filt.msigdatafile)) {
        load(filt.msigdatafile)
    }
    else {
        mappedsymbols = as.vector(t(read.table(gencodemappedfile)))
        newmsigdata = apply(msigdata, 1, function(msigdbgroup) {
            currgroupgenes = msigdbgroup[3:length(msigdbgroup)]
            indstoconvert = which(!(currgroupgenes %in% mappedsymbols))
            currgroupgenes[indstoconvert] = ""
            return(c(msigdbgroup[1:2], currgroupgenes))
        })
        msigdata = t(newmsigdata)
        save(msigdata, file = filt.msigdatafile)
    }
}
debug: filt.msigdatafile = paste(HOMEDIR, "msigdb/rdata/", msigdbset,
    ".", idtype, ".filtered.rdata", sep = "")
debug: if (file.exists(filt.msigdatafile)) { 
    load(filt.msigdatafile)
} else {
    mappedsymbols = as.vector(t(read.table(gencodemappedfile)))
    newmsigdata = apply(msigdata, 1, function(msigdbgroup) {
        currgroupgenes = msigdbgroup[3:length(msigdbgroup)]
        indstoconvert = which(!(currgroupgenes %in% mappedsymbols))
        currgroupgenes[indstoconvert] = ""   
        return(c(msigdbgroup[1:2], currgroupgenes))
    })
    msigdata = t(newmsigdata)
    save(msigdata, file = filt.msigdatafile) 
}
debug: mappedsymbols = as.vector(t(read.table(gencodemappedfile)))
debug: newmsigdata = apply(msigdata, 1, function(msigdbgroup) {
    currgroupgenes = msigdbgroup[3:length(msigdbgroup)]
    indstoconvert = which(!(currgroupgenes %in% mappedsymbols))
    currgroupgenes[indstoconvert] = ""
    return(c(msigdbgroup[1:2], currgroupgenes))
})
debug: msigdata = t(newmsigdata)
debug: save(msigdata, file = filt.msigdatafile)
debug: return(msigdata)
exiting from: get_msigdata(msigdbset, idtype)
dyndna commented 8 years ago

So, I ran ChromDiff again (cd ChromDiff && ./notes_v2.sh) and it worked! For the first time run, I think function, get_msigdata or subsequent function (in msigdb_funcs.R file) expects msigdata as rdatafile but it fails as rdatafile has not been generated yet.

Is following an expected end of ChromDiff run? I am on headless server and wonder where can I find generated results/figures. When I do git status, I find only two additonal files, tmp.brain.gi.rdata and tmp.rdata

Thanks, Samir

Rscript mk_sig_maj_heatmap.R perc wilcox fdr sex Female Male Female Male core_gencode_v10 domstate data/final_celltype_metadata.txt data/gencode_genes_full.txt data/cov.mat.txt data/map_vars_covariates.txt data/57epigenomes.RPKM.pc data/core_annotation.txt gencode_v10
[1] "Ordering columns from full heatmap..."
[1] "Ordering rows..."
[1] "Mapping to colors..."
[1] "Reordering if necessary..."
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
[1] "Generating final heatmap..."
[1] "Plotting to plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_matched_domstate.pdf ..."
null device 
          1 
[1] "Writing to plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_geneorder_matched_domstate.txt"
Generating expression heatmap...
[1] "Loading in celltype, gene, and chromatin state ordering..."
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
[1] "Generating final heatmap..."
[1] "Saving mat.to.plot to plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/exp.mat.to.plot_domstate.Rdata..."
Generating clustered heatmap...
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
null device 
          1 
[1] "Making Dominant chromatin state plot..."
[1] "Writing to plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sigfeats_feats_clust_vert_ints_domstate.rdata"
[1] "Making Gene expression differences plot..."
[1] "Loading plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/sig_maj_plot_dend_domstate.rdata ..."
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
There were 50 or more warnings (use warnings() to see the first 50)
[1] "getting msigdata using args: c2_c5 and symbols; workdir is /scratch/genomic_med/gm_dep/samin1/myscratch/cdiff/ChromDiff/msigdb"
[1] "Calculating enrichments for ../plots/core_gencode_v10/Female.Male/perc/wilcox/fdr/..."
Read 299025 items
[1] "Getting enrichments..."
[1] "Number of unique gene ids used for enrichment: 373"
dyndna commented 8 years ago

Hi Angela, I think I am getting all plots and results in text files. Not sure what those warnings are about for time being, I will go through result output and see if it makes sense. Thanks for your feedback.

angieyen commented 8 years ago

Glad you got it to work. ChromDiff does not require a msigdata (rdata object) on the first run, as it can load it from the raw data, but as I mentioned, you did find a bug that the rdata directory did not exist. Thanks for creating the rdata directory manually for your test run - I just updated the code so that other people should not have this problem.

In the README file, we describe where the plots and results are located in "#### PART 4: Finding results generated by ChromDiff #######"

The warnings are likely referring to the fact that "ties" in the feature values result in inexact p-value calculations, but the effect on results should be very small.