JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0

showing NULL in the step of calcGexpCnvBoundaries - getting started tutorial #43

Closed Rongtingting closed 3 years ago

Rongtingting commented 3 years ago

Hi HoneyBADGER developers,

Thank you for developing this tool! I tried the getting-started tutorial to make sure that I can use the tool. However, I am stuck at the calcGexpCnvBoundaries step. I used the demo data provided with the tool, but the results were not the same as yours.

[Screenshot: honeyBADGER]

But I cannot find out why there is a NULL. Could you give me some pointers on how to figure it out? Thanks a lot for your time!!!

Rongtingting commented 3 years ago

When I tried mart.obj <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", dataset="hsapiens_gene_ensembl"), the error went away. I think the default version is now hg38 (useMart with host="www.ensembl.org"), but the HoneyBADGER demo dataset uses hg19 (host="grch37.ensembl.org"), and the "jul2015.archive.ensembl.org" archive may be private to the author's account.
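The workaround above can be written as a small wrapper. This is a minimal sketch, assuming the biomaRt package is installed and the GRCh37 mirror is reachable; the function name make_hg19_mart is hypothetical and is not part of HoneyBADGER or biomaRt. Depending on the biomaRt version, the host may need an "https://" prefix.

```r
# Hypothetical helper (not part of HoneyBADGER or biomaRt): build a mart
# object that points at the GRCh37 (hg19) Ensembl mirror rather than the
# default host.
make_hg19_mart <- function(host = "grch37.ensembl.org") {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("the biomaRt package is required")
  }
  biomaRt::useMart(
    biomart = "ENSEMBL_MART_ENSEMBL",
    dataset = "hsapiens_gene_ensembl",
    host    = host
  )
}

# Usage (needs network access), keeping the tutorial's variable name:
# mart.obj <- make_hg19_mart()
```

Keeping the host as an argument makes it easy to swap in an archive host later without touching the rest of the tutorial code.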

biojiangke commented 3 years ago

I got exactly the same "NULL" result with the default Ensembl genes (and all the following warnings etc.), but the solution is not working for me. I still get "NULL" with hg19. Any other thoughts?

Rongtingting commented 3 years ago

> I got exactly the same "NULL" result with the default Ensembl genes (and all the following warnings etc.), but the solution is not working for me. I still get "NULL" with hg19. Any other thoughts?

Have you tried running more steps? Please have a look at the results after the calcGexpCnvBoundaries step.

biojiangke commented 3 years ago

If 'host="grch37.ensembl.org"' is used for the "mart.obj" object, "calcGexpCnvBoundaries" gives a "NULL" result, and "regions.genes" is completely empty.

Rongtingting commented 3 years ago

> If 'host="grch37.ensembl.org"' is used for the "mart.obj" object, "calcGexpCnvBoundaries" gives a "NULL" result, and "regions.genes" is completely empty.

Using hg19, I get:

hb$calcGexpCnvBoundaries(init=TRUE, verbose=FALSE)
ERROR: Error: subscript contains invalid names
ERROR: Error: subscript contains invalid names
NULL

It seems to produce an error again. However, it can be ignored, since the following steps still run and produce some results (though different from the demo's results).

biojiangke commented 3 years ago

Is "regions.genes" empty, without any genomic intervals? Then "summarizeResults" would complain about empty results? In the end, this may not matter. Custom data may run through without problems. But, the fact that the tutorial example is not reproducible (even somewhat), is kinda disturbing.

Rongtingting commented 3 years ago

The regions.genes is not empty, but the tutorial example is indeed not reproducible...

print(regions.genes)
GRanges object with 4 ranges and 0 metadata columns:
      seqnames              ranges strand
  [1]    chr11     167784-77185680      *
  [2]     chr7    855528-158749438      *
  [3]     chr9 134000948-141019076      *
  [4]    chr10    320130-135187193      *
  -------
  seqinfo: 52 sequences from an unspecified genome; no seqlengths

biojiangke commented 3 years ago

Now it seems weirder: I got "seqinfo: 28 sequences from an unspecified genome; no seqlengths" from this step with an empty "regions.genes". Clearly there are multiple versions of reference assemblies going around here. The question is where the root is. I'd assume this comes from the mart.obj, but shouldn't we get the same "hsapiens_gene_ensembl" if we use the same "host" and "dataset" at the "useMart" step?

Rongtingting commented 3 years ago

Yes, I think we shouldn't get totally different output.

JEFworks commented 3 years ago

Dear Rongtingting,

Thank you for taking the initiative to address the issue you discovered and sharing the solution. Indeed, the data included with the package was aligned to hg19. Back when this paper was originally published and this subsequent tutorial released, biomaRt's default version was hg19 and has indeed since been updated. The exact version used for both the paper and tutorial is the assembly from July 2015! The full set of changes to the human genome since hg19vJuly2015 can be found here: http://useast.ensembl.org/info/website/archives/index.html

There are at least a few reasons why using a different genome version may produce slightly different results. One, the gene symbols/names may have changed, so a gene that is included in the built-in data can no longer be found in the new assembly. Two, the gene coordinates may have changed. This will affect the exact genomic coordinates represented by the genes and subsequently the exact genomic coordinates of the CNVs identified. Three, newer genome assemblies may also have different alternative contig names (regions.genes@seqnames that are not chromosomes 1 through 22), though this should not impact the final results, which are limited to chromosomes 1 through 22 anyway (though you may find a different number of sequences from unspecified genomes than what is noted in the tutorial).
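The first and third reasons can be checked with a few lines of base R. This is a minimal sketch on made-up vectors: the gene and contig names below are placeholders, not values from the package data. In practice the first vector would be the row names of the expression matrix and the second the symbols returned by the annotation source.

```r
# Toy stand-ins (made up): gene symbols in the expression matrix and gene
# symbols known to the annotation source.
matrix_genes     <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
annotation_genes <- c("GENE_A", "GENE_C", "GENE_E")

# Reason one: genes present in the data but absent from the annotation.
missing_genes <- setdiff(matrix_genes, annotation_genes)
# Fraction of data genes that the annotation can place on the genome.
matched_frac <- mean(matrix_genes %in% annotation_genes)

# Reason three: contig names beyond chromosomes 1 through 22 (made up).
contigs   <- c("chr1", "chr7", "chrX", "chrHG1_PATCH")
autosomes <- paste0("chr", 1:22)
kept      <- contigs[contigs %in% autosomes]

print(missing_genes)
print(matched_frac)
print(kept)
```

A low matched fraction would suggest that the annotation version does not match the data, which is one way an empty set of regions could arise.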

The version and seed of the JAGS runs you use may also play a minor role, since HMMs are stochastic after all. You should also double-check that JAGS is installed and running correctly, since it is external to the R environment.
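One way to check the JAGS installation from within R is sketched below. The helper name jags_status is hypothetical. It relies on the rjags package failing to load when the external JAGS library is missing, and it makes no claim about how HoneyBADGER itself seeds its JAGS runs.

```r
# Hypothetical helper: report whether rjags (and therefore JAGS) loads,
# and which versions are in use.
jags_status <- function() {
  # requireNamespace() returns FALSE if rjags is not installed or fails to
  # load, which typically happens when the external JAGS library is absent.
  ok <- requireNamespace("rjags", quietly = TRUE)
  list(
    rjags_available = ok,
    rjags_version   = if (ok) as.character(utils::packageVersion("rjags")) else NA_character_,
    jags_version    = if (ok) as.character(rjags::jags.version()) else NA_character_
  )
}

str(jags_status())

# Fixing R's seed makes R-side randomness repeatable. Whether it also pins
# the JAGS sampler depends on how the model is initialized (not verified here).
set.seed(0)
```

Recording these versions alongside the results makes it easier to compare runs between machines.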

All of this may impact the exact coordinates of the CNVs identified, in particular before retestIdentifiedCnvs is used to filter out spurious/non-confident identified CNVs. However, the final set of identified CNVs on chromosomes 5, 7, 20, 10, 13, and 14 should be reproducible, especially if you are able to reproduce the figure from hb$plotGexpProfile().

The tutorial is compiled from the Rmarkdown under https://github.com/JEFworks-Lab/HoneyBADGER/blob/master/vignettes/Getting_Started.Rmd in case you would like to recompile it from there instead of copying and pasting from the tutorial.

Hope that clarifies some things.

Stay healthy and safe, Prof. Jean Fan

biojiangke commented 3 years ago

Since "jul2015.archive.ensembl.org" is not available at this point (the archive used in the Rmd), the closest ones on ensembl archive list are "may2015" and "sep2015" archives. I tried both and "sep2015" is generating the closest results to the tutorial. With a few snags, the tutorial will run up to the "using allele information" part, which I haven't tested yet. Indeed, the "amps" on chr5, 7, 20, and "dels" on chr10, 13, 14 showed up (with some extra "dels" on chr6, 9, 11). One minor suggestion might be to update the documentation with some specific information about which ensembl archive(s) might generate similar results, because the original "jul2015" is not accessible now.
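Trying the archive hosts in order can be automated. The following is a minimal sketch: first_working_host is a hypothetical helper, the host list simply repeats the hosts named in this thread, and the commented usage assumes biomaRt is installed with network access. biomaRt::listEnsemblArchives() can be used to see which archives are currently available.

```r
# Hypothetical helper: try each host in order and return the first
# connection that succeeds. `connect` is any function taking a host string.
first_working_host <- function(hosts, connect) {
  for (h in hosts) {
    result <- tryCatch(connect(h), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(host = h, mart = result))
    }
  }
  stop("none of the hosts could be reached: ", paste(hosts, collapse = ", "))
}

# Hosts named in this thread, with the one reported as closest listed first.
candidate_hosts <- c("sep2015.archive.ensembl.org",
                     "may2015.archive.ensembl.org",
                     "grch37.ensembl.org")

# Real usage (needs biomaRt and network access):
# res <- first_working_host(candidate_hosts, function(h) {
#   biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
#                    dataset = "hsapiens_gene_ensembl", host = h)
# })
# mart.obj <- res$mart
```

Passing the connection function as an argument keeps the fallback logic testable without a network connection.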

Rongtingting commented 3 years ago

Dear Prof. Fan,

Thank you for taking the time to help us with this issue. Yes, a different version of rjags might cause small differences during sampling.

With the demo data provided by the package, both the expression and the allele parts can be run following the getting-started tutorial.

However, I found that the last step, which combines the expression and allele information, does not produce results! Could you give me some pointers on how to figure it out? (The log of the last step is attached.)

Thanks a lot for your time!!!

hb$retestIdentifiedCnvs(retestBoundGenes=TRUE, retestBoundSnps=TRUE, verbose=FALSE)
WARNING! ONLY 9 SNPS IN REGION!
WARNING! ONLY 2 SNPS IN REGION!
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 30095
   Unobserved stochastic nodes: 37372
   Total graph size: 431029

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 46958
   Unobserved stochastic nodes: 64285
   Total graph size: 754590

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 3548
   Unobserved stochastic nodes: 4150
   Total graph size: 27465

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%
ERROR! ONLY 1 GENES IN REGION!
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 13842
   Unobserved stochastic nodes: 10094
   Total graph size: 69120

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 20082
   Unobserved stochastic nodes: 13934
   Total graph size: 102837

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 8578
   Unobserved stochastic nodes: 7005
   Total graph size: 40610

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|**| 100%
|**| 100%

results <- hb$summarizeResults(geneBased=TRUE, alleleBased=TRUE)
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 7, 6
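For readers unfamiliar with this base R error, the sketch below reproduces it on made-up vectors. It does not diagnose HoneyBADGER's internals. One possible but unverified explanation is that the region reported with "ERROR! ONLY 1 GENES IN REGION!" was skipped in one of the two result sets, leaving 7 entries in one and 6 in the other.

```r
# Made-up per-region values: 7 regions in one result set, 6 in the other.
allele_values <- seq(0.1, 0.7, by = 0.1)  # length 7
gene_values   <- seq(0.1, 0.6, by = 0.1)  # length 6

# data.frame() refuses to combine columns whose lengths cannot be recycled
# to a common number of rows; capture the error message instead of stopping.
err <- tryCatch(
  data.frame(allele = allele_values, gene = gene_values, check.names = FALSE),
  error = function(e) conditionMessage(e)
)
print(err)

# Comparing lengths before combining shows which side is short.
print(c(allele = length(allele_values), gene = length(gene_values)))
```

Comparing the number of regions retained by the gene-based and allele-based retests would be a reasonable first check before calling summarizeResults.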
