SGDDNB / translational_regulation

Detection of differential translated genes using Ribo-seq
14 stars 11 forks source link

error with apeglm #2

Closed imilenkovic closed 3 years ago

imilenkovic commented 3 years ago

Hello,

I tried to run the R script on my data, but it failed since the package 'apeglm' was missing. I installed it and reran the analysis, but then I got the following error: estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing Error in lfcShrink(ddsMat_ribo, contrast = c("Condition", "2", "1"), res = res_ribo) : type='apeglm' shrinkage only for use with 'coef' Execution halted

I tried running the sample data and I had the same error (what I am showing is actually from this sample data run). Any ideas how this can be solved?

Thanks a lot!

soniachothani commented 3 years ago

Hi,

Thank you for your question.

Can I check with you the version of DESeq2 you are using, you can do so by typing packageVersion("DESeq2").

There has been an update in DESeq2 after V1.28 where the default shrinkage method is now "apeglm". Check https://rdrr.io/bioc/DESeq2/man/lfcShrink.html for more details.

CHANGES IN VERSION 1.28.0

o For lfcShrink(), changed order of type options: "normal" will no longer be first, as it under-performed "apeglm" and "ashr" in Zhu et al (2018). The new order is type=c("apeglm", "ashr", "normal").

If you use apeglm = you need to specify the coef rather than contrast. So the line lfcShrink(ddsMat_ribo, contrast = c("Condition", "2", "1"), res = res_ribo) would change to lfcShrink(ddsMat_ribo, coef=coef_num, res = res_ribo)

You would have to identify the coefficient number by typing resultsNames(ddsMat_ribo) and look for the index of Condition_2_1 which is 2 (coef=2) in this context. You will need to do this also for res_rna: lfcShrink(ddsMat_rna, coef=coef_num, res = res_rna).

Thanks a lot for pointing this out. I have now updated this in the repository. Let me know if this works for you now or if you have any further issues. Happy to help!

Best, Sonia

imilenkovic commented 3 years ago

Hello,

I checked my DESeq2, and the version is 1.30.1.

I installed apeglm in R and pulled the repo to update the R script, but now the error that I am receiving is the following:

` ... Warning message: In DESeqDataSet(se, design = design, ignoreRank) : some variables in design formula are characters, converting to factors estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing [1] "Intercept" "Batch_2_vs_1" "Batch_3_vs_1" [4] "Condition_2_vs_1" "SeqType_RIBO_vs_RNA" "Condition2.SeqTypeRIBO" mkdir: cannot create directory ‘Results’: File exists mkdir: cannot create directory ‘fold_changes’: File exists mkdir: cannot create directory ‘gene_lists’: File exists

out of 17026 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 6, 0.035% LFC < 0 (down) : 6, 0.035% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 0) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

[1] 11 Warning message: In DESeqDataSet(se, design = design, ignoreRank) : some variables in design formula are characters, converting to factors estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing using 'apeglm' for LFC shrinkage. If used in published research, please cite: Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895 Warning message: In DESeqDataSet(se, design = design, ignoreRank) : some variables in design formula are characters, converting to factors estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing using 'apeglm' for LFC shrinkage. If used in published research, please cite: Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895 Error in plot.window(...) : need finite 'ylim' values Calls: plot -> plot -> plot.default -> localWindow -> plot.window Execution halted `

The two lfcShrink line are as follows: res_ribo <- lfcShrink(ddsMat_ribo, coef=2,res=res_ribo,type="apeglm") res_rna <- lfcShrink(ddsMat_rna, coef=2,type="apeglm",res=res_rna)

Do you maybe know what I did wrong here? Thanks!

imilenkovic commented 3 years ago

Also, another thing that I see is that in my data, only 11 genes are designated as differentially translated (the DTEGs.txt list); however when I check most of them in the input data I see the following:

RiboSeq counts (WT1, WT2, WT3, KO1, KO2, KO3) Pkhd1-201 1 3 2 19 0 0

RNASeq counts (WT1, WT2, WT3, KO1, KO2, KO3) Pkhd1-201 0 3 0 0 0 0

So I don't think this (nor most of the other genes) should have ended up as significantly different, do you maybe have an idea why this happens? Thanks a lot!

soniachothani commented 3 years ago

Hi, 

I think this error could be the issue: "some variables in design formula are characters, converting to factors" Could you share the top few rows of your two count matrices and also your sample info file so I can better understand the issue?

Do you get similar errors when you run the script for the sample data given on github, or does that work well now?

Let me know!

Best wishes,  Sonia 

imilenkovic commented 3 years ago

Hi Sonia,

I also get the error "some variables in design formula are characters, converting to factors" when running the sample data; however, I do not get the plotting error and the results are successfully produced.

Here's my ribo_counts: KO1_ribo KO2_ribo KO3_ribo WT1_ribo WT2_ribo WT3_ribo 0610009B22Rik-202 184 204 131 119 122 121 0610010F05Rik-201 44 47 52 33 23 37 0610010K14Rik-204 0 2 4 0 0 6 0610012G03Rik-201 167 120 138 128 107 134 0610030E20Rik-201 38 27 29 34 23 31 0610040J01Rik-201 14 5 10 18 11 4

rnaseq_counts: KO1_rna KO2_rna KO3_rna WT1_rna WT2_rna WT3_rna 0610009B22Rik-202 30 17 46 26 28 46 0610010F05Rik-201 6 12 18 14 15 13 0610010K14Rik-204 1 4 3 0 2 2 0610012G03Rik-201 151 118 249 91 162 218 0610030E20Rik-201 0 1 5 1 1 3 0610040J01Rik-201 3 3 4 2 3 9 1110002E22Rik-201 27 25 33 17 22 31 1110004F10Rik-201 112 70 125 72 88 138 1110008P14Rik-201 69 77 102 91 94 99

sample_info:

SampleID Condition SeqType Batch KO1_ribo 1 RIBO 1 KO2_ribo 1 RIBO 2 KO3_ribo 1 RIBO 3 WT1_ribo 2 RIBO 1 WT2_ribo 2 RIBO 2 WT3_ribo 2 RIBO 3 KO1_rna 1 RNA 1 KO2_rna 1 RNA 2 KO3_rna 1 RNA 3 WT1_rna 2 RNA 1 WT2_rna 2 RNA 2 WT3_rna 2 RNA 3

imilenkovic commented 3 years ago

What I noticed now is that when running sample data, lowly expressed transcripts are successfully filtered out:

out of 17912 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 922, 5.1% LFC < 0 (down) : 911, 5.1% outliers [1] : 0, 0% low counts [2] : 6876, 38% (mean count < 49)

but when running my data this doesn't happen: out of 17026 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 6, 0.035% LFC < 0 (down) : 6, 0.035% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 0)

soniachothani commented 3 years ago

Hey,

At the moment for your data do you see any results in the Results directory? Are you able to draw a PCA with your count data for Ribo-seq or RNA-seq? Do the WT and KO cluster separately, if yes what is the variance % on PC1 and PC2.

If you have no results in the Results directory, I would suggest loading the script in R and running step by step for your data. It seems either there is some specific-format issue or there might actually be very limited differential expression/translation when accounting for batch.

When running DTEG.R, you can skip these lines in the script: ///// args = commandArgs(trailingOnly=TRUE) /#test if there is at least one argument: if not, return an error if (length(args)!=4) { stop("Ribo-seq counts, RNA-seq counts and Sample Information and batch presence (0/1) should be supplied.n", call.=FALSE) } /////

and, edit the following lines to replace args[1],args[2],args[3],args[4] with the four arguments as you had in your command line run. /# Input filenames ribo_file <- args[1] rna_file <- args[2] data_file <- args[3] batch <- args[4]

After which you can run the script line by line and let me know where you hit an error. :)

Best, Sonia

imilenkovic commented 3 years ago

Hi Sonia,

Thanks a lot for all the answers :)

I do get some results and the PCA is produced (attaching it here:) image

and as expected there's not evident separate clustering of condition 1 and condition 2. This goes in line with the analysis I've done using another software, but I would like to validate the results by using this one too.

Fold_changes and gene_lists folders are created, and in the gene_lists folders there are 7 genes in the DTEGs.txt file. However, as I was saying earlier, most of these genes should have been filtered out since they all behave similar to this one:

RiboSeq counts (WT1, WT2, WT3, KO1, KO2, KO3) Pkhd1-201 1 3 2 19 0 0

RNASeq counts (WT1, WT2, WT3, KO1, KO2, KO3) Pkhd1-201 0 3 0 0 0 0

so I don't quite understand how these genes ended up in the final DTE list.

Do you think I should still edit the script and run it line by line or is there something else happening here? :)

Thanks!

Best, Ivan

soniachothani commented 3 years ago

Hi Ivan,

No worries. Thank you for the PCA - so it seems one of the sample in Condition 2 is more close to Condition 1 and therefore you may not find any DTGs, I am assuming that is your RNA-seq PCA.

Regarding the low counts issue, I think it could be in general you have low counts in your data for most genes since the DESeq2 summary also sets (mean count < 0) as the threshold. You can plot the count distribution (alternatively MA plot) in your data and check the library sizes of each of the samples to check if this is true. As you can see in the sample data the threshold for low counts is larger (mean counts < 49), therefore there are many genes flagged with low counts. Hope that helps. Let me know how it goes.

Best, Sonia