DGA issue with edgeR contrast matrix, error in `dat_path[["salmon"]]`: ! subscript out of bounds

methornton commented 1 year ago

Hello!

Even though I am halted on the heatmap function still, I had the normalization files "list_norm_dgls" and I was able to make a "piRNA_predicted_Targets_rheMac10.txt". I'm partial to the voomQW_Quantile normalization. I figured I would plunge ahead with the DGA analysis. I am used to using edgeR and making the contrast matrix. I have five groups and five comparisons. When I put that information in it seems to work until the "map" step, where it gives the subscript out of bounds error.

Here is the top half of my DGA script which is mostly identical to the TCGA_BRCA script.


library('tidyverse')
library('edgeR')
library('DESeq2')

## Date format
todate <- format(Sys.time(), "%d_%b_%Y")

### Make the directory for the results of the exploratory data analysis
my_basename <- file.path(".") ## INPUT name of the main folder
my_exp <- "Barry_19138" ## INPUT name of the analysis
genome_input <- "rheMac10" ## INPUT genome version here
my_tools <- c("salmon","featureCounts")
dat_path <- file.path(my_basename, str_glue("DGA_{my_exp}_{genome_input}_{todate}"), my_tools)
dat_path %>% map(~dir.create(., recursive = TRUE))

## Import the normalized files
list_norm_dgls <- list.files(path = my_basename, pattern = "list_norm_dgls.+rds", recursive = TRUE, full.names = TRUE)

# load salmon normalized files
salmon_norm <- list_norm_dgls %>% unlist %>% str_detect("salmon") %>% list_norm_dgls[.] %>% read_rds()

# load featurecounts normalized files
fc_norm <- list_norm_dgls %>% unlist %>% str_detect("featureCounts") %>% list_norm_dgls[.] %>% read_rds()

## Extract normalized objects
## I choose TMM  and VoomQW_Quantile
salmon_edgR_TMM <- salmon_norm[["TMM"]]
salmon_vm_QW_Q <- salmon_norm[["voomQW_Quantile"]]
fc_edgR_TMM <- fc_norm[["TMM"]]
fc_vm_QW_Q <- fc_norm[["voomQW_Quantile"]]

## Extract the design matrix

# voom object
design <- salmon_vm_QW_Q$design

# edgeR
# design ----
colnames(design) # check the names and make the contrasts
con_mat <- makeContrasts( AceTrnvAceExp = EXP.Ace - TRN.Ace, LyralTrnvLyralExp = EXP.Lyral - TRN.Lyral, levels = design)

## salmon ----
salmon_edgR_TMM <- estimateDisp(salmon_edgR_TMM, design = design, robust=TRUE)
salmon_edgR_TMM <- glmQLFit(salmon_edgR_TMM, design, robust = TRUE)

DE_salmon_edgR <- con_mat %>% colnames() %>% set_names() %>%
        map(~glmQLFTest(salmon_edgR_TMM, contrast = con_mat[,.x]) %>%
            topTags(n = nrow(.), adjust.method = "BH", sort.by = "PValue", p.value = 1) %>%
            .$table %>%
            as_tibble(rownames = "smallRNA") %>%
            write_tsv(file.path(dat_path[['salmon']], str_c("DE_salmon_edgR_TMM_", .x, ".txt")))
        )

pdf(file.path(dat_path[['salmon']],str_c("hist_edger_p_value_", names(DE_salmon_edgR[1]),".pdf") ))
hist(DE_salmon_edgR[[1]]$PValue, breaks = 0:20/20,col = "grey50", border = "white")
dev.off()

Here is the error it also occurs with featureCounts data


> DE_salmon_edgR
            CtrvAce           CtrvLyral       AceTrnvAceExp   LyralTrnvLyralExp 
          "CtrvAce"         "CtrvLyral"     "AceTrnvAceExp" "LyralTrnvLyralExp" 
            TrnvExp 
          "TrnvExp" 
> DE_salmon_edgR <- con_mat %>% colnames() %>% set_names() %>%
+ map(~glmQLFTest(salmon_edgR_TMM, contrast = con_mat[,.x])
+ %>% topTags(n = nrow(.), adjust.method = "BH", sort.by = "PValue", p.value = 1)
+ 
> DE_salmon_edgR <- con_mat %>% colnames() %>% set_names() %>%
+         map(~glmQLFTest(salmon_edgR_TMM, contrast = con_mat[,.x]) %>%
+             topTags(n = nrow(.), adjust.method = "BH", sort.by = "PValue", p.value = 1) %>%
+             .$table %>%
+             as_tibble(rownames = "smallRNA") %>%
+             write_tsv(file.path(dat_path[['salmon']], str_c("DE_salmon_edgR_TMM_", .x, ".txt")))
+         )
Error in `map()`:
ℹ In index: 1.
ℹ With name: CtrvAce.
Caused by error in `dat_path[["salmon"]]`:
! subscript out of bounds
Run `rlang::last_trace()` to see where the error occurred.
> fc_edgR_TMM <- estimateDisp(fc_edgR_TMM, design = design, robust=TRUE)
> fc_edgR_TMM <- glmQLFit(fc_edgR_TMM, design, robust = TRUE)
> 
> DE_FC_edgR <- con_mat %>% colnames() %>% set_names() %>%
+         map(~glmQLFTest(fc_edgR_TMM, contrast = con_mat[,.x]) %>%
+             topTags(n = nrow(.), adjust.method = "BH", sort.by = "PValue", p.value = 1) %>%
+             .$table %>%
+             as_tibble(rownames = "smallRNA") %>%
+             write_tsv(file.path(dat_path[['featureCounts']],str_c("DE_fc_edgR_TMM",.x ,".txt")))
+         )
Error in `map()`:
ℹ In index: 1.
ℹ With name: CtrvAce.
Caused by error in `dat_path[["featureCounts"]]`:
! subscript out of bounds

Getting closer! I really appreciate your help. Thank you!

ConYel commented 1 year ago

Hello @methornton and thank you very much for your effort. Could you add this:

dat_path <- file.path(my_basename, str_glue("DEA_{my_exp}_{genome_input}_{todate}"),
                      my_tools) %>% set_names(my_tools)

and rerun it? It seems that the vectors of dat_path don't have the proper names.

methornton commented 1 year ago

Hello! @ConYel ! So this addition helped, I have four conditions in cond_mat, the program will make a histogram for the first one, but then hangs and doesn't error out. A venn diagram is made for only the first condition in the root DGA folder, venn_diagram_DE_salmon_fC_edgeR_CtrvAce.pdf, for my salmon filter there only remained 204 scnRNA and in the venn diagram there are no significant genes, which lists 204 up 204 dn, there may be some for the other conditions. When I check the DE tsv for that condition for salmon and fc, there are indeed no significant genes. I cam remove the condition, but it would be nice to be able to keep it so i can show the PI that there were no significant sncRNA.

the script processes to generate these files, for featureCounts a well, and then hangs


DE_salmon_edgR_TMM_AceTrnvAceExp.txt      DE_salmon_vm_QW_Q_CtrvAce.txt
DE_salmon_edgR_TMM_CtrvAce.txt            DE_salmon_vm_QW_Q_CtrvLyral.txt
DE_salmon_edgR_TMM_CtrvLyral.txt          DE_salmon_vm_QW_Q_LyralTrnvLyralExp.txt
DE_salmon_edgR_TMM_LyralTrnvLyralExp.txt  hist_edger_p_value_CtrvAce.pdf
DE_salmon_vm_QW_Q_AceTrnvAceExp.txt       hist_limma_p_value_CtrvAce.pdf

venn_diagram_DE_salmon_fC_edgeR_CtrvAce.pdf DE_fc_edgR_TMMCtrvAce.txt DE_salmon_edgR_TMM_CtrvAce.txt

ConYel commented 1 year ago

Hello @methornton, the code for the venn and the histogram were made for only one condition at that time, just for exploration. To make for each condition we need to change the function a bit. I will check it with some toy data and update you. Regarding the resulted txt files. You can use the significant from there to perform any kind of downstream analysis you prefer. For mirna targets you can use some online tool.

methornton commented 1 year ago

Hello! I removed the condition that had zero significant genes and the script still hangs at the same spot. Perhaps I made a mistake in transcription?

# design ----
colnames(design) # check the names and make the contrasts
con_mat <- makeContrasts(AceTrnvAceExp = EXP.Ace - TRN.Ace, LyralTrnvLyralExp = EXP.Lyral - TRN.Lyral, CtrvLyral = ((TRN.Lyral + EXP.Lyral)/2) - CTR.Control, levels = design)

## salmon ----
salmon_edgR_TMM <- estimateDisp(salmon_edgR_TMM, design = design, robust=TRUE)
salmon_edgR_TMM <- glmQLFit(salmon_edgR_TMM, design, robust = TRUE)

DE_salmon_edgR <- con_mat %>% colnames() %>% set_names() %>%
        map(~glmQLFTest(salmon_edgR_TMM, contrast = con_mat[,.x]) %>%
            topTags(n = nrow(.), adjust.method = "BH", sort.by = "PValue", p.value = 1) %>%
            .$table %>%
            as_tibble(rownames = "smallRNA") %>%
            write_tsv(file.path(dat_path[['salmon']], str_c("DE_salmon_edgR_TMM_", .x, ".txt")))
        )

pdf(file.path(dat_path[['salmon']],str_c("hist_edger_p_value_", names(DE_salmon_edgR[1]),".pdf") ))
hist(DE_salmon_edgR[[1]]$PValue, breaks = 0:20/20,col = "grey50", border = "white")
dev.off()

salmon_edgeR_TMM_p <- DE_salmon_edgR[[1]] %>%
        mutate(salmon_edgeR = if_else(
                                      FDR >= 0.05, 0, if_else(
                                                              logFC > 0, 1, -1
                                      )
                                      )) %>%
select(smallRNA , salmon_edgeR)

So something that is interesting is that for the salmon there is inserted a '_' between TMM and the condmat condition. this is not so with featureCounts. In the code above, i would seem like there would not be a '' inserted.


DE_fc_edgR_TMMAceTrnvAceExp.txt      DE_fc_vm_QW_Q_CtrvLyral.txt
DE_fc_edgR_TMMCtrvLyral.txt          DE_fc_vm_QW_Q_LyralTrnvLyralExp.txt
DE_fc_edgR_TMMLyralTrnvLyralExp.txt  hist_edger_p_value_AceTrnvAceExp.pdf
DE_fc_vm_QW_Q_AceTrnvAceExp.txt      hist_limma_p_value_AceTrnvAceExp.pdf

methornton commented 1 year ago

Sure. Ok then thanks. So, I set the script to one condition at a time. I can run the script for each condition, no problem. I really appreciate your help. I can get what I need from those spreadsheets. The annotation forging was very important. it would be nice to get all the graphs. it is a lot of work.

ConYel commented 1 year ago

Hello @methornton , could you let me know where the script hangs? Is it the issue on the histogram only or before?

ConYel commented 1 year ago

Hello @methornton ,have you solved the issues?

methornton commented 1 year ago

Hi @ConYel I can get most of the EDA script to work. I am getting an intermittent error, probably with pkgconfig. functionally, I can get the count data out and then process it normally. So there are no deal breakers. Thank you!! It would be nice to be able to use contrast matrices for multiple comparisons, practically that will improve adoption.

ConYel commented 1 year ago

Hello @methornton, yes it should have been changed to use in one run all the comparisons. I had some issues on how to implement it with the Venn. because it's time the names of the comparisons change so I didn't dive into it much. Mostly I was focusing on getting the DE from all comparisons and the create any other plots I need individually.

ConYel / wind

DGA issue with edgeR contrast matrix, error in `dat_path[["salmon"]]`: ! subscript out of bounds #5