bigomics / playbase

Core back-end functionality and logic for OmicsPlayground
Other
4 stars 0 forks source link

Error in pgx.computePGX when working on small datasets #46

Closed phisanti closed 1 year ago

phisanti commented 1 year ago

Using a reduced version of the example dataset, the base pgx pipeline fails. See code below:

get_mini_example_data <- function() {
  counts <- playbase::COUNTS
  samples <- playbase::SAMPLES
  contrast <- playbase::CONTRASTS

  n_genes <- round(seq(1, nrow(counts), length.out = 100))

  # Subset each data frame to facilitate testing
  mini_counts <- counts[n_genes, c(1:3, 8:9, 11:12)]
  mini_samples <- samples[colnames(mini_counts),]
  mini_contrast <- contrast[1:3, 1:3]

  mini_data <- list(counts = mini_counts, samples = mini_samples, contrast = mini_contrast)

  return(mini_data)
}

d <- get_mini_example_data()

pgx2 <- playbase::pgx.createPGX(
 samples = d$samples,
 counts = d$counts,
 contrasts = d$contrast
)

x <- playbase::pgx.computePGX(pgx2)

The error is:

x <- playbase::pgx.computePGX(pgx2)
[pgx.computePGX] testing genes...
>>> computing gene tests for SINGLE-OMICS
[compute_testGenesSingleOmics] detecting stat groups...
[compute_testGenesSingleOmics] contrasts on groups (use design)
replacing contrast matrix...
[compute_testGenesSingleOmics] pruning unused contrasts
[compute_testGenesSingleOmics] normalizing contrasts
[compute_testGenesSingleOmics] 6 : creating model design matrix
[compute_testGenesSingleOmics] WARNING:: low total counts =  11795.79 
[compute_testGenesSingleOmics] applying global mean scaling to 1e6...
filtering for low-expressed genes: > 1 CPM in >= 2 samples
filtering out 6 low-expressed genes
keeping 85 expressed genes
>>> Testing differential expressed genes (DEG) with methods: ttest.welch trend.limma edger.qlf
[compute_testGenesSingleOmics] 12 : start fitting...
[ngs.fitContrastsWithAllMethods] using input log-expression matrix X...
[ngs.fitContrastsWithAllMethods] fitting using Welch t-test
[ngs.fitContrastsWithAllMethods] fitting using LIMMA trend
[ngs.fitContrastsWithLIMMA] fitting LIMMA contrasts using design matrix
[ngs.fitContrastsWithAllMethods] fitting edgeR using QL F-test 
[ngs.fitContrastsWithEDGER] fitting EDGER contrasts using design matrix
[ngs.fitContrastsWithAllMethods] correcting AveExpr values...
[ngs.fitContrastsWithAllMethods] calculating statistics...
[ngs.fitContrastsWithAllMethods] reshape matrices...
[ngs.fitContrastsWithAllMethods] aggregating p-values...
[compute_testGenesSingleOmics] 13 : fitting done!
            user.self sys.self elapsed user.child sys.child
ttest.welch      0.00     0.00    0.01         NA        NA
trend.limma      0.01     0.02    0.03         NA        NA
edger.qlf        0.34     0.01    0.38         NA        NA
[compute_testGenesSingleOmics] done!
[pgx.computePGX] testing genesets...
Filtering gene sets on size...
Matching gene set matrix...
Reducing gene set matrix...
Error in order(colnames(G)[jj]) : argument 1 is not a vector
In addition: Warning message:
2 very small variances detected, have been offset away from zero

This, therefore, means that x does not exist.

> x
Error: object 'x' not found
phisanti commented 1 year ago

The error seems to come from the function compute_testGenesets in compute2-geneset.R. Apparently, there is a hard limit to gene set size. If a gene set has lower than 15 genes, then it won't select any column for further analysis. Here is the code:

  ## filter gene sets on size
  cat("Filtering gene sets on size...\n")
  gmt.size <- Matrix::colSums(G != 0)
  size.ok <- (gmt.size >= 15 & gmt.size <= 400)
  G <- G[, which(size.ok)]

Here we have two options:

  1. We can reduce the limit or making it relative to the input data.
  2. Try some kind of try-catch method so that the script can continue but the geneset will be skipped.

Any suggestion @ivokwee @ncullen93?

ncullen93 commented 1 year ago

Yeah a try-catch method would be best. I'm not sure if this error occurs in the omicsplayground platform or if this issue is caught before this code gets run. This is a good example of why all data catching code should be in playbase and not in omicsplayground. Looping in @ivokwee on this.

ivokwee commented 1 year ago

I would debug the error. Sometimes people have just 500-1000 genes. So it is a good edge case.

phisanti commented 1 year ago

Ok, taking on account that last comment, I came up with the following solution:


  # If dataset is too small that size.ok == 0, then select top 100
  if (sum(size.ok) == 0) {
    top_100gs <- head(sort(gmt.size,decreasing = TRUE), 100)
    size.ok <- names(gmt.size) %in% names(top_100gs)
  }

Injecting this code in compute_testGenesets after calculating size.ok should solve the issue. This way, there will always be at least 100 genesets. Let me know what you think, @ncullen93 @ivokwee.

ivokwee commented 1 year ago

Yes that's a good solution. But please test.

phisanti commented 1 year ago

I can confirm the edit works in the Playbase tests as well as in the omicsplayground