arnesmits / DEP

DEP package
27 stars 12 forks source link

Wrong results if the data is not sorted alphabetically #29

Open AlexWeinreb opened 2 years ago

AlexWeinreb commented 2 years ago

During the standard analysis, filter_missval() silently sorts the rowData alphabetically. If we skip this step, this results in a rowData where the rownames and the name column contain the same information, but add_rejections() later sorts on the name column without reordering rownames, getting values that do not match. This creates problems later on when functions such as single_plot() use subset(), which work on rownames, then tries to use the name column.

For example, with the steps from ?add_rejections:

# Load example
data <- UbiLength
data <- data[data$Reverse != "+" & data$Potential.contaminant != "+",]
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")

# Make SummarizedExperiment
columns <- grep("LFQ.", colnames(data_unique))
exp_design <- UbiLength_ExpDesign
se <- make_se(data_unique, columns, exp_design)

# Filter, normalize and impute missing values
# filt <- filter_missval(se, thr = 0)  # <- no filtering step: no sorting!
norm <- normalize_vsn(se)
imputed <- impute(norm, fun = "MinProb", q = 0.01)

# Test for differentially expressed proteins
diff <- test_diff(imputed, "control", "Ctrl")
dep <- add_rejections(diff, alpha = 0.05, lfc = 1)

Then the resulting object has wrong names:

SummarizedExperiment::rowData(dep, use.names = TRUE) |>
  as.data.frame() |> head() |> dplyr::select(name, ID)
#>             name       ID
#> RBM47   A7E2Y1-3 A7E2Y1-3
#> UBA6        AAAS   Q9NRG9
#> ILVBL      AAMDC Q9H7C9-3
#> FAM92A1     AAR2   Q9Y312
#> ACO2       ABCB6   H7BXK9
#> RPRD1B     ABCB7 O75027-3

plot_single(dep, proteins = c("A7E2Y1-3","AAAS"))
#> Warning: 6 parsing failures.
#> row col           expected actual
#>   1  -- value in level set  ACTR3
#>   2  -- value in level set  ACTR3
#>   3  -- value in level set  ACTR3
#>   4  -- value in level set  TUFM 
#>   5  -- value in level set  TUFM 
#> ... ... .................. ......
#> See problems(...) for more details.

For users: always use filter_missval(), even with a high threshold.