UCLouvain-CBIO / scp

Single cell proteomics data processing
https://uclouvain-cbio.github.io/scp/index.html
19 stars 2 forks source link

normByType function in advanced usage vignette #37

Closed steveb-123 closed 1 year ago

steveb-123 commented 1 year ago

@cvanderaa After trying to wrap my head around the example rowwise normalisation function in the advanced usage vignette, I think that it may have a bug.

It probably not super important but I tend to stall out on confusing things until I understand them, and since it is an example, rather than the main point of the vignette it probably should be smooth to get past.

The function as written is this:

normByType <- function(x, type) {
    ## Check argument
    stopifnot(length(type) == ncol(x))
    ## Normalize for each type separately
    for (i in type) {
        ## Get normalization factor
        nf <- rowMedians(x[, type == i], na.rm = TRUE)
        ## Perform normalization
        x[, type == i] <- x[, type == i] / nf
    }
    ## Return normalized data
    x
}

The logic of the iteration confused me somewhat... as far as I could tell the same values were computed repeatedly. I came to think actually the following should be the direct equivalent (without duplicate computations), and it also made sense to me much more intuitively:

normByType <- function(x, type) {
    ## Check argument
    stopifnot(length(type) == ncol(x))
    ## Normalize for each type separately
    for (i in unique(type)) {         ## <---- CHANGE
        ## Get normalization factor
        nf <- rowMedians(x[, type == i], na.rm = TRUE)
        ## Perform normalization
        x[, type == i] <- x[, type == i] / nf
    }
    ## Return normalized data
    x
}

However, when I tested things, the second function version actually appeared to not arrive at the same value. So I delved a little deeper and explicitly tested my suspicions.

I found, as I suspected that the first function performed 38 iterations (many duplicate) while the second performed 6 iterations. When I performed a napkin summary with sum() on the differing resultant matrices, my function got the identical results in all the conditions except one - Macrophage. By this (poor?) measure, my version got an Inf sum for the Macrophage assay matrix, whereas the original function version got a reasonable sounding (though high) real number, compared with the other conditions. I assumed that my way was wrong for some unknown reason...

So I ran this, to check what is actually happening over the iterations:

normByType <- function(x, type) {
  ## Check argument
  stopifnot(length(type) == ncol(x))
  ## Normalize for each type separately
  for (i in type) {
    print(i)
    ## Get normalization factor
    nf <- rowMedians(x[, type == i], na.rm = TRUE)
    ## Perform normalization
    print(paste0("Current rowmedians (sum): ",sum(nf, na.rm=TRUE )))
    print(paste0(i, " before: ", sum(x[, type == i], na.rm = TRUE)))
    x[, type == i] <- x[, type == i] / nf
    print(paste0(i, " after: ", sum(x[, type == i], na.rm = TRUE)))
  }
  ## Return normalized data
  x
}

sce <- getWithColData(scp1, "proteins")
mnorm <- normByType(assay(sce), type = sce$SampleType)
> > sce <- getWithColData(scp1, "proteins")
+ mnorm <- normByType(assay(sce), type = sce$SampleType)
Warning message:
'experiments' dropped; see 'metadata' 
> [1] "Carrier"
[1] "Current rowmedians (sum): 18616047.47"
[1] "Carrier before: 19115166.32"
[1] "Carrier after: 242"
[1] "Reference"
[1] "Current rowmedians (sum): 1510273.29"
[1] "Reference before: 1548841.245"
[1] "Reference after: 228"
[1] "Unused"
[1] "Current rowmedians (sum): 999214.6275"
[1] "Unused before: 1448970.665"
[1] "Unused after: 299.986827636659"
[1] "Macrophage"
[1] "Current rowmedians (sum): 478724.0325"                           ### NOTE
[1] "Macrophage before: 3875081.12"                           
[1] "Macrophage after: Inf"                                                      ### <----- Changes to Inf
[1] "Monocyte"
[1] "Current rowmedians (sum): 490568.5525"
[1] "Monocyte before: 895594.955"
[1] "Monocyte after: 383.914285093242"
[1] "Macrophage"
[1] "Current rowmedians (sum): Inf"                                       ### Now
[1] "Macrophage before: Inf"
[1] "Macrophage after: 1665.18998300829"                        ### <------ "Fixes"? the Inf
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Carrier"
[1] "Current rowmedians (sum): 234"
[1] "Carrier before: 242"
[1] "Carrier after: 242"
[1] "Reference"
[1] "Current rowmedians (sum): 220"
[1] "Reference before: 228"
[1] "Reference after: 228"
[1] "Unused"
[1] "Current rowmedians (sum): 217"
[1] "Unused before: 299.986827636659"
[1] "Unused after: 299.986827636659"
[1] "Monocyte"
[1] "Current rowmedians (sum): 224"
[1] "Monocyte before: 383.914285093242"
[1] "Monocyte after: 383.914285093242"
[1] "Blank"
[1] "Current rowmedians (sum): 335160.115"
[1] "Blank before: 600140.575"
[1] "Blank after: 213.402037902834"
[1] "Monocyte"
[1] "Current rowmedians (sum): 224"
[1] "Monocyte before: 383.914285093242"
[1] "Monocyte after: 383.914285093242"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Carrier"
[1] "Current rowmedians (sum): 234"
[1] "Carrier before: 242"
[1] "Carrier after: 242"
[1] "Reference"
[1] "Current rowmedians (sum): 220"
[1] "Reference before: 228"
[1] "Reference after: 228"
[1] "Unused"
[1] "Current rowmedians (sum): 217"
[1] "Unused before: 299.986827636659"
[1] "Unused after: 299.986827636659"
[1] "Unused"
[1] "Current rowmedians (sum): 217"
[1] "Unused before: 299.986827636659"
[1] "Unused after: 299.986827636659"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Blank"
[1] "Current rowmedians (sum): 137"
[1] "Blank before: 213.402037902834"
[1] "Blank after: 213.402037902834"
[1] "Monocyte"
[1] "Current rowmedians (sum): 224"
[1] "Monocyte before: 383.914285093242"
[1] "Monocyte after: 383.914285093242"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Monocyte"
[1] "Current rowmedians (sum): 224"
[1] "Monocyte before: 383.914285093242"
[1] "Monocyte after: 383.914285093242"
[1] "Blank"
[1] "Current rowmedians (sum): 137"
[1] "Blank before: 213.402037902834"
[1] "Blank after: 213.402037902834"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"
[1] "Macrophage"
[1] "Current rowmedians (sum): 212"
[1] "Macrophage before: 1665.18998300829"
[1] "Macrophage after: 1665.18998300829"

There is some kind of stabilising self-iteration acting multiply on its own results. Curiously it makes the Inf value come and disappear again.

I'm not sure if this function is working as you intended or not, but I just wanted to give a heads up as it looks unintended at face value (and the example is more confusing because of it). Also, just in case you still use this code. I can imagine that if these values are were imputed or filtered in some later step maybe this weirdness disappears. Also, I havent looked into any more detail than this, so I could obviously be wrong somewhere.

I hope this helps in some way.

Thanks for all your help on the other channel!

cvanderaa commented 1 year ago

My apologies, there is indeed a bug in the code! Your fix

 for (i in unique(type)) {         ## <---- CHANGE

makes more sense!

Thanks for the notice!