RGLab / flowCore

Core flow cytometry infrastructure
43 stars 25 forks source link

spillover matrix is incorrectly normalized IMHO #94

Closed SamGG closed 5 years ago

SamGG commented 6 years ago

Dear hard workers, I compare FlowJo compensation matrix to flowCore result. IMHO, changes to the spillover function are needed:

Check my work at http://rpubs.com/SamGG/spillover-omip-018 My work deals with the following part of the code. Cheers.

https://github.com/RGLab/flowCore/blob/37c4546341f1475fda49d7a1d63c82bdce5489aa/R/flowSet-accessors.R#L749-L762

              # background correction
              inten <- pmax(sweep(inten[-unstained, ], 2, inten[unstained, ]), 0)
>>> changes start here
              # normalize each row
              # If the channel order was not set above, then a guess is made based on the
              # largest statistic for the compensation control (i.e., the row).
              # If the channels are "ordered", the diagonal must be equal to 1.
              # If the channels are "regexpr" matched, the files (row of the matrix) must
              # be ordered first, then the diagonal must be equal to 1.
              if (stain_match == "intensity") {
                # normalize by max of each row
                inten <- sweep(inten, 1, apply(inten, 1, max), "/")
                # find max for each file
                channel_order <- apply(inten, 1, which.max)
                if (anyDuplicated(channel_order) > 0) {
                  stop("Unable to match stains with controls based on intensity: ",
                       "a single stain matches to several multiple controls. ",
                       call. = FALSE)
                }
                # order the row
                inten <- inten[order(channel_order),]
              } else {
                if (stain_match == "regexpr")  # order the row
                  inten = inten[channel_order, ]
                # normalize by the diagonal
                inten <- sweep(inten, 1, diag(inten), "/")      
              }

              # Updates row names
>>> changes end here
              rownames(inten) <- colnames(inten)
              inten
gfinak commented 6 years ago

Can you clarify what exactly is the error or problem is that you are having? I don’t quite understand from your code what the issue was. Is this something specific to one data set or a general problem with how the spillover matrices are calculated? Just so you know, the spillover code is about as old as the entire flow infrastructure and predates my own involvement in the project so there’s definitely room to improve it, and the spillover calculation doesn’t get much usage, from what I can tell. Can you provide a bit more exposition in your example, maybe provide a reproducible example that we can use for testing etc? I'm very glad you are using this API and look forward to improving it.

Thanks, Greg Finak

On Dec 20, 2017 15:47, "Samuel Granjeaud" notifications@github.com wrote: Dear hard workers, I compare FlowJo compensation matrix to flowCore result. IMHO, changes to the spillover function are needed:

reorder the rows (regexpr & intensity modes) normalize to diagonal (regexpr & ordered modes) Check my work at http://rpubs.com/SamGG/spillover-omip-018 My work deals with the following part of the code. Cheers.

https://github.com/RGLab/flowCore/blob/37c4546341f1475fda49d7a1d63c82bdce5489aa/R/flowSet-accessors.R#L749-L762

          # background correction
          inten <- pmax(sweep(inten[-unstained, ], 2, inten[unstained, ]), 0)

changes start here

normalize each row

If the channel order was not set above, then a guess is made based on the

largest statistic for the compensation control (i.e., the row).

If the channels are "ordered", the diagonal must be equal to 1.

If the channels are "regexpr" matched, the files (row of the matrix) must

be ordered first, then the diagonal must be equal to 1.

if (stain_match == "intensity") {

normalize by max of each row

inten <- sweep(inten, 1, apply(inten, 1, max), "/")

find max for each file

channel_order <- apply(inten, 1, which.max) if (anyDuplicated(channel_order) > 0) { stop("Unable to match stains with controls based on intensity: ", "a single stain matches to several multiple controls. ", call. = FALSE) }

order the row

inten <- inten[order(channel_order),] } else { if (stain_match == "regexpr") # order the row inten = inten[channel_order, ]

normalize by the diagonal

inten <- sweep(inten, 1, diag(inten), "/")
}

          # Updates row names

changes end here rownames(inten) <- colnames(inten) inten — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

SamGG commented 6 years ago

The first problem is algorithmic. In ordered mode, no problem because the user states what (s)he does. If regexpr is selected and files need to be reorganized in order to match the markers order, then the result is the same as the ordered mode although it should not. This results from the matching being computed, but not applied, and from unstained file being not removed from the matching list (L685).

The second problem is numerical. It arises when the maximum intensity for a row (ie a FCS file) does not correspond to the stained channel (ie column) of that FCS file. From what I understand, the coefficients on the diagonal of the spillover matrix should be 1. Mike's example (code below) shows this effect at the lower right. This is in fact just a scaling coefficient, but the matrix does not looks like FlowJo. I think you have to download the OMIP 018 panel from flowrepository if you want to validate vs FlowJo. The OMIP-018 that biomiha told me (bioc forum) sounds interesting. This is a published panel, and I tried various codes on my Rpubs page. For the files (row of the matrix) where the highest coefficient is not on the the coefficients are clearly different from FlowJo (screenshots at the end of the rpubs - I think FlowJo computation are OK).

Let me know if I am wrong in my deductions. Thanks for your answer.

# Mike's example
library(flowCore)
fs <- read.flowSet(files = dir(system.file("extdata", "compdata", "data", package="flowCore"), full.names=TRUE))
fs <- fs[, -6] # excluding the redundant channel
spillover(fs, unstained = 1, fsc = "FSC-H", ssc = "SSC-H", stain_match = "ordered")
             FL1-H        FL2-H       FL3-H       FL4-H
FL1-H 1.0000000000 0.2420222776 0.032083706 0.001127816
FL2-H 0.0077220477 1.0000000000 0.140788232 0.002632689
FL3-H 0.0007590319 0.0009620459 0.003218614 1.000000000
FL4-H 0.0150806322 0.1755899032 1.000000000 0.229593860

# regexpr, result OK, unstained being called FL7-H
> sampleNames(fs) = c("FL7-H", "FL1-H", "FL2-H", "FL3-H", "FL4-H")
> spillover(fs, unstained = 1, fsc = "FSC-H", ssc = "SSC-H", stain_match = "regexpr")
              FL1-H        FL2-H       FL3-H       FL4-H 
FL1-H 1.0000000000 0.2420222776 0.032083706 0.001127816 
FL2-H 0.0077220477 1.0000000000 0.140788232 0.002632689 
FL3-H 0.0007590319 0.0009620459 0.003218614 1.000000000 
FL4-H 0.0150806322 0.1755899032 1.000000000 0.229593860 

# Same result although unstained is called FL1-H, ie this file should have been removed
> sampleNames(fs) = c("FL1-H", "FL2-H", "FL3-H", "FL4-H", "FL5-H") 
> spillover(fs, unstained = 1, fsc = "FSC-H", ssc = "SSC-H", stain_match = "regexpr")
              FL1-H        FL2-H       FL3-H       FL4-H 
FL1-H 1.0000000000 0.2420222776 0.032083706 0.001127816 
FL2-H 0.0077220477 1.0000000000 0.140788232 0.002632689 
FL3-H 0.0007590319 0.0009620459 0.003218614 1.000000000 
FL4-H 0.0150806322 0.1755899032 1.000000000 0.229593860 

# Same result although files are in the reverse order
> sampleNames(fs) = sprintf("FL%d-H", 5:1) 
> spillover(fs, unstained = 1, fsc = "FSC-H", ssc = "SSC-H", stain_match = "regexpr")
              FL1-H        FL2-H       FL3-H       FL4-H 
FL1-H 1.0000000000 0.2420222776 0.032083706 0.001127816 
FL2-H 0.0077220477 1.0000000000 0.140788232 0.002632689 
FL3-H 0.0007590319 0.0009620459 0.003218614 1.000000000 
FL4-H 0.0150806322 0.1755899032 1.000000000 0.229593860
gfinak commented 6 years ago

@SamGG Another old issue.. but clearly an unresolved one. The spillover function has been a pain since we took over maintenance. We have a new api spillover_ng which takes an external file that maps FCS file names to markers: https://github.com/RGLab/flowCore/blob/d167ef3aad70a80614f94009112c804119774255/R/flowSet-accessors.R#L786

I don't recall if I'd addressed the diagonal scaling issue, I'll have to revisit it.

The changes you posted on your Rpubs, would you be willing to submit a PR for the fixed version of spillover?

SamGG commented 6 years ago

@gfinak I am running out of time currently. As soon as possible, I will look into the API changes. Thanks.

SamGG commented 6 years ago

For memo, a description step by step of the compensation process: https://en.wikibooks.org/wiki/User:SBabovic/CytometRy

jacobpwagner commented 5 years ago

@SamGG I independently ran in to the ordering problem for the regexpr option for spillover using the same example while in the process of updating the vignette. Whenever you get a chance, will you see if commit f914eac5b6aeb97aa46ccc495184a501c406e435 fixes your issue? I'm also carefully going through the logic of spillover_ng, so if any bugs are found there I'll have fixes up soon.

jacobpwagner commented 5 years ago

@SamGG The spillover methods were restructured in 70949e0a122860583affc7680ccc4a94442ca2f3 and https://github.com/RGLab/flowStats/commit/17f4468399a0e71f85425babcd1d18cacb7f58e6. I believe the ordering and normalization issues here should be solved now as there was legitimately a bug regarding the ordering of rows before f914eac5b6aeb97aa46ccc495184a501c406e435.

SamGG commented 5 years ago

@jacobpwagner Many thanks for your work. I am currently overbooked and my mind is fully on Galaxy stuff. I will come back to cytometry next week. Best.

jacobpwagner commented 5 years ago

@SamGG , I'm going to go ahead and close this based on the fixes I put in. If any issues remain, please let me know.