ebecht / infinityFlow


Warning: Some data values of 'CD49E.LASSO2' channel exceed its $PnR value 175864747 and will be truncated! To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE' #20

Closed denvercal1234GitHub closed 1 year ago

denvercal1234GitHub commented 1 year ago

Hi there,

Thanks again for the package.

While exporting results, I got the warning below for almost all markers, even though I used PeacoQC::RemoveMargins to filter out events greater than 2e+06 before running infinityFlow. My data were acquired on a spectral Cytek Aurora.

Do you usually set truncate_max_range = FALSE in this scenario? If so, how can I do that, given that infinity_flow is a single function?

The argument extra_args_read_FCS applies to reading the input, but I did not find an analogous argument for the output FCS files.

Thank you for your help.

[Screenshots of the warning messages, 2023-03-06]

ebecht commented 1 year ago

Hi,

I think R shows warnings at the end of the function call, so I would still expect these warnings to be produced by data import. That is odd because, judging by the parameter names, these channels appear to have already been produced by the regression models. So I am a bit confused about what is going on here! Are there any additional details you could share?

Thanks, Etienne

denvercal1234GitHub commented 1 year ago

Hi @ebecht - These warnings were produced after the function infinity_flow(). My input to infinity_flow() was untransformed FCS files that had undergone PeacoQC::RemoveMargins() with minRange = -Inf, maxRange = 2000000 for all channels (except Time, FSC, and SSC).

I used fluorescence-minus-PE as "Blank" for Isotype.

The check all(names(TRIAL21_F37_annotation) %in% list.files(path_to_fcs)) returned FALSE, but all(names(isotypes) %in% list.files(path_to_fcs)) returned TRUE.

In the backbone specification file, I "discard" FJComp-AF-A -- do you do this too?

I used XGBoost, SVM, LASSO2, and LM.

Below is the code:

TRIAL_fcs_data <- read.flowSet(path=".../FCS_trial", pattern="*.fcs", transformation = FALSE, truncate_max_range = FALSE)

channels <- c("FJComp-AF-A", 
              "FJComp-AF594-A", 
              "FJComp-APC-A",
              "FJComp-APC-Fire 750-A", "FJComp-PE-A",...)

channelSpecs <- list(
              "FJComp-AF-A" = c(minRange = -Inf, maxRange = 2000000), 
              "FJComp-AF594-A" = c(minRange = -Inf, maxRange = 2000000), ...,
              "FJComp-PE-A" = c(minRange = -Inf, maxRange = 2000000), ...)

TRIAL_fcs_data_RemovedMargin <- list()

for (i in seq_len(length(TRIAL_fcs_data))){
    TRIAL_fcs_data_RemovedMargin[[i]] <- PeacoQC::RemoveMargins(ff=TRIAL_fcs_data[[i]], channels=channels, channel_specifications = channelSpecs, output="frame")
    # Name each filtered frame after its original file name ($FIL keyword)
    names(TRIAL_fcs_data_RemovedMargin)[i] <- TRIAL_fcs_data[[i]]@description$`$FIL`
    }

## Optional: convert the list to a flowSet for use with flowViz
## (as() comes from the methods package, not flowCore)
TRIAL_fcs_data_RemovedMargin_fs <- methods::as(TRIAL_fcs_data_RemovedMargin, "flowSet")

My annotation file

[Screenshot of the annotation file, 2023-03-08]

My backbone specification

[Screenshots of the backbone specification file, 2023-03-08]

regression_functions <- list(
    XGBoost = fitter_xgboost, # XGBoost
## Passed to fitter_nn, e.g. neural networks through keras::fit. See https://keras.rstudio.com/articles/tutorial_basic_regression.html
#NN = fitter_nn,
    SVM = fitter_svm, # SVM
    LASSO2 = fitter_glmnet, # L1-penalized 2nd degree polynomial model
    LM = fitter_linear # Linear model
)

backbone_size <- table(read.csv(TRIAL21_backbone_selection_file)[,"type"])["backbone"]
backbone_size

extra_args_regression_params <- list(

    list(nrounds = 500, eta = 0.05),

    list(type = "nu-regression", cost = 8, nu=0.5, kernel="radial"),

    list(alpha = 1, nfolds=10, degree = 2),

    list(degree = 1)
)

if(length(regression_functions) != length(extra_args_regression_params)){
    stop("Number of models and number of lists of hyperparameters mismatch")
}

imputed_data <- infinity_flow(
    regression_functions = regression_functions,
    extra_args_regression_params = extra_args_regression_params,
    path_to_fcs = "..../FCS_trial",
    path_to_output = "..../TRIAL_21FCS/output",
    path_to_intermediary_results = "..../TRIAL_21FCS/tmp",
    backbone_selection_file = TRIAL21_backbone_selection_file,
    annotation = targets,
    isotype = "Blank",
    input_events_downsampling = Inf,
    prediction_events_downsampling = 1000,
    verbose = TRUE,
    #Note: there is an issue with serialization of the neural networks and socketing since I updated to R-4.0.1. If you want to use neural networks, please make sure to set cores = 1L
    cores = cores,
    neural_networks_seed = NULL
)

ebecht commented 1 year ago

The thing is, I don't see any extra arguments in flowCore::write.FCS about data truncation, so I am a bit puzzled by what you are showing. Are you sure the input files do not contain channels named CD244.LASSO2, for instance?

In the backbone specification file, I "discard" FJComp-AF-A -- do you do this too?

I don't have much experience with spectral flow cytometry so I don't really know the answer to that question. My intuition is that it is a potentially informative feature of the cells so it may be worth keeping, unless it somehow biases prediction...

denvercal1234GitHub commented 1 year ago

Thanks @ebecht for your response. The input FCS files definitely do not have any channels with .LASSO2 in their names.

denvercal1234GitHub commented 1 year ago

Hi @ebecht -- According to the flowCore issue tracker, the $PnR check and data truncation happen in read.FCS rather than write.FCS (https://github.com/RGLab/flowCore/issues/169).

In https://support.bioconductor.org/p/130629/, the advice was to call range(fr, type = "data") after read.FCS() to view the min/max taken from the actual data.

fr <- read.FCS(fcs_file_path)
range(fr) # will default to instrument range
range(fr, type = "data") # Will use the actual values, as in summary

But this only inspects the range; it does not change the data. Responses in https://github.com/RGLab/flowCore/issues/169 clarified that the data may already be truncated by read.FCS, so we would just need to switch truncation off when reading the file: read.FCS(fcs_file_path, truncate_max_range=F).

Would you mind informing me how we could set truncate_max_range=F within infinity_flow()?
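
For readers who want to see which channels trigger this kind of warning, here is a minimal base R sketch of the check. The helper name channels_exceeding_range and the toy values are hypothetical and only for illustration. With flowCore, the data matrix would presumably come from exprs(fr) and the declared maxima from range(fr); that mapping is an assumption, not something confirmed in this thread.

```r
# Hypothetical helper: report channels whose observed maximum exceeds the
# declared upper bound (the role played by $PnR in an FCS file).
channels_exceeding_range <- function(values, declared_max) {
    stopifnot(is.matrix(values), !is.null(colnames(values)))
    # Align the declared maxima with the column order of the data
    declared_max <- declared_max[colnames(values)]
    observed_max <- apply(values, 2, max, na.rm = TRUE)
    over <- observed_max > declared_max
    # Count, per channel, how many events lie above the declared maximum
    n_over <- colSums(sweep(values, 2, declared_max, ">"))
    data.frame(
        channel = colnames(values)[over],
        observed_max = unname(observed_max[over]),
        declared_max = unname(declared_max[over]),
        n_events_over = unname(n_over[over]),
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

# Toy data standing in for an expression matrix; all values are made up
toy_values <- cbind(
    "FJComp-PE-A" = c(10, 500, 1500),
    "CD49E.LASSO2" = c(20, 3e8, 4e8)
)
toy_declared <- c("FJComp-PE-A" = 2e6, "CD49E.LASSO2" = 175864747)

report <- channels_exceeding_range(toy_values, toy_declared)
print(report)
```

A table like this shows both which channels are affected and how many events are involved, which helps decide whether truncation matters in practice.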

denvercal1234GitHub commented 1 year ago

Interestingly, when I checked the code of infinity_flow() (https://github.com/ebecht/infinityFlow/blob/master/R/00_master.R), line 16 indicates that truncate_max_range is already set to FALSE when FCS files are read by infinity_flow().

And in the default values for infinity_flow(), looks like this has already been set to FALSE?

extra_args_read_FCS = list(emptyValue = FALSE, truncate_max_range = FALSE,
    ignore.text.offset = TRUE)

Perhaps, this issue here helps?

denvercal1234GitHub commented 1 year ago

Hi @ebecht -- I also realized that the warning "Some data values of '.....LASSO2' channel exceed its $PnR value 722220591 and will be truncated!" appeared only when I set input_events_downsampling <- Inf.

When I set input_events_downsampling <- 2000, there was no warning. It also appears that only markers predicted by LASSO2 trigger this warning.

Q1. If the value of a channel really exceeds 722,220,591, wouldn't we want to remove those events anyway?

Q2. In the output plots for some markers such as CD99, the predictions of the 4 models in files that did not contain the CD99-PE antibody did not match the actual staining in the file that did contain it. For some other markers, the prediction was okay. Is there a quantitative way to decide which model is better for which marker (instead of inspecting plots manually)? Or, if we should pick one model for all files, is there a quantitative way to decide which model is best?

Thanks for your help!

ebecht commented 1 year ago

Hi @denvercal1234GitHub

As I suggested in my first reply, this warning does sound like it is produced by flowCore::read.FCS. The only place where this happens in the code is when input data is read (see here). These input files should not have a channel called something.LASSO2.

As for models, we showed in our paper that LASSO2 (2nd degree polynomial models) were quite bad in this context and shouldn't be used. In general we recommended XGBoost. SVMs, NNs and XGBoost give overall very similar results so it does not matter much which one you choose.
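
As a rough illustration of how models could be compared numerically rather than by eye, the sketch below scores each model's predictions against the measured signal on events where the marker was actually stained. This is not the evaluation used in the paper: the helper score_models, the choice of Spearman correlation and RMSE, and the synthetic data are all assumptions made for the example, and the column labels are placeholders for real model outputs.

```r
# Hypothetical helper: one row of scores per model (column of `predictions`).
score_models <- function(measured, predictions) {
    stopifnot(is.matrix(predictions), length(measured) == nrow(predictions))
    data.frame(
        model = colnames(predictions),
        # Rank agreement between prediction and measurement
        spearman = apply(predictions, 2, cor, y = measured, method = "spearman"),
        # Root mean squared error on the same scale as the data
        rmse = apply(predictions, 2, function(p) sqrt(mean((p - measured)^2))),
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

# Synthetic data only: "measured" mimics a transformed PE signal, and the two
# columns mimic a low-error and a high-error model.
set.seed(1)
measured <- rnorm(200)
predictions <- cbind(
    XGBoost = measured + rnorm(200, sd = 0.2),
    LM = measured + rnorm(200, sd = 1.5)
)

scores <- score_models(measured, predictions)
print(scores[order(scores$rmse), ])
```

For the scores to be meaningful, they would need to be computed on events that were held out from model fitting and on the same transformed scale used for the plots.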

denvercal1234GitHub commented 1 year ago

Thanks @ebecht for your response. That is very strange, because my input files indeed do not have any channel called something.LASSO2, yet infinity_flow still threw this warning. Also, the warning only happens when input_events_downsampling <- Inf and does not happen when, for example, input_events_downsampling <- 2000.

ebecht commented 1 year ago

That is likely because, when you use 2000 events, the random downsampling happens to drop the problematic ones.

In any case, I'd encourage you to not use linear and polynomial models which we have shown to be less accurate, so if the warnings are not produced by the other models I think you can safely ignore them.
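
The two points above can be illustrated with a small base R sketch. It uses made-up numbers and an unpenalized lm() fit as a stand-in for a 2nd-degree polynomial model, so it is not the package's fitter_glmnet. The first part shows how a quadratic fit can predict values far outside the training range for one very bright event. The second part, assuming downsampling is uniformly random, computes the chance that a sample of 2000 events contains none of a handful of such events.

```r
set.seed(42)

# Part 1: a quadratic model extrapolates without bound.
# Made-up backbone intensity x and target signal y (not real cytometry data).
x <- runif(5000, 0, 1e4)
y <- x^2 / 1e4 + rnorm(5000, sd = 50)
fit <- lm(y ~ poly(x, 2, raw = TRUE))

typical <- data.frame(x = c(100, 5000, 9000))
extreme <- data.frame(x = 2e6)  # one rare, very bright event
pred_typical <- predict(fit, typical)
pred_extreme <- predict(fit, extreme)
print(pred_typical)  # stays within the range seen in training
print(pred_extreme)  # orders of magnitude larger than any training value

# Part 2: probability that random downsampling misses all extreme events.
# Hypothetical counts: 5 extreme events among 1e6, sample of 2000.
n_total <- 1e6
n_extreme <- 5
n_sampled <- 2000
p_none <- dhyper(0, m = n_extreme, n = n_total - n_extreme, k = n_sampled)
print(p_none)
```

Under these assumed counts the downsampled run would almost always miss the extreme events, which is consistent with the warning disappearing at 2000 events.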

denvercal1234GitHub commented 1 year ago

Great. Thank you, Etienne!