ebecht / infinityFlow


Marker_XGBoost_bgc_intensity on UMAP plots #19


Cicevk commented 1 year ago

Hi,

Many thanks for the package.

While analysing my data I realised that on the UMAP plots of certain markers, cells are labelled with high intensity (orange to red), yet when I look at those markers in my FlowJo analysis, they are not expressed at all compared to the isotype control. Conversely, markers for which my cells are 100% positive relative to the isotype control in FlowJo appear very red on the UMAP only in certain clusters/domains, while other cells are even on the bluish side.

Now I would like to understand why this is, and whether I should somehow correct for it, since intuitively I would expect the opposite in the two cases described above. Could the backbone I am choosing influence this, or does it have to do with the z-scoring or the nonlinear regression model I am using (in this case I am focusing only on XGBoost)?

I was wondering if the extra_args_read_FCS argument would help to avoid truncating marker intensities while reading from the .fcs files, but realised that these options are already the default: extra_args_read_FCS=list(emptyValue=FALSE,truncate_max_range=FALSE,ignore.text.offset=TRUE)
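
For reference, a minimal sketch of passing these read options explicitly; the paths and the `annotation`/`isotype` objects below are placeholders, set up as in the basic-usage vignette:

```r
library(infinityFlow)

# Placeholder paths and annotation objects -- replace with your own,
# prepared as in the "Basic usage of the infinityFlow package" vignette.
infinity_flow(
    path_to_fcs = "fcs_input/",
    path_to_output = "infinityflow_output/",
    backbone_selection_file = "backbone_selection_file.csv",
    annotation = targets,  # named vector: input file -> exploratory target
    isotype = isotypes,    # named vector: input file -> isotype control
    extra_args_read_FCS = list(
        emptyValue = FALSE,          # tolerate empty FCS keyword values
        truncate_max_range = FALSE,  # do not clip intensities at $PnR
        ignore.text.offset = TRUE    # trust HEADER offsets over TEXT segment
    )
)
```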

Many thanks and have a great day.

ebecht commented 1 year ago

Hello,

Two things come to mind that could be causing this:

  1. As you mentioned, the z-score transformation applied to the predicted intensities before plotting on the UMAP could be responsible. I think that in some instances it makes sense not to use it (especially when there was no specific staining for that antibody).
  2. Are you using "background removal"? In some cases this may lead to positive cells appearing dimmer than they are. If you don't have much autofluorescence / unspecific binding, or are working on pre-gated data (e.g. T cells only), I think it may be better to work on the non background-corrected data (see the sketch after this list).
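
A quick way to see what background correction does to a given marker is to compare the raw and background-corrected imputed channels side by side. A minimal sketch, assuming both versions of the channel are available in your output; the `_bgc` suffix is inferred from this issue's title, so check `colnames()` on your own files:

```r
library(flowCore)

# Hypothetical file and channel names -- check colnames() for your run.
ff  <- read.FCS("infinityflow_output/my_output_file.fcs",
                truncate_max_range = FALSE)
mat <- exprs(ff)
raw <- mat[, "CD4.XGBoost"]      # imputed intensity, no background correction
bgc <- mat[, "CD4.XGBoost_bgc"]  # imputed intensity, background-corrected

plot(density(raw), main = "CD4 imputed intensity", xlab = "Intensity")
lines(density(bgc), lty = 2)
legend("topright", legend = c("raw", "background-corrected"), lty = 1:2)
```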

All the best, Etienne

denvercal1234GitHub commented 1 year ago

Hi @ebecht. Thank you for your response to the questions above.

For Q1, would you mind elaborating a bit more on how to "not use" the z-score transformation for the predictions?

The "Training non default regression models" and "Basic usage of the infinityFlow package" tutorials are very useful to run the package, but do you also have any recommendations/guideline/tutorials on how to assess/QC the imputed values? This would be very useful and help more people feel comfortable to using the imputed data.

Thank you again for your help!

ebecht commented 1 year ago

Hello,

If you do not want to use the z-score, you will need to use the output FCS files to produce the plots you want! I am not planning to expand the plotting functionalities, as this is not core to the package.
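
For instance, here is a minimal sketch of colouring the UMAP by the raw (non z-scored) imputed intensity from one output FCS file. The file and channel names, including the UMAP1/UMAP2 parameter names, are assumptions, so check `colnames()` on your own files:

```r
library(flowCore)

# Hypothetical file and channel names -- check colnames() for your run.
ff   <- read.FCS("infinityflow_output/my_output_file.fcs",
                 truncate_max_range = FALSE)
mat  <- exprs(ff)
umap <- mat[, c("UMAP1", "UMAP2")]  # assumed parameter names
pred <- mat[, "CD4.XGBoost"]        # raw imputed intensity, no z-score

# Colour events by raw intensity instead of the z-scored value
pal <- colorRampPalette(c("blue", "grey80", "red"))(100)
bin <- cut(pred, breaks = 100, labels = FALSE)
plot(umap, pch = 16, cex = 0.2, col = pal[bin],
     xlab = "UMAP1", ylab = "UMAP2")
```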

For QC, I agree with you that this is currently lacking, and I am planning to add support for it in the future (as soon as I return to academia, basically). I will add plots to identify saturation and to look at predictions versus measurements on held-out events (akin to the right panel of Sup Fig 2 of the paper). You should be able to generate the latter quite easily from the split FCS outputs, by plotting the measured PE signal against the corresponding predictions. Events in the output files are all from held-out data, so this should work fine.
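
A sketch of that measured-versus-predicted plot from one split output file; the file and channel names are assumptions to adapt to your own output (the measured exploratory channel is typically the PE detector):

```r
library(flowCore)

# Hypothetical file and channel names -- adapt to your output directory.
ff  <- read.FCS("infinityflow_output/FCS/split/well_001.fcs",
                truncate_max_range = FALSE)
mat <- exprs(ff)
measured  <- mat[, "PE-A"]         # measured exploratory signal
predicted <- mat[, "CD4.XGBoost"]  # imputed signal for this well's target

smoothScatter(measured, predicted,
              xlab = "Measured PE", ylab = "Predicted PE",
              main = "Held-out events: prediction vs measurement")
```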

In the paper we use the AUC as a performance metric, which is the best we could come up with. I don't think it is necessary for QC, though: it is a bit tedious to compute, since you need to discretize the measured PE signal for every well beforehand. Visual examination of these plots should be enough.
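
If you do want a number, the AUC reduces to a rank statistic once the measured PE signal is discretized. A minimal sketch; the positivity threshold is a per-well choice (e.g. taken from the isotype control), not something the package sets for you:

```r
# Mann-Whitney AUC: probability that a random measured-positive event
# gets a higher predicted intensity than a random measured-negative one.
auc_mw <- function(predicted, measured, thresh) {
    pos <- measured > thresh  # discretize the measured PE signal
    r   <- rank(predicted)    # ties handled by midranks
    n1  <- sum(pos)
    n0  <- sum(!pos)
    (sum(r[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# e.g. auc_mw(predicted, measured, thresh = 1000)
```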

denvercal1234GitHub commented 1 year ago

Thank you so much, Etienne! Looking forward to the QC tool support soon!

Cicevk commented 1 year ago

Hi,

Thank you for discussing this issue. I understand what the code does better now.

Many thanks.

denvercal1234GitHub commented 1 year ago

Hi @ebecht - Thanks again for all your help so far.

A reviewer came back with a major suggestion, arguing that we cannot rely solely on visualisation of the imputed vs actual staining of a marker to assess whether the imputed values for that marker from a given model are sufficient.

They ask that we provide a more quantitative metric to decide how well the imputed expression matches the actual expression for a given marker (and to do so for every marker).

I know you are very busy at the moment, but do you know of any write-up on how to do this (e.g., computing the AUC and r as done in your paper), in case you have not yet had time to write the QC code you mentioned above?

Thanks for your support!

denvercal1234GitHub commented 1 year ago

Hi @ebecht -- I hope all is well. In Fig. 2 of the paper (screenshot below), most "r" values are below 0.6 even though the AUC is >0.8.

In the paper, it says "Exhaustive manual examination showed that 155 of 252 phenotypic markers (61.5%, excluding isotype controls and autofluorescence measurements) yielded meaningful imputed signal." What does this "exhaustive manual examination" entail? And how did you set the thresholds to decide that 61.5% of the files had meaningful imputed signal?

I was trying to calculate at least the correlation coefficient (r) between the measured PE and the imputed PE using Spearman and Kendall, but each FCS file has 25K cells (rows) for the 2 parameters (measured PE and imputed PE), which include ties, so the approximate p-values were all "0" and I did not know how to set a threshold to select which imputed markers were acceptable. Manually visualising the scatter plots was not sufficient for the reviewers...

Would you mind providing some input on how to calculate the r/AUC? Thank you so much.
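
A sketch of a per-well QC table, under the same hypothetical file and channel names as above. Note that with ~25K events essentially any non-zero correlation is "significant", so the p-value is uninformative; thresholding on the magnitude of r (and/or an AUC like the one sketched earlier) is the more defensible choice:

```r
library(flowCore)

# Hypothetical paths and channel names -- adapt to your output.
files <- list.files("infinityflow_output/FCS/split",
                    pattern = "\\.fcs$", full.names = TRUE)

qc <- sapply(files, function(f) {
    mat <- exprs(read.FCS(f, truncate_max_range = FALSE))
    measured  <- mat[, "PE-A"]
    # Pick the imputed channel matching this well's target (first
    # XGBoost channel here, for illustration only).
    predicted <- mat[, grep("XGBoost", colnames(mat))[1]]
    # Rank-based and robust to ties; ignore the p-value at this n.
    cor(measured, predicted, method = "spearman")
})

sort(qc, decreasing = TRUE)  # rank markers by agreement
```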

[Screenshot 2023-05-21: Fig. 2 from the paper]

denvercal1234GitHub commented 1 year ago

Hi @ebecht -- I hope all is well. I was wondering if you might have had a chance to look into the code for the QC tools you mentioned above? Looking forward to them whenever they become available. Thanks again very much for your help.