const-ae / proDA

Protein Differential Abundance for Label-Free Mass Spectrometry https://const-ae.github.io/proDA/

Replacing 0's by NA's on log transformed data #3

Closed phauchamps closed 4 years ago

phauchamps commented 4 years ago

Hi Constantin,

I am currently doing an internship with Laurent Gatto's bioinformatics group at UCLouvain, and part of my assignment consists of exploring and testing the proDA package, since it provides interesting novel ideas for coping with missing data :-)

I think I found a bug in the proDA() function code, in the following part:

if (any(!is.na(data) & data == 0)) {
  warning(paste0("The data contains", sum(!is.na(data) & data == 0), " exact zeros. ",
                 "Replacing them with 'NA's."))
  data[!is.na(data) & data == 0] <- NA
}
if (!data_is_log_transformed) {
  data <- log2(data)
}

The problem is that when you submit log-transformed data, you might end up with values that are exactly zero without being missing at all. I ran into this issue myself and, as a work-around, had to un-log-transform the data before submitting them to proDA() with the 'data_is_log_transformed' flag set to FALSE, in order to avoid this behaviour.
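To illustrate, here is a minimal base R sketch with a made-up matrix. It applies the same condition as the check quoted above, without calling proDA() itself, and then shows why going back to the raw scale avoids the replacement. The matrix values and all variable names are assumptions for illustration only.

```r
# Hypothetical log2 intensities: the 0 in row 2 of s1 is a real value
# (e.g. the result of median centering), not a missing measurement.
log_data <- matrix(c(1.5, 0, -0.7,
                     2.1, NA, 0.4),
                   nrow = 3,
                   dimnames = list(paste0("protein_", 1:3), c("s1", "s2")))

# Same condition as in the quoted check
is_exact_zero <- !is.na(log_data) & log_data == 0
n_zeros <- sum(is_exact_zero)

# What the check does to the data: the valid 0 becomes NA
checked <- log_data
checked[is_exact_zero] <- NA

# Work-around: go back to the raw scale, where that value is 2^0 = 1,
# so the zero check no longer matches it
raw_data <- 2^log_data
n_zeros_raw <- sum(!is.na(raw_data) & raw_data == 0)
```

On the raw scale the check finds nothing to replace, and log2() of the raw values recovers the original log-scale data up to floating-point precision.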

Could I suggest that you only replace 0 by NA's if :

Could you possibly have a look into it ?

Thanks!

Philippe

const-ae commented 4 years ago

Hi Philippe,

thanks for your interest in the package. I am surprised that your log transformed data actually contains exact zeros. You are right that in theory this is of course possible, if the original intensity was 1, but I thought this was highly unlikely, that is why I added the check that you found in the source code.

Could you post a histogram of the original and the log2() transformed data like this:

library(proDA)

full_data <- read.delim(
  system.file("extdata/proteinGroups.txt", 
              package = "proDA", mustWork = TRUE),
  stringsAsFactors = FALSE
)

intensity_colnames <- grep("^LFQ\\.intensity\\.", colnames(full_data), value=TRUE)

# Create matrix which only contains the intensity columns
data <- as.matrix(full_data[, intensity_colnames])

hist(data, main = paste0(sum(data == 0), " exact zeros"))

summary(c(data))
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 0.000e+00 0.000e+00 0.000e+00 7.933e+06 1.725e+06 2.130e+09
hist(log2(data), main = paste0(sum(log2(data) == 0), " exact zeros"))

summary(c(log2(data)))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    -Inf    -Inf    -Inf    -Inf   20.72   30.99

Created on 2019-10-26 by the reprex package (v0.3.0)

Best, Constantin

phauchamps commented 4 years ago

Hi Constantin,

Thanks for your swift reply!

Actually, the reason why I get exact zeros is that I first log-transform my data and then normalize them by centering around the median per sample.

I am using the MSnbase package and I do the following:

image

image

image

The normalization around the median gives exactly zero for one value per column if the number of proteins without NA in that column is odd. In my case, I have 6 samples, and I got 3 values that were exactly zero after the transformation. This is why I ran into the issue, which I worked around as mentioned in my previous comment.
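The effect can be reproduced in a few lines of base R. This is only a sketch with invented numbers, not the MSnbase workflow from the screenshots:

```r
# Hypothetical log2 intensities for one sample with an odd number (5)
# of observed proteins and one missing value
sample_log <- c(20.3, 22.8, NA, 18.1, 25.6, 21.4)

# Centre the sample around its median, ignoring missing values
centered <- sample_log - median(sample_log, na.rm = TRUE)

# With an odd count, the median is one of the observed values,
# so that value becomes exactly 0 after centering
n_zeros_odd <- sum(centered == 0, na.rm = TRUE)

# With an even count, the median is the mean of the two middle values,
# so for these numbers no value ends up exactly 0
even_log <- c(20.3, 22.8, 18.1, 25.6)
n_zeros_even <- sum(even_log - median(even_log) == 0)
```

The exact zero in the odd case is a valid centred intensity, which is why treating it as missing loses a real observation.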

In any case, I do think the standard behaviour of proDA() should not be that all exact zeros are automatically replaced by NA's. The user should be able to control which of the data they provide are to be considered NA's and which are not, especially when log-transformed data are provided directly. I would advise letting the user control this through an additional parameter like 'areZeroNAs' with default value 'F', or something similar.
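Something along these lines is what I have in mind. This is a rough sketch only: the helper `replace_zeros()` and its argument `zeros_are_na` are hypothetical names, not part of proDA.

```r
# Hypothetical helper, not part of proDA: the caller decides whether
# exact zeros should be treated as missing values.
replace_zeros <- function(data, zeros_are_na = FALSE) {
  if (zeros_are_na) {
    data[!is.na(data) & data == 0] <- NA
  }
  data
}

# Median-centred values where the 0 is a valid observation
centered <- c(-1.2, 0, 0.8, NA)

kept     <- replace_zeros(centered)                       # 0 is kept
replaced <- replace_zeros(centered, zeros_are_na = TRUE)  # 0 becomes NA
```

With a default of FALSE, log-transformed input passes through unchanged unless the user explicitly asks for zeros to be treated as missing.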

What do you think ?

Kind Regards

Philippe

phauchamps commented 4 years ago

Sorry, my last histogram was not the correct one. The following is the correct one:

image

const-ae commented 4 years ago

Hi, sorry for the delay. I am currently stuck in a course, so I couldn't be as active as I would have liked.

I have changed the code so that it now only converts 0's to NA's if there are no NA's in the dataset, as you suggested.
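As a rough illustration of that rule (a sketch written for this thread, not the actual code in the package; the function name `zeros_to_na_if_no_na()` is made up):

```r
# Sketch of the rule as described above, not the actual proDA code
zeros_to_na_if_no_na <- function(data) {
  has_na  <- any(is.na(data))
  is_zero <- !is.na(data) & data == 0
  if (!has_na && any(is_zero)) {
    warning(sum(is_zero), " exact zeros and no NA's found; ",
            "replacing the zeros with NA's.")
    data[is_zero] <- NA
  }
  data
}

# Data without any NA, where 0 presumably encodes "missing": converted
raw_like <- suppressWarnings(zeros_to_na_if_no_na(c(1500, 0, 3200)))

# Data that already uses NA for missing values: the exact 0 is kept
log_like <- zeros_to_na_if_no_na(c(-1.2, 0, NA))
```

The idea is that a dataset that already contains NA's has its missing values marked explicitly, so any remaining exact zeros are taken to be real values.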

I will close this issue for now. If I missed something or the fix works differently than you expect, feel free to re-open.

Best, Constantin

phauchamps commented 4 years ago

Thanks Constantin! By the way, I will shortly send you an e-mail about another proDA-related topic, which might lead to a much more interesting discussion :-) Best, Philippe

const-ae commented 4 years ago

Great, looking forward to it :)

phauchamps commented 4 years ago

Could you possibly provide me with a non-public e-mail address that I can use to send you an e-mail? Thanks, Philippe

