Weights for Samples of LinDA Modelling

DarioS commented 1 week ago

I notice that the top-ranked bacteria are sometimes caused by outliers of linda analysis. This happens when some samples have five species and other species have fifty species detected, for instance. I would like linda to have an option for weighting of samples inversely to the number of species detected for each sample. lm has weights parameter. linda should also support weights.

cafferychen777 commented 1 week ago

Dear Dario Strbenac,

Thank you for your feedback regarding the LinDA analysis in MicrobiomeStat. I appreciate you bringing this to our attention. To ensure we address your concern effectively, could you please provide some additional clarification?

Could you elaborate on the specific scenarios where you've observed this issue with outliers affecting the top-ranked bacteria?
Do you have any examples or sample datasets that demonstrate this problem? This would be incredibly helpful for us to understand and replicate the issue.
Regarding your suggestion for weighting samples inversely to the number of species detected, could you explain in more detail how you envision this working? Are there any specific parameters or methods you had in mind?
Have you encountered similar weighting options in other tools or packages that you think could serve as a good model for implementing this in LinDA?

Your insights will greatly help us improve MicrobiomeStat and ensure it meets the needs of our users. Thank you for your time and contribution to the project.

Best regards, Caffery Yang

DarioS commented 1 week ago

Oops, inverse weights would not make sense. Standard weights would. We noticed the issue when we identified the same species of bacteria but opposite fold change to published by another team who also did qPCR validation. The comparison was healthy volunteers versus cancer-adjacent normal tissue. Note that two Healthy group samples have high proportions.

> healthy[1, ] # Just the ten healthy volunteer samples subset of the whole matrix.
                        Oral_1-N Oral_2-N Oral_3-N Oral_4-N Oral_5-N Oral_6-N Oral_7-N Oral_8-N Oral_9-N Oral_10-N
Cutibacterium acnes            0        0    0.113      0.1     0.69    0.025     0.71    0.093        0     0.000

The high proportion is an artefact of only three species detected for samples Oral_5-N and Oral_7-N.

> colSums(healthy > 0)
 Oral_1-N  Oral_2-N  Oral_3-N  Oral_4-N  Oral_5-N  Oral_6-N  Oral_7-N  Oral_8-N  Oral_9-N Oral_10-N 
        2         0        14        14         3         5         3         7         6        15

Code to reproduce the Cutibacterium acnes contradictory result is:

library(MicrobiomeStat)
load("testLindaRobustness.RData")
normalsCancerFit <- linda(bacteriaMatrixNormals, clinicalTableNormals, "~ Age + Smoking + Gender + `Tissue Type`",
                          "proportion")
results <- normalsCancerFit[["output"]][["`Tissue Type`Normal"]]
results["Cutibacterium acnes", c("log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]

                    log2FoldChange    lfcSE      stat       pvalue         padj
Cutibacterium acnes      -8.123449 1.859128 -4.369493 5.445916e-05 0.0009802648

If I plot the proportions, I see that there are two outliers in Healthy group and one outlier in Normal group.

library(ggplot2)
theme_set(theme_bw())
plotData <- data.frame(sampleID = colnames(bacteriaMatrixNormals),
                       proportion = bacteriaMatrixNormals["Cutibacterium acnes", ],
                       group = clinicalTableNormals[, "Condition"])
ggplot(plotData, aes(x = sampleID, y = proportion)) + geom_bar(stat = "identity") +
       facet_grid(cols = vars(group), scales = "free_x", space = "free_x") +
       geom_hline(yintercept = 1.00, linetype = "dashed", colour = "red") + 
       theme(axis.text.x = element_text(angle = 90)) + ggtitle("Cutibacterium acnes Proportion")

In statistics, it is popular to down-weight samples instead of discarding them.

Weighted regression is defined as “a generalization of linear regression where the covariance matrix of errors is incorporated in the model”. In simple terms, this means that not all samples are equal in the eyes of the data scientist, and this inequality should also reflect in the fitted model as well.

So, ideally, I would like to be able to do something like:

myWeights <- colSums(proportionsMatrix > 0)
sampleInfo$myWeights <- myWeights
linda(proportionsMatrix, sampleInfo, "~ status","proportion", weights = "myWeights")

Winsorisation makes almost no difference. I tried TRUE and FALSE. Would weights interfere with bias adjustment, though?

cafferychen777 / MicrobiomeStat

Weights for Samples of LinDA Modelling #59