Open DarioS opened 1 week ago
Dear Dario Strbenac,
Thank you for your feedback regarding the LinDA analysis in MicrobiomeStat. I appreciate you bringing this to our attention. To ensure we address your concern effectively, could you please provide some additional clarification?
Could you elaborate on the specific scenarios where you've observed this issue with outliers affecting the top-ranked bacteria?
Do you have any examples or sample datasets that demonstrate this problem? This would be incredibly helpful for us to understand and replicate the issue.
Regarding your suggestion for weighting samples inversely to the number of species detected, could you explain in more detail how you envision this working? Are there any specific parameters or methods you had in mind?
Have you encountered similar weighting options in other tools or packages that you think could serve as a good model for implementing this in LinDA?
Your insights will greatly help us improve MicrobiomeStat and ensure it meets the needs of our users. Thank you for your time and contribution to the project.
Best regards, Caffery Yang
Oops, inverse weights would not make sense. Standard weights would. We noticed the issue when we identified the same species of bacteria but opposite fold change to published by another team who also did qPCR validation. The comparison was healthy volunteers versus cancer-adjacent normal tissue. Note that two Healthy group samples have high proportions.
> healthy[1, ] # Just the ten healthy volunteer samples subset of the whole matrix.
Oral_1-N Oral_2-N Oral_3-N Oral_4-N Oral_5-N Oral_6-N Oral_7-N Oral_8-N Oral_9-N Oral_10-N
Cutibacterium acnes 0 0 0.113 0.1 0.69 0.025 0.71 0.093 0 0.000
The high proportion is an artefact of only three species detected for samples Oral_5-N and Oral_7-N.
> colSums(healthy > 0)
Oral_1-N Oral_2-N Oral_3-N Oral_4-N Oral_5-N Oral_6-N Oral_7-N Oral_8-N Oral_9-N Oral_10-N
2 0 14 14 3 5 3 7 6 15
Code to reproduce the Cutibacterium acnes contradictory result is:
library(MicrobiomeStat)
load("testLindaRobustness.RData")
normalsCancerFit <- linda(bacteriaMatrixNormals, clinicalTableNormals, "~ Age + Smoking + Gender + `Tissue Type`",
"proportion")
results <- normalsCancerFit[["output"]][["`Tissue Type`Normal"]]
results["Cutibacterium acnes", c("log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]
log2FoldChange lfcSE stat pvalue padj
Cutibacterium acnes -8.123449 1.859128 -4.369493 5.445916e-05 0.0009802648
If I plot the proportions, I see that there are two outliers in Healthy group and one outlier in Normal group.
library(ggplot2)
theme_set(theme_bw())
plotData <- data.frame(sampleID = colnames(bacteriaMatrixNormals),
proportion = bacteriaMatrixNormals["Cutibacterium acnes", ],
group = clinicalTableNormals[, "Condition"])
ggplot(plotData, aes(x = sampleID, y = proportion)) + geom_bar(stat = "identity") +
facet_grid(cols = vars(group), scales = "free_x", space = "free_x") +
geom_hline(yintercept = 1.00, linetype = "dashed", colour = "red") +
theme(axis.text.x = element_text(angle = 90)) + ggtitle("Cutibacterium acnes Proportion")
In statistics, it is popular to down-weight samples instead of discarding them.
Weighted regression is defined as “a generalization of linear regression where the covariance matrix of errors is incorporated in the model”. In simple terms, this means that not all samples are equal in the eyes of the data scientist, and this inequality should also reflect in the fitted model as well.
So, ideally, I would like to be able to do something like:
myWeights <- colSums(proportionsMatrix > 0)
sampleInfo$myWeights <- myWeights
linda(proportionsMatrix, sampleInfo, "~ status","proportion", weights = "myWeights")
Winsorisation makes almost no difference. I tried TRUE
and FALSE
. Would weights interfere with bias adjustment, though?
I notice that the top-ranked bacteria are sometimes caused by outliers of
linda
analysis. This happens when some samples have five species and other species have fifty species detected, for instance. I would likelinda
to have an option for weighting of samples inversely to the number of species detected for each sample.lm
hasweights
parameter.linda
should also support weights.