tascCODA to analyse compositional changes in scRNAseq between case and control (taking into account covariates such as age and sex)

bio-datascience / tascCODA

tree-aggregated compositional analysis for high-throughput sequencing data

BSD 3-Clause "New" or "Revised" License

tascCODA to analyse compositional changes in scRNAseq between case and control (taking into account covariates such as age and sex) #3

Open mohebg opened 1 year ago

mohebg commented 1 year ago

Hi, Good day, thank you for the nice package.

I have some questions on how to use tascCODA to account for covariates such as age and sex when assessing compositional changes between case and control in scRNAseq.

In your paper you state: "More generally, however, tascCODA enables to determine how host phenotype, such as disease status, host covariates such as age, gender, or an individual’s demographics, or environmental factors jointly influence the compositional counts"

Shall the formula be written like this:

tree_mod = ana.CompositionalAnalysisTree(datax.copy(), reference_cell_type="automatic", formula="PATH+age+sex", reg="scaled_3", pen_args={"phi": 0, "lambda_1": 1.7})

"PATH" is the metadata with "Case" vs "Control" labels.
Would making the formula "PATH+age+sex" regress out age/sex in the case vs control comparison?
I also wanted to ask what the following arguments mean: reg="scaled_3" and pen_args={"phi": 0, "lambda_1": 1.7}?

Thank you very much in advance.

Best Moheb

johannesostner commented 1 year ago

Hi @mohebg, thanks for your interest in tascCODA!

The "formula" parameter determines, as in R's lm function, which covariates are considered for modeling. Currently, tascCODA performs model selection for all covariates in the formula, meaning that we check whether the effect is significant for every covariate/tree node pair. At the moment it is not possible to adjust for a covariate without also running model selection on it, although this might become possible in a future update.

Regarding the other arguments, you can ignore the reg parameter. It is only needed for switching between earlier versions of the tree-aggregated penalization scheme. The one described in the paper is "reg_3", which is also the default. With the pen_args parameter, you can set the phi (aggregation bias) and lambda_1 (regularization strength) values as they are described in the paper.

I hope that this answers your questions!

mohebg commented 1 year ago

Hi @johannesostner ,

Thanks a lot for your prompt reply. As I understand it, the best practice for adjusting for a covariate (that is, statistically eliminating its effect) is to simply add the covariate to the linear model. As you stated, the formula is R-style, so in order to regress out age and sex, should the formula be written like this: formula="PATH+age+sex"?

"PATH" is the metadata with "Case" vs "Control" labels.

So, would making the formula "PATH+age+sex" regress out age/sex in the case vs control comparison?

Thanks a lot

johannesostner commented 1 year ago

Yes, just add the covariate to the model. That's what I would do as well. As I said earlier, this does not "regress out" age/sex, but tascCODA will try to find significant impacts of age/sex and adjust for them accordingly. If age/sex don't have a significant impact on the composition, they also won't be adjusted for. In that regard, it's not a standard adjustment for the covariates.

johannesostner commented 1 year ago

Also, please make sure that all covariates are scaled to the same range (e.g. [0, 1]), as the selection of significant associations will otherwise be biased.

mohebg commented 1 year ago

@johannesostner, thanks a lot for your reply, I appreciate it. I am not sure I fully understand the sentence "covariates are scaled to the same range (i.e. [0-1])".

I have three covariates:

pathology vs control - which is a categorical covariate
male vs female - which is a categorical covariate
age is a numeric continuous covariate

johannesostner commented 1 year ago

Just make sure that age is also scaled to a range between 0 and 1 (e.g. via min-max scaling, like we did in the microbiome application of our paper). Otherwise the effects for age will be very small numerically, since its range is so much bigger than that of the categorical covariates (which are encoded as 0/1), and thus they will never be selected as significantly different from 0.
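
For illustration, min-max scaling of a continuous covariate could look roughly like the sketch below. This is plain Python and not part of tascCODA; the function name `min_max_scale` and the sample values are hypothetical, and the 0/1 encoding of sex is written out only to show that both covariates end up on the same range.

```python
def min_max_scale(values):
    """Linearly rescale numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant covariate has no spread to rescale; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-sample metadata (assumed values, for illustration only).
ages = [34, 52, 61, 45, 78]
sex = ["male", "female", "female", "male", "female"]

# Age now lies in [0, 1], the same range as a 0/1-encoded categorical covariate.
age_scaled = min_max_scale(ages)
sex_encoded = [1 if s == "male" else 0 for s in sex]

for a, a_s, s in zip(ages, age_scaled, sex_encoded):
    print(f"age={a:>3}  age_scaled={a_s:.3f}  sex={s}")
```

If the metadata lives in a pandas DataFrame, the same rescaling of a column `s` can be written as `(s - s.min()) / (s.max() - s.min())`; the scaled column would then be the one referenced in the formula instead of the raw age.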