adw96 / DivNet

diversity estimation under ecological networks
83 stars 18 forks source link

Understanding the use of "X" in divnet() function #53

Closed gregpoore closed 3 years ago

gregpoore commented 4 years ago

Hi Amy,

I'm in Rob Knight's lab and excited to see (and use) your DivNet package — congrats on the in press acceptance! I have a question regarding the use of the covariate matrix ("X") in the divnet() function. The example given in your tutorial uses the Lee dataset, picking "char" (basalts of different characteristics) as the covariate. I understand that this is useful when sampling from different environments, but I'm trying to figure out if it would include samples from the same environment that are affected by external factors. A few examples:

Is it permissible to use these differences as covariates ("X") for the divnet() function (e.g. antibiotic vs. non-antibiotic gut samples) even though they're technically sampled from the same environment?

Thanks a ton!

adw96 commented 4 years ago

Hi Greg,

Thanks for bringing up this question. Choosing what variables to include in your model is not trivial and depends on your scientific question as well as your experimental design. I generally recommend taking at least a couple of statistics classes (up to multiple regression) to give some context for model fitting.

In short, the model that you would fit for DivNet is what you would include if you were fitting a simple or multiple regression on a univariate outcome. While I don't know the design of your experiments that you describe, here are some things that I would think about including in your models.

(Note that here I'm using R's + notation for adding variables to a model -- I'm not saying sum up the variables!)

I'm sorry that there isn't a more concrete answer for you, and that it depends so much on the setting that you're in. I'm happy to try to answer any more detailed questions you have, though. Statistical consultants at your institution can often advise on what variables to include in the model, too.

Amy

gregpoore commented 4 years ago

Hi Amy,

Thank you so much for the helpful and thorough reply! The only issue is that it doesn't seem possible to add a blocking variable (or any multivariate model) in the divnet() function. Looking at the code underlying the divnet() function (see below; emphasis added to parts with ** and arrows) seems to suggest that it will only accept a NULL or single character variable input (i.e. not a model.matrix() input required to add the blocking variable).

Code:

> divnet
function (W, X = NULL, fitted_model = NULL, tuning = NULL, perturbation = NULL, 
    network = NULL, base = NULL, ncores = NULL, variance = "parametric", 
    B = 5, nsub = NULL, ...) 
{
    if ("phyloseq" %in% class(W)) {
        input_data <- W
        W <- input_data %>% otu_table %>% as.matrix
        suppressWarnings({
            class(W) <- "matrix"
        })
        if (phyloseq::taxa_are_rows(input_data)) 
            W <- W %>% t
        samples_names <- input_data %>% sample_names

        ------> **if (is.character(X))** { <------

            X <- breakaway::make_design_matrix(input_data, X)
        }

       ------> **else if (is.null(X))** { <------

            xx <- input_data %>% sample_data %>% rownames
            X <- model.matrix(~xx)
        }
    }

Also, it appears that the breakaway::make_design_matrix(input_data, X) function will also only accept character variables for X. Do you know if there is any way to modify it so that a model.matrix can be used? I know that it already can be used in the betta() function.

Regarding the regression model: In my case, I have about ~20 metastases in the same tissue that originated from primary tumors of two different cancer types (half for each cancer type) across different patients. The data are shotgun sequencing for microbes with ~7000 features for the relatively small number of samples. I'm inclined, per your suggestion, to use X = tumor site of origin + patient but am curious if patient should be a blocking factor if all the patients are different. Could you please confirm this?

Thank you so much again! (Apologies if I misunderstood anything)

Sincerely, Greg

adw96 commented 4 years ago

Hi Greg! I had completely forgotten that I hadn't implemented formula specification for Divnet. I'm going to ask my co-author @bryandmartin to implement this ASAP.

@bryandmartin -- can you please make sure that DivNet maintains backwards compatibility but also allows for a formula? I think the corncob interface is great so perhaps we could just copy that. If you can write some testthat tests to make sure that it maintains backwards compatibility that would be great!

Greg -- if every observation comes from a different patient then definitely don't include patient as a variable -- otherwise your model will be way overparameterised! However, if you want to add more than one variable to the model, you can adapt the following code. It employs model.matrix to build a covariate matrix, and then passes it to DivNet. Feel free to ping me with questions about how it works! We will have the formula notation ready for you soon, too!

# Load the data set
data(Lee)
lee_phylum <- speedyseq::tax_glom(Lee, taxrank="Phylum")

## Get only the samples for which temperature is available.
## Unfortunately temperature is a factor, which is silly,
## so we have to do it like this.
lee_complete_ps <- lee_phylum %>%
  subset_samples(temp %in% c("12.7", "13.5", "13.7", "2.0", "7.3", "8.6"))

# build the covariate matrix
temps <- lee_complete_ps %>% sample_data %>% pull(temp) %>% as.character %>% as.numeric
types <- lee_complete_ps %>% sample_data %>% pull(type)
my_multivariate_x <- model.matrix(~ temp + type, data = data.frame(temp = temps, type = types))

# fit model and estimate diversity
lee_complete_dv <- lee_complete_ps %>%
  divnet(X=my_multivariate_x, ncores = 4)

# visualise
lee_complete_dv$shannon %>% summary %>%
  bind_cols("temp" = temps, "type" = types) %>%
  ggplot(aes(x = temp, y = estimate, col = type)) +
  geom_point() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  ylab("Estimated Shannon diversity \n& confidence intervals") +
  xlab("Temperature (C)") +
  theme_bw()
paulinetrinh commented 3 years ago

We've updated the documentation for Getting Started to reflect the various ways to specify covariates in the divnet() function.