adw96 / DivNet

diversity estimation under ecological networks
83 stars 18 forks source link

DivNet on Transcripts per Million (TPM) values derived from metagenomic data #141

Open lingyi-owl opened 9 months ago

lingyi-owl commented 9 months ago

Hi!

Thanks for developing the tool and the RUST extension. I have a viral metagenomic dataset with 100 samples. I assembled viral contigs from each sample, and clustered the contigs into vOTUs using MMSeq2. I used quantification tool Salmon (https://www.nature.com/articles/nmeth.4197) to quantify the vOTUs from each sample. The output of Salmon gave TPM values. The TPM values are floating numbers, instead of integers. I am wondering if I can use DivNet with the TPM values.

Could you please give me any thoughts?

Best, Lingyi

adw96 commented 9 months ago

Hi @lingyi-owl -- in theory, yes -- but in practice, no 😿

Are the TPMs large numbers? Could they be coerced to be integers without too much rounding?

I hope that helps!

mooreryan commented 9 months ago

@adw96 I'm curious why you say only in theory....I haven't carefully looked at the DivNet code in a minute, but if I recall the only places where the code really relies on the count matrix W being actual integers (as opposed to floating point) is when simulating the counts with rmultinom (like, when determining the size parameter of that function).

But interestingly, you can pass floating point value to the size parameter of the rmultinom function...in the C code, it looks like that argument will automatically get coerced into an integer (here). And it does seem to behave fine with the floating point size:

> set.seed(1234); a <- rmultinom(10, size = 150.5, prob = c(0.5, 0.3, 0.20))
> set.seed(1234); b <- rmultinom(10, size = 150,   prob = c(0.5, 0.3, 0.20))
> stopifnot(all.equal(a, b))

Anyway, it seems that as long as all of the TPM values are >= 1, things should work out okay numerically speaking...am I missing something?

mooreryan commented 8 months ago

I had a moment to look at this and sometimes divnet works fine with floats and other times it does not. The failures give this error: Error in { : task 3 failed - "missing value where TRUE/FALSE needed".

It's pretty interesting. I haven't had a chance to go through and try to diagnose the error, but I figured I would post this anyway.

Check out this reprex:

library(DivNet)
#> Loading required package: phyloseq
#> Loading required package: breakaway
library(ggplot2)
library(patchwork)

run_dn <- function(counts, mm, seed = 435432) {
  set.seed(seed)

  divnet(
    W = counts,
    X = mm,
    tuning = list(
      EMiter = 6, EMburn = 3, MCiter = 500, MCburn = 250, stepsize = 0.01
    ),
    perturbation = 0.05,
    network = "default",
    base = 1,
    variance = "parametric",
    B = 5,
    ncores = 1
  )
}

groups <- data.frame(
  group = c("A", "A", "B", "B"),
  row.names = c("A1", "A2", "B1", "B2")
)

mm <- model.matrix(~group, data = groups)

counts <- data.frame(
  A1 = c(100, 50, 25, 0, 0),
  A2 = c(110, 45, 30, 0, 0),
  B1 = c(8, 22, 25, 45, 55),
  B2 = c(10, 20, 30, 40, 50),
  row.names = c("T1", "T2", "T3", "T4", "T5")
) |>
  t()

# Add 0.01 to the first value.
counts2 <- counts
counts2[1, 1] <- counts[1, 1] + 0.01

# Add 0.01 to every value.
counts3 <- counts + 0.01

# Add 0.01 to every non-zero value.
counts4 <- ifelse(counts == 0, 0, counts + 0.01)

# Add 0.01 to every zero value.
counts5 <- ifelse(counts == 0, counts + 0.01, counts)

dn <- run_dn(counts, mm)
# dn2 <- run_dn(counts2, mm) # Fails
dn3 <- run_dn(counts3, mm)
# dn4 <- run_dn(counts4, mm) # Fails
dn5 <- run_dn(counts5, mm)

# Plot the ones that work.
plot(dn) + ylim(0.75, 1.6) +
  plot(dn3) + ylim(0.75, 1.6) +
  plot(dn5) + ylim(0.75, 1.6)

Created on 2023-12-08 with reprex v2.0.2

When it does successfully run, the results look fine.

lingyi-owl commented 7 months ago

Hi @mooreryan thanks for the response! It seems that I have too many features so the program did not run successfully.

I have a general question about if I should use TPM to do compositional data analysis for metagenomic data.

For studying community composition and diversity in a viral metagenomic dataset, I assembled viral contigs from each sample and clustered the contigs into vOTUs. I quantified the abundance of the vOTUs from each sample by mapping reads to the vOTUs to obtain a read abundance table. For compositional analysis, I should do a centered-log transformation on the abundance table, but before that, I think I should normalize the read abundance data based on the contig lengths. I found that people studying RNA-seq use transcript per million data as abundance data. TPM is a value that normalize the counts on both contig/transcript lengths and sequencing depth. Thus, the sum of all TPM values is the same in all samples, such that "a TPM value represents a relative expression level that, in principle, should be comparable between samples". I have a question about if I should use the TPM or the contig length normalized abundance data for centered-log ratio transformation? Thank you in advance!