biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License

formatting full factorial formula #136

Closed sformel closed 3 years ago

sformel commented 3 years ago

I'm new to multinomial regression and to patsy. Although I've been reading through all the excellent documentation, I'm having a hard time wrapping my head around the correct way to code a typical two-way ANOVA-type model. I was able to get songbird to give me results that look like what I want, but I wanted to verify I wasn't doing anything foolish with regard to the assumptions of songbird or patsy. The main reason I'm suspicious is that I can't seem to find a clear example of this on the internet, despite it being such a common experimental setup.

Imagine I have two categorical factors with two levels (in the classical stats sense, not in the patsy sense of factor):

var1: top, bottom
var2: left, right

I expected the formula

"var1*var2"

to return a differentials file with 5 columns, following the k-1 rule mentioned in the songbird docs. After delving into the patsy docs I realized that expectation was incorrect, though I don't fully comprehend why. I eventually arrived at this formula to get songbird to spit out the expected 5 columns:

"var1 + var2 + C(var1, Treatment('top')):var2 + C(var1, Treatment('bottom')):var2"

So I have three specific questions:

  1. Does this formula break any assumptions of patsy and or songbird?
  2. Is it correct to consider each interaction output column a pairwise comparison? i.e. does the column heading:

C(var1, Treatment('top'))[T.bottom]:var2[left]

mean that these are the ranked differentials between "top,left" and "bottom,right"?

  3. Is there a more succinct way to code this?

Thanks so much for all the hard work you've put into creating and maintaining this tool! I sincerely appreciate any thoughts you might have.

mortonjt commented 3 years ago

The formula var1 * var2 will expand to var1 + var2 + var1:var2, which gives you independent contrasts for var1 and var2 in addition to a single interaction term var1:var2. Below is a quick dummy example to visualize this:

import pandas as pd
from patsy import dmatrix

# toy metadata: 20 samples covering all combinations of the two 2-level factors
df = pd.DataFrame({'var1': ['top', 'bottom'] * 10, 'var2': ['left'] * 10 + ['right'] * 10})
dmatrix('var1 * var2', df, return_type='dataframe')
    Intercept  var1[T.top]  var2[T.right]  var1[T.top]:var2[T.right]
0         1.0          1.0            0.0                        0.0
1         1.0          0.0            0.0                        0.0
2         1.0          1.0            0.0                        0.0
3         1.0          0.0            0.0                        0.0
4         1.0          1.0            0.0                        0.0
5         1.0          0.0            0.0                        0.0
6         1.0          1.0            0.0                        0.0
7         1.0          0.0            0.0                        0.0
8         1.0          1.0            0.0                        0.0
9         1.0          0.0            0.0                        0.0
10        1.0          1.0            1.0                        1.0
11        1.0          0.0            1.0                        0.0
12        1.0          1.0            1.0                        1.0
13        1.0          0.0            1.0                        0.0
14        1.0          1.0            1.0                        1.0
15        1.0          0.0            1.0                        0.0
16        1.0          1.0            1.0                        1.0
17        1.0          0.0            1.0                        0.0
18        1.0          1.0            1.0                        1.0
19        1.0          0.0            1.0                        0.0

If you notice, var1[T.top]:var2[T.right] only equals 1 if both var1[T.top] and var2[T.right] are equal to 1.
The regression that this encodes is given by y = b0 + b1*var1 + b2*var2 + b3*(var1*var2). The interaction effect should allow you to probe the differences (b3) due to the combination of var1 and var2.

I would think that this should encode the interaction terms that you are looking for (I think your formula is over-parameterized). Worst case scenario, you can run multiple models with different interactions. But I have a feeling that if you run var1 * var2, you should be able to derive all of the pairwise comparisons from that single interaction term (it may take some thought though); a rough sketch of what that could look like is below.
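
As a rough sketch of that derivation (this is not songbird's API; the file path and differential column names below are assumptions and will depend on which reference levels patsy picks in your run):

import pandas as pd

# hypothetical differentials file exported from songbird; adjust the path and
# column names to match your own run
diffs = pd.read_csv('differentials.tsv', sep='\t', index_col=0)
b1 = diffs['var1[T.top]']                # top vs bottom (at the var2 reference level, left)
b2 = diffs['var2[T.right]']              # right vs left (at the var1 reference level, bottom)
b3 = diffs['var1[T.top]:var2[T.right]']  # extra effect of combining top and right

# With (bottom, left) as the reference cell, the cell differentials are:
#   bottom,left = 0; top,left = b1; bottom,right = b2; top,right = b1 + b2 + b3
pairwise = pd.DataFrame({
    'top,left vs bottom,left': b1,
    'bottom,right vs bottom,left': b2,
    'top,right vs bottom,left': b1 + b2 + b3,
    'top,right vs top,left': b2 + b3,
    'top,right vs bottom,right': b1 + b3,
    'top,left vs bottom,right': b1 - b2,
})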

sformel commented 3 years ago

Thanks @mortonjt, that's a very clear explanation. The matrix really helps me visualize it, and I agree that is what I was looking for, I just didn't recognize it as such. You're absolutely right that all the pairwise comparisons can be made through the three columns. I'll see if I can come up with a quick way to sort them out.