broadinstitute / tensorqtl

Ultrafast GPU-enabled QTL mapper
BSD 3-Clause "New" or "Revised" License
162 stars 52 forks source link

Interaction term on categorical variable #56

Open jvierstra opened 2 years ago

jvierstra commented 2 years ago

Is is possible run cis mapping with categorical interaction terms? I am trying to consider cell type in addition to genotype. My understanding of the GTEx paper is that the interaction term was continuous (the XCell enrichment variable) and the tensorQTL was run for each "cell type" at a time.

Thanks, Jeff

francois-a commented 2 years ago

Hi Jeff, Do you mean for terms with >2 categories? That's correct, in the GTEx paper we used the model y ~ g + i + g:i with continuous i for cell type, testing for significance of the g:i interaction term. We also used this model to map sex-biased (coded as 0/1) QTLs.

jvierstra commented 2 years ago

Yes, multiple categories.

In my particular case, we have ~150 individuals, each with 4 cell types. We would like to identify SNVs that cell type-selective QTLs and in which cell type(s).

y ~ g + cell_type + g:cell_type

cell_type would be encoded as a categorical variable (like T cell, B cell, etc).

francois-a commented 2 years ago

Ok, makes sense. It would be relatively straightforward to support multiple interaction terms (I've been planning to do this for some time), basically replacing the vector i with a matrix, for example dummy-coded categorical covariates. I should be able to add this fairly quickly.

jvierstra commented 2 years ago

Awesome -- thanks much!

Another issue I ran into is that the phenotype file column header must match the individuals plink file. Because I have multiple cell types for a single individual, I current have to replicate each individual in the genotype file (plink) for each cell type. It would be useful if a separate file could be used that defines which genotype corresponds to which phenotype, instead of requiring them to be the same in each file.

BTW -- this software has been amazingly helpful -- we are using it to detect chromatin QTLs in DNase I data with large phenotype matrices (>500K) and its handles them with little effort.

francois-a commented 2 years ago

The latest commit should work with multiple interaction terms. You can define them in a TSV/dataframe, as with the single interaction term before (the TSV requires a header if multiple interaction terms are provided).

For single interaction terms we had noticed some artifacts for variants with low MAF, and ended up applying a MAF filter in the top and bottom halves of the interaction term (here), defined with maf_threshold_interaction; this is currently not applied when multiple interactions are provided.

And glad to hear it! This is precisely the scale of analyses we were aiming to enable.

jvierstra commented 2 years ago

I will try this out this week and get back yo you. Thanks for helping me out!

jvierstra commented 2 years ago

I just tested this out and ran into a bug (could be a different torch version) See: https://github.com/broadinstitute/tensorqtl/pull/57#issue-1190764210

jvierstra commented 2 years ago

I just fully tested out your code and I think there is some errors when dealing with categorical interaction terms.

I made some changes to the calculate_interaction_nominal function in core.py. The variable categorical_interactions is set to true when the interaction matrix is binary.

For categorical interaction terms, I only center the genotypes and phenotypes and then apply the interaction terms to the centered genotype matrix. In this case, I also do not center the interaction terms.

After these changes, it appears to work without considering covariates, however, I can not get it to work when applying the residualizer (covariates).

Do you have any ideas on how i can fix this?

 # centered inputs
    g0_t = genotypes_t - genotypes_t.mean(1, keepdim=True)  # genotypes x samples
    if categorical_interactions:
        gi0_t = g0_t.unsqueeze(2) * interaction_t.unsqueeze(0) # just multiple centered genotypes by binary interaction term
    else:
        # center g:i afterwards when using continuous interaction terms
        gi_t = genotypes_t.unsqueeze(2) * interaction_t.unsqueeze(0) # genotypes x samples x interactions
        gi0_t = gi_t - gi_t.mean(1, keepdim=True)  # mean across samples

    # check if interactions matrix is binary; if so, do not center it
    if categorical_interactions:
        i0_t = interaction_t # keep binary
    else:
        i0_t = interaction_t - interaction_t.mean(0)  # samples x interactions

    p0_t = phenotypes_t - phenotypes_t.mean(1, keepdim=True)  # 1 x samples

    # residualize rows
    if residualizer is not None:
        p0_t = residualizer.transform(p0_t, center=False)
        g0_t = residualizer.transform(g0_t, center=False)

        i0_t = residualizer.transform(i0_t.t(), center=False).t()
        for k in range(i0_t.shape[1]):
            gi0_t[..., k] = residualizer.transform(gi0_t[..., k], center=False)

    i0_t = i0_t.repeat(ng, 1, 1)
jvierstra commented 2 years ago

Nevermind about the above. I ended up re-writing the interaction code from scratch and now allow for R-style formulae and any combination of categorical and continuous terms (and interactions).

See the code changes here: https://github.com/jvierstra/tensorqtl/commit/45e37379715c33d277bfe92a8da406968bd99f4f

image