corbinq / apex

Toolkit for QTL mapping and meta-analysis.
https://corbinq.github.io/apex/
16 stars 1 forks source link

NaN results #5

Closed llandsmeer closed 3 years ago

llandsmeer commented 3 years ago

Dear APEX authors,

I just APEX on our genotypes+RNA expression dataset. It runs fine in QTLtools, but when I run it in APEX I get this as output:

1       693731  1010717 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1035805 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1131236 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1658945 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1793786 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
[... continues for 1832 rows, each n_cis_variants + 4 columns]

One possible hint might be that the cis_gene_table has all egene_pval at 0.5 and all resid_sd at -nan (and n_cis_variants in the expected range of a few thousand SNPs).

I performed all the necessary steps to convert the vcf/bed/covariates to the required input formats and tried with both NA values replaced by 0 or kept as in in the covariates.

What could possible cause this?

Kind regards, Lennart Landsmeer

corbinq commented 3 years ago

Hi Lennart,

Thanks for the bug report. Please share the following information:

Thanks, Corbin

On Sun, Jul 18, 2021, 8:01 AM Lennart Landsmeer @.***> wrote:

Dear APEX authors,

I just APEX on our genotypes+RNA expression dataset. It runs fine in QTLtools, but when I run it in APEX I get this as output:

1 693731 1010717 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan 1 693731 1035805 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan 1 693731 1131236 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan 1 693731 1658945 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan 1 693731 1793786 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan [... continues for 1832 rows, each n_cis_variants + 4 columns]

One possible hint might be that the cis_gene_table has all egene_pval at 0.5 and all resid_sd at -nan (and n_cis_variants in the expected range of a few thouasand SNPs).

What could possible cause this?

Kind regards, Lennart Landsmeer

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/corbinq/apex/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWQAM45HBJ4ZEBEWWNM5XDTYK7ANANCNFSM5ASCJCGQ .

llandsmeer commented 3 years ago

Hi Corbin,

Thanks for answering. This is the command:

apex cis \
    -b APEX_TRAIT.bed.gz \
    -c APEX_COV_1A_NaNto0 \
    -v "${gen_imputed}.chr1.vcf.gz" \
    -o apex_1a/chr1

where the two APEX_* files are files converted from QTLtools input format by removing the strand column from the trait file (data had no missing values and was already filtered and inverse normal transformed) & the factors in the covariates file were split into distinct dummy 0/1 variables (tried both removing NA's or adding a separate dummy for NA values, NA values for the continuous covariatas where also either kept as is or replaced with 0) & QTLtools covariate exclusion list already applied to the covariate file

This is the resulting CLI output (containing sample size, no of traits & no of covariates)

Using 2 threads.
1 present in both bcf and bed file.
581258 total variants on selected chromosomes.

Found 2124 samples in bcf file ...
Found 2322 samples in covariate file ...
Found 624 samples in expression bed file ...
Found 624 samples in common across all three files.

Processed data for 108 covariates across 624 samples.
Processed expression for 1838 genes across 624 samples.
Processed genotype data for 581258 variants.

Using OLS (no GRM specified).
Started cis-QTL analysis ...
Scaling expression traits ...
Calculating genotype-covariate covariance...
Calculating genotype residual variances ...Done.
Calculating expression residuals...
Scaling expression residuals ...
Processed 206 cis-QTL blocks out of 206 total
llandsmeer commented 3 years ago

To check whether it was because of the covariates, I redid the analysis using only PC1 & PC2 of the genotype:

20 present in both bcf and bed file.
167044 total variants on selected chromosomes.

Found 2124 samples in bcf file ...
Found 2322 samples in covariate file ...
Found 624 samples in expression bed file ...
Found 624 samples in common across all three files.

Processed data for 2 covariates across 624 samples.
Processed expression for 491 genes across 624 samples.
Processed genotype data for 167044 variants.

Using OLS (no GRM specified).
Started cis-QTL analysis ...
Scaling expression traits ...
Calculating genotype-covariate covariance...
Calculating genotype residual variances ...Done.
Calculating expression residuals...
Scaling expression residuals ...
Processed 54 cis-QTL blocks out of 54 total

Now it sees to work (no more NaN)!

Edit: Redid analysis with all dummy variables/factors removed -> same result (no NaN) What could possible happen that creates NaN values if I include those dummy variables?

corbinq commented 3 years ago

Hi,

Thanks for sharing the command and narrowing down the issue.

APEX does not currently support missing data in the covariate or trait file. (It does, however, support missing genotypes, and defaults to impute-to-mean.)

For covariate and trait data, you could first apply listwise deletion or impute-to-mean to generate new input files, and then pass those files to APEX.

I will make a note to make sure we return an explicit error message when missing traits or covariates are encountered.

Best, Corbin

On Sun, Jul 18, 2021 at 11:18 AM Lennart Landsmeer @.***> wrote:

To check whether it was because of the covariates, I redid the analysis using only PC1 & PC2 of the genotype:

20 present in both bcf and bed file. 167044 total variants on selected chromosomes.

Found 2124 samples in bcf file ... Found 2322 samples in covariate file ... Found 624 samples in expression bed file ... Found 624 samples in common across all three files.

Processed data for 2 covariates across 624 samples. Processed expression for 491 genes across 624 samples. Processed genotype data for 167044 variants.

Using OLS (no GRM specified). Started cis-QTL analysis ... Scaling expression traits ... Calculating genotype-covariate covariance... Calculating genotype residual variances ...Done. Calculating expression residuals... Scaling expression residuals ... Processed 54 cis-QTL blocks out of 54 total

Now it sees to work (no more NaN)!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/corbinq/apex/issues/5#issuecomment-882072775, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWQAM3QNAMK32DNJ37FLNDTYLWC3ANCNFSM5ASCJCGQ .

llandsmeer commented 3 years ago

Dear Corbin,

Thanks for the fast response.

I do not have missing data in my covariates (or anywhere else)

Kind regards, Lennart Landsmeer

llandsmeer commented 3 years ago

Ok I narrowed 1 specific case down even further.

I have 3 dummy variables for sample sex. M, F & Missing. All are 0/1 vectors. If I include only 1 of those or any combination of 2 (M&F, F&Missing or Missing&M) I do not get NaN values. If I include all three (M&F&Missing) I do get NaN's

Edit: This simple pattern does sadly not hold in general, and the way how different combinations of dummy variables might or might not produce NaN's looks pretty random to me. For example, combining the Hospital dummy's (fine by itself) with the Male&Female dummy's (also fine) results in NaN's

corbinq commented 3 years ago

Aha. This is an issue of perfect multicollinearity. APEX appends an intercept column to the covariate matrix. If you include dummy variables for all 3 values, you have a perfect linear dependency in the covariate design matrix. In that case, X^T X is not invertible, and hence the NaNs.

If your factor (sex) has 3 levels, then you need to include dummy variables for only 2 of the values. The particular choice of which two levels (missing, male, female) you choose to include as dummy variables and which is reference has no effect on the eQTL association results.

In the future, we could check for perfect multicollinearity in the design matrix to avoid this issue.

On Sun, Jul 18, 2021 at 11:49 AM Lennart Landsmeer @.***> wrote:

Ok I narrowed 1 specific case down even further.

I have 3 dummy variables for sample sex. M, F & Missing. All are 0/1 vectors. If I include only 1 of those or any combination of 2 (M&F, F&Missing or Missing&M) I do not get NaN values. If I include all three (M&F&Missing) I do get NaN's

Edit: This simple pattern does sadly not hold in general, and the way how different combinations of dummy variables might or might not produce NaN's looks pretty random to me

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/corbinq/apex/issues/5#issuecomment-882076728, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWQAMZB66K7LPF26PLPKBDTYLZYTANCNFSM5ASCJCGQ .

llandsmeer commented 3 years ago

Yes it is indeed perfect multicollinearity in the M/F/Missing case. However, I just did a VIF test for the Hospital+M+F case and they are not perfectly multicollinear, even after sample overlapping (altough VIF is pretty high), but this still generates NaN values.

corbinq commented 3 years ago

Hmm, strange. To be sure, in R, you might try constructing a design matrix X with four columns,

X = [Intercept, Hospital-dummy, Male-dummy, Female-dummy]

Then check whether solve(crossprod(X)) fails.

To be sure, you will want to subset X to the individuals that are present across all three files. It could be invertible in the overall sample, but not in the subset.

On Sun, Jul 18, 2021, 12:45 PM Lennart Landsmeer @.***> wrote:

Yes it is indeed perfect multicollinearity in the M/F/Missing case. However, I just did a VIF test for the Hospital+M+F case and they are not perfectly multicollinear (altough VIF is pretty high), but this generate NaN

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/corbinq/apex/issues/5#issuecomment-882084390, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWQAM2Z77WSONXEKHLLLX3TYMAKZANCNFSM5ASCJCGQ .

llandsmeer commented 3 years ago

Yes I was just reading through the code and implemented the make_half_hat_matrix function in python. numpy.linalg.eig(X.T@X) works, the design matrix cross product is invertible (and I only looked at the covbedvcf sample overlap). No idea what goed wrong :(

corbinq commented 3 years ago

Thanks. If you can send the covariate data, or just the [Sex, Hospital] columns to @.***, I could try to take a look later.

On Sun, Jul 18, 2021 at 1:55 PM Lennart Landsmeer @.***> wrote:

Yes I was just reading through the code and implemented the make_half_hat_matrix function in python. numpy.linalg.eig(X.T@X) works, the design matrix cross product is invertible (and I only looked at the cov bedvcf sample overlap). No idea what goed wrong :(

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/corbinq/apex/issues/5#issuecomment-882094850, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWQAMZ6QCH5TPN5DJKR7S3TYMIQ5ANCNFSM5ASCJCGQ .

llandsmeer commented 3 years ago

So after more trials - it seemed like there was an even more complicated perfect colinearity problem. After automatically one-by-one removal of all VIF=Inf covariates APEX runs smoothly again. Sorry for taking up your time and thanks for the support