fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0

How fastlmm deal with "NA" in covariate file? #4

Closed baolichang closed 4 years ago

baolichang commented 4 years ago

Hello: This is not really an issue, but I'm not sure how fastlmm deals with samples that have missing covariates. Are the samples still included in the calculation, with the missing covariate values filled in as the mean/median of all samples (or through imputation)? Or are the samples dropped from the analysis? Thank you!

CarlKCarlK commented 4 years ago

Baolichang,

Thanks for using FaSTLMM. Great question and one that I did not know the answer to until I looked into it.

It doesn't! Specifically, when I looked at single_snp and single_snp_scale, I found that any NaN's will end up producing a NaN in the final output.
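As a rough illustration of why a single missing covariate can surface as NaN in the output, the sketch below uses plain NumPy on made-up numbers. It does not reproduce FaST-LMM's internals; it only shows that cross-products of the kind used in linear-model fitting carry a NaN forward. The names `covar_val` and `pheno` are hypothetical.

```python
import numpy as np

# Toy covariate matrix (3 samples x 2 covariates) with one missing entry.
covar_val = np.array([[np.nan, 0.2],
                      [0.5,    0.5],
                      [0.9,    0.3]])
pheno = np.array([1.0, 2.0, 3.0])

# Cross-products such as X^T y and X^T X appear in linear-model fitting.
xty = covar_val.T @ pheno
xtx = covar_val.T @ covar_val

# Every entry that touches the column with the missing value becomes NaN.
print(np.isnan(xty))
print(np.isnan(xtx))
```

Only the entries computed entirely from the complete column stay finite, so anything downstream that combines the columns inherits the NaN.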

Workaround: fill missing values with the column mean as a preprocessing step.

    # Generate some sample data
    import numpy as np
    from pysnptools.snpreader import SnpData

    # Create a SnpReader for the covariate. Here it is an in-memory SnpData,
    # but it could be something like covar = Pheno(filename)
    np.random.seed(92392)
    covar = SnpData(iid=[('f0','iid0'),('f0','iid2'),('f0','iid3')],
                    sid=['height','weight'],
                    val=np.random.random((3,2)))
    covar.val[0,0] = np.nan
    covar.val[2,1] = np.nan
    covar.val

    array([[       nan, 0.17374433],
           [0.54600136, 0.52250945],
           [0.87375742,        nan]])

The actual workaround that fills in missing values:

    # Assumes covar is a SnpReader, e.g. covar = Pheno(filename)
    covar = covar.read(view_ok=True)  # Turn the covariate into an in-memory SnpData

    # Fill columns with mean
    # See https://stackoverflow.com/questions/18689235/numpy-array-replace-nan-values-with-average-of-columns
    inds = np.where(np.isnan(covar.val))
    covar.val[inds] = np.take(np.nanmean(covar.val, axis=0), inds[1])
    covar.val

    array([[0.70987939, 0.17374433],
           [0.54600136, 0.52250945],
           [0.87375742, 0.34812689]])

In contrast, for SNP data, any missing values are filled in with the mean. For a single phenotype, any individuals with missing values are removed. single_snp_scale also supports multiple phenotypes. If there are multiple phenotypes and any value is missing, it raises an error.

Do you think we should change the code to fill in missing covariates automatically? Or is it safer to just have users do this preprocessing step as needed?

Carl

Carl Kadie, Ph.D. FaST-LMM & PySnpTools Team (Microsoft Research, retired). Join the FaST-LMM user discussion and announcement list via email (mailto:fastlmm-user-join@python.org?subject=Subscribe) or use the web sign-up (https://mail.python.org/mailman3/lists/fastlmm-user.python.org).


(Replying to baolichang's message sent Tuesday, February 18, 2020 7:42:26 AM.)


baolichang commented 4 years ago

Hi Carl: Thank you so much for the detailed reply!

It may make more sense to let users decide what to do with the missing values in the covariate file. In addition to filling in with the mean, some may prefer imputation or simply dropping the samples with missing covariates (a waste, but if mean filling or imputation may introduce bias, it is better not to use those samples).
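The "drop the samples" option mentioned above can be sketched in plain NumPy as follows. This is a minimal sketch on the sample values from earlier in the thread, not part of FaST-LMM; `complete_rows` is a hypothetical helper name.

```python
import numpy as np

def complete_rows(val):
    """Return a boolean mask that is True for rows (samples) with no NaN."""
    return ~np.isnan(val).any(axis=1)

# Same shape as the sample covariates above: 3 samples x 2 covariates.
val = np.array([[np.nan,     0.17374433],
                [0.54600136, 0.52250945],
                [0.87375742, np.nan]])

keep = complete_rows(val)   # one flag per sample
val_complete = val[keep]    # only the fully observed samples remain
print(keep)
print(val_complete)
```

With a PySnpTools reader, the same mask could presumably be used to subset individuals (something like `covar[keep, :]`), but that is an assumption to check against the PySnpTools indexing documentation, along with how the phenotype and SNP inputs are matched to the remaining individuals.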

Thanks again for your help!

Bao-Li

(Replying to Carl Kadie's message sent Tuesday, February 18, 2020 3:36 PM.)


CarlKCarlK commented 4 years ago

Bao-Li,

Excellent suggestions. At the very least I’ll add a check for missing values and raise a clear error (in single_snp and single_snp_scale).
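Until such a check exists in the library, a user-side guard along these lines could be run on the covariate values before calling single_snp. It is only a sketch of the idea, not the check that was or will be added to FaST-LMM; `check_no_missing` and its arguments are hypothetical.

```python
import numpy as np

def check_no_missing(val, names):
    """Raise ValueError naming every column of `val` that contains NaN."""
    val = np.asarray(val, dtype=float)
    bad = np.isnan(val).any(axis=0)  # one flag per column
    if bad.any():
        bad_names = [name for name, flag in zip(names, bad) if flag]
        raise ValueError(
            "Covariates contain missing values in column(s): "
            + ", ".join(bad_names)
        )

# Passes silently when every value is present.
check_no_missing([[0.5, 0.5], [0.9, 0.3]], ["height", "weight"])

# Reports the offending column when a value is missing.
try:
    check_no_missing([[np.nan, 0.2], [0.5, 0.5]], ["height", "weight"])
except ValueError as err:
    message = str(err)
    print(message)
```

For an in-memory SnpData, the arguments would presumably be `covar.val` and `covar.sid`, as used in the workaround earlier in the thread.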

(Replying to baolichang's message sent Thursday, February 20, 2020 10:50 AM.)
