Closed baolichang closed 4 years ago
Baolichang,
Thanks for using FaST-LMM. Great question, and one that I did not know the answer to until I looked into it.
It doesn't handle them! Specifically, when I looked at single_snp and single_snp_scale, I found that any NaNs in the covariates will end up producing a NaN in the final output.
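To see why a single missing covariate value can contaminate a result, here is a minimal NumPy sketch. It is not FaST-LMM's code: the matrix `X` and the cross-product are illustrative assumptions, standing in for the kind of linear algebra a regression performs on covariates.

```python
import numpy as np

# Hypothetical covariate matrix: 3 individuals x 2 covariates, one value missing.
X = np.array([[np.nan, 0.17],
              [0.55,   0.52],
              [0.87,   0.35]])

# Any sum that touches the NaN is itself NaN, so every entry of the
# cross-product X^T X that involves the first covariate is NaN.
xtx = X.T @ X
print(np.isnan(xtx))
```

Only the entry built purely from the second covariate stays finite; anything computed downstream from the other entries would be NaN as well.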
Workaround: fill in missing values with the column mean as a preprocessing step.
Generate some sample data:

import numpy as np
from pysnptools.snpreader import SnpData

np.random.seed(92392)
covar = SnpData(iid=[('f0','iid0'),('f0','iid2'),('f0','iid3')],
                sid=['height','weight'],
                val=np.random.random((3,2)))
covar.val[0,0] = np.nan
covar.val[2,1] = np.nan
covar.val

array([[       nan, 0.17374433],
       [0.54600136, 0.52250945],
       [0.87375742,        nan]])
The actual workaround that fills in the missing values:

covar = covar.read(view_ok=True)  # Turn the covariate into an in-memory SnpData
inds = np.where(np.isnan(covar.val))  # (row, column) indices of the missing entries
covar.val[inds] = np.take(np.nanmean(covar.val, axis=0), inds[1])  # column means, ignoring NaN
covar.val

array([[0.70987939, 0.17374433],
       [0.54600136, 0.52250945],
       [0.87375742, 0.34812689]])
In contrast, for SNP data, any missing values are filled in with the mean. For a single phenotype, any individuals with missing values are removed. single_snp_scale also supports multiple phenotypes; if there are multiple phenotypes and any value is missing, it raises an error.
Do you think we should change the code to fill in missing covariates automatically? Or is it safer to just have users do this preprocessing step as needed?
Carl
Carl Kadie, Ph.D. FaST-LMM & PySnpTools Team (Microsoft Research, retired) Join the FaST-LMM user discussion and announcement list via email (mailto:fastlmm-user-join@python.org?subject=Subscribe) or web sign-up (https://mail.python.org/mailman3/lists/fastlmm-user.python.org)
From: baolichang notifications@github.com Sent: Tuesday, February 18, 2020 7:42:26 AM To: fastlmm/FaST-LMM FaST-LMM@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [fastlmm/FaST-LMM] How fastlmm deal with "NA" in covariate file? (#4)
Hello: This is not really an issue, but I'm not sure how fastlmm deals with samples with missing covariates? Are the samples still included in the calculation but the missing covariate values are filled in as mean/median of all samples (or through imputation)? Or the samples are dropped out of the analysis? Thank you!
Hi Carl: Thank you so much for the detailed reply!
It may make more sense to have users decide what to do with the missing values in the covariate file. In addition to filling in with the mean, some may prefer imputation, or may simply drop the samples with missing covariates (a waste, but if mean filling or imputation could introduce bias, it's better not to use those samples).
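The drop-the-samples option could be sketched as follows, using plain NumPy. The arrays `covar_val` and `iid` are made-up stand-ins for a covariate matrix and its individual IDs, not objects from FaST-LMM. Whether the same individuals then also have to be removed by hand from the phenotype and SNP data depends on how the tool aligns individuals across inputs, which is worth verifying before relying on it.

```python
import numpy as np

# Made-up example: 3 individuals x 2 covariates (height, weight).
iid = np.array([["f0", "iid0"], ["f0", "iid2"], ["f0", "iid3"]])
covar_val = np.array([[np.nan, 0.17],
                      [0.55,   0.52],
                      [0.87,   np.nan]])

# Keep only the individuals whose covariates are all present.
complete = ~np.isnan(covar_val).any(axis=1)
covar_val_complete = covar_val[complete]
iid_complete = iid[complete]

print(iid_complete)        # IDs of the individuals that remain
print(covar_val_complete)  # their covariate rows, with no NaN left
```

In this tiny example two of the three individuals are lost, which illustrates the cost of this choice when missing values are spread across many samples.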
Thanks again for your help!
Bao-Li
From: Carl Kadie notifications@github.com Sent: Tuesday, February 18, 2020 3:36 PM To: fastlmm/FaST-LMM FaST-LMM@noreply.github.com Cc: Chang, Bao-Li baoli@pennmedicine.upenn.edu; Author author@noreply.github.com Subject: [External] Re: [fastlmm/FaST-LMM] How fastlmm deal with "NA" in covariate file? (#4)
Bao-Li,
Excellent suggestions. At the very least, I'll add a check for missing values and raise a clear error (in single_snp and single_snp_scale).
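Such a check might look roughly like the sketch below. This is only an illustration of the idea, not the code that was added to FaST-LMM; the function name `check_no_missing` and its parameters are hypothetical.

```python
import numpy as np

def check_no_missing(val, names, label="covariate"):
    """Raise ValueError if `val` (individuals x columns) contains any NaN."""
    val = np.asarray(val, dtype=float)
    bad_cols = np.isnan(val).any(axis=0)  # True for each column with a NaN
    if bad_cols.any():
        bad_names = [name for name, bad in zip(names, bad_cols) if bad]
        raise ValueError(
            f"Missing values found in {label} column(s): {', '.join(bad_names)}. "
            "Fill or remove them before running the analysis."
        )

# Complete data passes silently.
check_no_missing([[0.71, 0.17], [0.55, 0.52]], ["height", "weight"])

# Data with a NaN is rejected with a message naming the offending column.
try:
    check_no_missing([[np.nan, 0.17], [0.55, 0.52]], ["height", "weight"])
except ValueError as err:
    print(err)
```

Failing early with the column name leaves the choice between mean filling, imputation, and dropping samples to the user, as suggested above.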
From: baolichang <notifications@github.com> Sent: Thursday, February 20, 2020 10:50 AM To: fastlmm/FaST-LMM <FaST-LMM@noreply.github.com> Cc: Carl Kadie <carlk@msn.com>; State change <state_change@noreply.github.com> Subject: Re: [fastlmm/FaST-LMM] How fastlmm deal with "NA" in covariate file? (#4)