Inspecting (possibly best-call) imputed parental genotypes?

sjoerdvanalten commented 3 years ago

Hi Alex,

Is there any straightforward way to extract the parental genotypes imputed in impute_runner.py, and stored in the .hdf5 format, to a more conventional format that can be read by software such as PLINK? I would be happy with either a format that supports dosages (e.g. vcf), or best-called genotypes (e.g. .bed).

If not, I might attempt to write such a routine myself, but any leads would be greatly appreciated.

Kind regards,

Sjoerd

AlexTISYoung commented 3 years ago

Hi Sjoerd,

It is possible to read the imputed parental genotypes from the HDF5 file using packages such as rhdf5 for R or h5py for Python. The imputed parental genotypes are stored in the dataset 'imputed_par_gts' in the HDF5 file. This is an [N_fam x L] matrix, where N_fam is the number of families for which parental genotypes are imputed and L is the number of SNPs. The family IDs for each genotyped individual are stored in the inferred pedigree in the HDF5 file (dataset 'pedigree'). For example, if we have two siblings, we would assign them a family ID '0', which would be recorded in the 'pedigree' dataset, and the imputed parental genotypes for these siblings would be stored in the row of the 'imputed_par_gts' matrix corresponding to family ID '0'. The family ID of each row is recorded in the dataset 'families' in the HDF5 file. Note that, for imputation from siblings, we store only one imputed parental genotype value for each SNP, which is the same for both mothers and fathers, whereas for imputation from families with one parent genotyped (i.e. we are imputing the missing parent), we store the imputed genotype for the missing parent.
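As a rough illustration of the layout described above, here is a minimal Python sketch that reads 'imputed_par_gts' and 'families' with h5py and writes them to a plain tab-separated table, one row per family. Only the dataset names come from this thread. The function names, the assumption that 'families' is a one-dimensional array of IDs, the placeholder SNP names, and the output layout are all hypothetical. The result is a generic text table, not a VCF or PLINK file.

```python
import csv


def load_imputed(hdf5_path):
    """Read family IDs and imputed parental genotypes from the HDF5 file."""
    # h5py is a third-party package; it is imported here so that the
    # table-writing helper below can be used without it.
    import h5py

    with h5py.File(hdf5_path, "r") as f:
        gts = f["imputed_par_gts"][()]   # [N_fam x L] matrix
        families = f["families"][()]     # assumed: one family ID per row
    # IDs may come back as bytes; convert them to plain strings.
    fam_ids = [x.decode() if isinstance(x, bytes) else str(x) for x in families]
    return fam_ids, gts.tolist()


def write_dosage_table(fam_ids, gts, out_path, snp_ids=None):
    """Write one row per family: family ID followed by the imputed values."""
    n_snps = len(gts[0]) if gts else 0
    if snp_ids is None:
        # Placeholder column names; real SNP IDs would have to be supplied.
        snp_ids = [f"snp{j}" for j in range(n_snps)]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["FID"] + list(snp_ids))
        for fid, row in zip(fam_ids, gts):
            # Keep the values as non-integer dosages; no rounding to best calls.
            writer.writerow([fid] + [f"{g:.4f}" for g in row])
```

With a hypothetical file name, usage would look like `fam_ids, gts = load_imputed("imputed.hdf5")` followed by `write_dosage_table(fam_ids, gts, "imputed_parental.tsv")`. Linking rows to individuals via the 'pedigree' dataset is left out because its exact column layout is not described in this thread.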

We do not currently offer support for converting the imputed parental genotypes to a format such as VCF or PLINK. I would not recommend trying to convert them to best-called genotypes. The imputed parental genotypes will not, in general, be close to integer values, unless they can be determined exactly from the observed genotypes in the family, which is only possible when we observe a sibling pair in an IBD0 state. Converting to best-called genotypes would likely lead to bias in downstream analyses.
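A toy calculation shows how much rounding can change non-integer imputed values. The numbers below are made up for illustration and do not come from snipar output. The example only shows that best-calling alters the values and their mean; it is not a demonstration of the bias in any particular downstream analysis.

```python
# Made-up imputed parental genotypes for one SNP across four families.
imputed = [0.4, 0.4, 0.4, 1.6]

# "Best-called" genotypes: round each imputed value to the nearest integer.
best_called = [round(g) for g in imputed]

mean_imputed = sum(imputed) / len(imputed)          # about 0.7
mean_called = sum(best_called) / len(best_called)   # 0.5

print(best_called, mean_imputed, mean_called)
```

Here three of the four families are called as 0 even though each imputed value is 0.4, so the mean falls from about 0.7 to 0.5 after rounding.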

Let me know if you have any further questions.

Thanks,

Alex.

sjoerdvanalten commented 3 years ago

Thanks for your elaborate (and quick) answer. These tips should be sufficient for me to get the job done.

AlexTISYoung / snipar

Inspecting (possibly best-call) imputed parental genotypes? #15