CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License

feat: Simphenotype and Index Repeat Support #209

Closed by mlamkin7 1 year ago

mlamkin7 commented 1 year ago

We've updated the simphenotype and index subcommands to support a new line type in the hap file: "R", which stands for repeats.

Usage in a sorted hap file (tests/data/basic.hap.gz):

#       version 0.1.0
H       21      26928472        26941960        chr21.q.3365*1
R       21      26938353        26938400        21_26938353_STR
H       21      26938353        26938989        chr21.q.3365*11
H       21      26938989        26941960        chr21.q.3365*10
R       21      26939000        26939010        21_26938989_STR
R       21      26941880        26941900        21_26941880_STR
V       chr21.q.3365*1  26928472        26928472        21_26928472_C_A C
V       chr21.q.3365*1  26938353        26938353        21_26938353_T_C T
V       chr21.q.3365*1  26940815        26940815        21_26940815_T_C C
V       chr21.q.3365*1  26941960        26941960        21_26941960_A_G G
V       chr21.q.3365*10 26938989        26938989        21_26938989_G_A A
V       chr21.q.3365*10 26940815        26940815        21_26940815_T_C T
V       chr21.q.3365*10 26941960        26941960        21_26941960_A_G A
V       chr21.q.3365*11 26938353        26938353        21_26938353_T_C T
V       chr21.q.3365*11 26938989        26938989        21_26938989_G_A A
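
To illustrate how the new "R" lines sit alongside "H" lines, here is a minimal Python sketch that pulls the haplotype and repeat IDs out of a few lines copied from the listing above. It is not haptools code: the `Region` tuple and the `read_regions` helper are hypothetical names introduced only for this example, and it assumes whitespace-separated fields in the order shown above.

```python
from collections import namedtuple

# Hypothetical container for this sketch; not a haptools class.
Region = namedtuple("Region", "kind chrom start end id")


def read_regions(lines):
    """Collect H (haplotype) and R (repeat) lines; skip comments and V lines."""
    regions = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] not in ("H", "R"):
            continue
        kind, chrom, start, end, region_id = fields[:5]
        regions.append(Region(kind, chrom, int(start), int(end), region_id))
    return regions


# A few lines copied from the example hap file above.
example = """\
#       version 0.1.0
H       21      26928472        26941960        chr21.q.3365*1
R       21      26938353        26938400        21_26938353_STR
H       21      26938353        26938989        chr21.q.3365*11
""".splitlines()

regions = read_regions(example)
repeat_ids = [r.id for r in regions if r.kind == "R"]
haplotype_ids = [r.id for r in regions if r.kind == "H"]
print(repeat_ids)
print(haplotype_ids)
```

For real files, the library's own hap file reader is the better choice; this sketch only shows that "R" lines share the chromosome, start, end, and ID layout of "H" lines in the example.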

Along with these changes, simphenotype's PhenoSimulator class has been updated, particularly the run() function. Instead of taking a list of haplotypes, it now takes the full Haplotypes object as well as the IDs of the haplotypes and repeats, which it uses to extract betas and genotypes.
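
As a rough illustration of the idea of looking up betas and genotypes by ID, the sketch below sums beta times genotype over a set of IDs for each sample. This is not the PhenoSimulator implementation and does not reflect the actual signature of run(); the function `genetic_component`, the dictionaries, and the IDs `hapA` and `strB` are all hypothetical, and the additive model is an assumption made for the example.

```python
def genetic_component(ids, betas, genotypes):
    """Sum beta * genotype over the selected IDs for each sample.

    ids: IDs of the haplotypes and repeats to include (hypothetical)
    betas: dict mapping ID -> effect size
    genotypes: dict mapping ID -> list of per-sample genotype values
    """
    n_samples = len(next(iter(genotypes.values())))
    totals = [0.0] * n_samples
    for id_ in ids:
        for i, g in enumerate(genotypes[id_]):
            totals[i] += betas[id_] * g
    return totals


# Made-up values: one haplotype and one repeat, three samples.
betas = {"hapA": 0.5, "strB": -0.25}
genotypes = {"hapA": [0, 1, 2], "strB": [4, 0, 2]}

totals = genetic_component(["hapA", "strB"], betas, genotypes)
print(totals)  # [-1.0, 0.5, 0.5]
```

Selecting by ID means haplotypes and repeats can be treated the same way once their betas and genotypes are available, which is the motivation suggested by the change described above.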

To use repeats in simphenotype, use the additional --repeats option. Example:

haptools simphenotype --repeats repeats.vcf snps.vcf snps_and_repeats.hap

Note that in the example the SNPs must still be present, so we cannot simulate based on repeats alone.