brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

subset unrelated samples for large cohorts #32

Closed: brentp closed this issue 4 years ago

brentp commented 4 years ago

from #31

with more than 2K samples, the html output becomes nearly unusable. but in large cohorts, nearly all pairs of samples will be unrelated. we can sub-sample the pairs that are expected to be unrelated and also appear unrelated by phenotype.

this will reduce memory usage and make the somalier html output useful for huge cohorts. it will require a substantial change in the html, since that currently expects the full all-vs-all matrix; it will need a sparse representation instead.
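to make the idea concrete, here is a minimal sketch in Python (not somalier's actual Nim implementation, and not necessarily what any release does). it assumes relatedness values are available as dicts keyed by sample pair; the function name `subsample_pairs`, the `cutoff` and `keep_fraction` parameters, and the record layout are all hypothetical. every pair that looks related, either by expectation or by observation, is kept, and only a random fraction of the unrelated pairs is retained, stored as a sparse list of records rather than a full matrix.

```python
import random
from itertools import combinations


def subsample_pairs(samples, observed, expected,
                    keep_fraction=0.01, cutoff=0.05, seed=42):
    """Return a sparse list of (a, b, observed, expected) records.

    Hypothetical sketch: `observed` and `expected` map (a, b) tuples to a
    relatedness value, with a before b in the order of `samples`.
    Pairs missing from a dict are treated as unrelated (0.0).
    """
    rng = random.Random(seed)
    kept = []
    for a, b in combinations(samples, 2):
        obs = observed.get((a, b), 0.0)
        exp = expected.get((a, b), 0.0)
        # A pair is only a candidate for dropping when it is unrelated
        # both by expectation and by observation.
        unrelated = obs < cutoff and exp < cutoff
        if unrelated and rng.random() >= keep_fraction:
            continue
        kept.append((a, b, obs, exp))
    return kept
```

with this layout, an unexpected relationship (observed high, expected low) or a possible sample swap (expected high, observed low) is never dropped, because only pairs that are unrelated on both measures are sub-sampled. the size of the output then grows with the number of interesting pairs plus a chosen fraction of the rest, instead of with the square of the cohort size.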

brentp commented 4 years ago

this is done in the v0.2.6 release