davmlaw opened 3 years ago
Cohorts are a really good way of managing this; it's just that the DB is getting big and the old data is messy and disorganised.
Sometimes if you rearrange a query it's way faster.
E.g. with cohorts you have to process a dozen massive VCFs, then start with 4M variants, then restrict to a gene.
It's faster to START with the ~200 variants in a gene, THEN check lots of VCFs / patients etc. for certain phenotypes.
Somewhat similar to #32 (Samples that have a variant in the gene)
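The re-ordering idea can be sketched in plain Python (toy data, not the real Django models): intersect the small set of gene variants with each sample's variants, instead of scanning the full 4M-variant table first. All the variant IDs and VCF names here are made up for illustration.

```python
# Toy illustration of why re-ordering the query helps: start from the
# small per-gene variant set and check each VCF against it, rather than
# starting from the full ~4M-row variant table.

gene_variants = {10, 42, 99, 123_456}   # the ~200 variants in one gene (toy IDs)
vcf_variants = {                        # hypothetical variants seen per VCF
    "vcf_a": {42, 7, 500},
    "vcf_b": {99, 123_456, 8},
    "vcf_c": {1, 2, 3},
}

# For each VCF, intersect with the small gene set only.
hits = {
    vcf: gene_variants & variants
    for vcf, variants in vcf_variants.items()
    if gene_variants & variants
}
print(hits)  # which VCFs carry a gene variant, and which ones
```

The work per VCF is proportional to the small gene set, which is the same intuition behind starting the node query from the gene rather than from all variants.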
As a quick test, I took an all variants node = GATA2, then filtered to ClinVar LP/P = 59 variants.
Then the 60 VCFs from GA:
```python
vcfs = VCF.objects.filter(project__name='Genomic Autopsy (Australian Perinatal Death Study)')

In [15]: vcfs.count()
Out[15]: 60
```
Then filter to variants in those VCFs (regardless of zygosity):
```python
qs2 = qs.filter(cohortgenotype__collection__cohort__vcf__in=vcfs)

In [19]: %time qs2.count()
CPU times: user 5.43 ms, sys: 2.93 ms, total: 8.37 ms
Wall time: 2 s
Out[19]: 1
```
People are often interested in gene X: do we have any patients with interesting variants?
At the moment we can do it using the all variants node, but then you end up having to go through the ~200 variants by hand; most of the time they're in control samples or in samples with totally unrelated phenotypes etc.
So basically, when he asks that, he means: can you look at the affecteds in certain cohorts, not the parents or marry-ins.
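A minimal sketch of that last filtering step, in plain Python rather than the real models (the `affected` and `relationship` fields are hypothetical, just to show the shape of the filter): once you have the samples carrying a gene variant, keep only affected cohort members and drop parents and marry-ins.

```python
# Hypothetical sample records; field names are illustrative only,
# not the project's actual schema.
samples = [
    {"name": "proband_1", "affected": True,  "relationship": "proband"},
    {"name": "mother_1",  "affected": False, "relationship": "parent"},
    {"name": "father_1",  "affected": False, "relationship": "parent"},
    {"name": "spouse_2",  "affected": False, "relationship": "marry-in"},
    {"name": "sibling_2", "affected": True,  "relationship": "sibling"},
]

# Keep affecteds only, excluding parents / marry-ins.
affecteds = [
    s["name"]
    for s in samples
    if s["affected"] and s["relationship"] not in ("parent", "marry-in")
]
print(affecteds)  # ['proband_1', 'sibling_2']
```

In the real DB this would presumably be another `.filter(...)` on the sample/patient relation, but the exact lookups depend on how affected status is stored.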