Open murphyke opened 10 years ago
Expect a roughly six-fold performance improvement for this query if all SQL redundancy is removed. The time for psql to execute all the SQL queries behind this Varify query over the local network is 42.8 sec. If all of the redundant queries are removed via | sort | uniq
, the time taken is 7.2 sec. The database server cache was warm in both cases. The real-world effect including Python and caching overhead remains to be seen.
A typical Varify query in our installation generates between five and six thousand SQL queries to show the initial 20 HTML results.
Varify could benefit greatly from some caching or elimination of redundant database lookups. (If there is caching in place that doesn't seem to be working, could it possibly be due to memcached's default 1MB object limit or a too-small MAX_ENTRIES value in the Django CACHES setting?)
The main problem is highly redundant pmid queries. These add up to about the same as the big money queries (the top 3 slowest) and in terms of overall processing time may be significantly more costly.
I doubt the details of the query matter, but my test query was as follows:
Sample is one of the following: P-PSeq_0038-P-A from project U01 (CPF1308003) P-Pseq_0043-P-A from project U01 (CPF1309006) P-Pseq_0044-P-A from project U01 (CPF1309006) P-Pseq_0045-P-A from project U01 (CPF1309006) 1007 from project U01 (NEMRCD) 1007p1 from project U01 (NEMRCD) P-PSeq0065-P-A from project U01 (Pseq_batch17) Gene Symbol is BRCA1 Effect is Non-Synonymous Coding
SQL was collected for the first page of HTML results.
The biggest offenders were the pmid queries:
1800 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7061 960 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7888 360 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7887 240 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7374 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 8044 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7609 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 6878 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 12332 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11075 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11074 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11070 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 10401 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 10397 120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "gene_pubmed" ON ("pubmed"."pmid" = "gene_pubmed"."pubmed_id") WHERE "gene_pubmed"."gene_id" = 2071 6 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254416 5 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254415 5 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254414 3 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 1108096 1 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254403
There are also a number of queries that are each executed 120 times, e.g.:
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75417 20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75416 20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75415 20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75414 20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75413 20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75412
These are similar redundant queries for effect, gene, phenotype, pmid, and effect_impact.
I don't really understand the redundancy patterns. E.g. for gene, we have:
120 SELECT "gene"."id", "gene"."chr_id", "gene"."symbol", "gene"."name", "gene"."hgnc_id" FROM "gene" WHERE "gene"."id" = 2071
whereas for phenotype we have:
120 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" INNER JOIN "gene_phenotype" ON ("phenotype"."id" = "gene_phenotype"."phenotype_id") WHERE "gene_phenotype"."gene_id" = 2071 6 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11075 5 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11074 5 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11070 3 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 12332 1 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 7887 1 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 7061