A Tripal module for storing raw phenotypic data. Specifically meant to help researchers contribute raw data, visualize summaries and download for further analysis.
2
stars
2
forks
source link
Optimize query to check that plant_id exists based on plant properties #102
This is a rough PR focused on optimizing the check during data upload that the plant record described for a given cell exists or not. This optimization is required to get some large-scale phenotypic data in. However, it is a bit rough and not ideal as we have moved development over to Tripal 4 and this is more a stop-gap measure to get these data in before the Tripal 4 version of this module is done.
Description
This PR optimizes the changed query in the following ways:
switches the project sub-query to a full join for better use of pheno_plant_project_project index.
breaks up the row identifier (previously $plot parameter) into its components as php is much faster at this than postgresql
switches row identifier plant properties to individual statements in the where clause so that the pheno_plant_stock, pheno_plantprop_multi and pheno_measurements_multi indices are used.
removes sub-queries looking up cvterms and instead hard-codes them 🙈 This should be changed to using variable_set at the beginning of the upload job (use same queries to look them up in chado for each job) and then variable_get here.
Note
In addition to these changes one should add the following indices:
CREATE INDEX pheno_plant_stock ON pheno_plant (stock_id);
CREATE INDEX pheno_plant_project_project ON pheno_plant_project (project_id);
CREATE INDEX pheno_plantprop_multi ON pheno_plantprop (plant_id, type_id);
CREATE INDEX pheno_measurements_multi ON pheno_measurements (plant_id, type_id, unit_id);
We are not doing this in the module code for Tripal 3 but indices will be made automatically in the Tripal 4 version of this module.
These indices are in use and fully tested on KnowPulse.
Testing
I do not have a test environment set up so this code has not yet been tested 🙈 Can you check this on a clone for me @reynoldtan and add the variable get/set as mentioned?
I did test the new query and indices on KnowPulse using postgresql analyze explain. The original query took ~700 ms and the new one takes 0.797ms on the same dataset and with the same parameters.
The specific query with parameters used to test was:
EXPLAIN ANALYZE SELECT
t1.plant_id
FROM pheno_plant AS t1
INNER JOIN pheno_plant_project AS proj USING (plant_id)
INNER JOIN pheno_plantprop AS t2 USING(plant_id)
INNER JOIN pheno_plantprop AS t3 USING(plant_id)
INNER JOIN pheno_plantprop AS t4 USING(plant_id)
INNER JOIN pheno_measurements AS t5 USING(plant_id)
WHERE
t1.stock_id = 44275
AND proj.project_id = 147
AND t2.type_id = 3986
AND t3.type_id = 3988
AND t4.type_id = 3989
AND t5.type_id = 3990
AND t2.value = '1026' AND t3.value = '1' AND t4.value = 'Sutherland, Canada' AND SUBSTRING(t5.value, 1, 4) = '2020';
And the query builder response was
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=78.09..623.75 rows=1 width=4) (actual time=0.316..0.686 rows=1 loops=1)
Join Filter: (t1.plant_id = t5.plant_id)
-> Nested Loop (cost=77.66..617.62 rows=1 width=32) (actual time=0.271..0.639 rows=1 loops=1)
Join Filter: (t1.plant_id = t4.plant_id)
-> Nested Loop (cost=77.24..615.38 rows=1 width=24) (actual time=0.264..0.631 rows=1 loops=1)
Join Filter: (t1.plant_id = t2.plant_id)
-> Nested Loop (cost=76.82..613.14 rows=1 width=16) (actual time=0.256..0.623 rows=1 loops=1)
Join Filter: (t1.plant_id = t3.plant_id)
-> Hash Join (cost=76.40..605.98 rows=1 width=8) (actual time=0.234..0.600 rows=1 loops=1)
Hash Cond: (proj.plant_id = t1.plant_id)
-> Bitmap Heap Scan on pheno_plant_project proj (cost=19.48..545.57 rows=927 width=4) (actual time=0.165..0.387 rows=952 loops=1)
Recheck Cond: (project_id = 147)
Heap Blocks: exact=7
-> Bitmap Index Scan on pheno_plant_project_project (cost=0.00..19.25 rows=927 width=0) (actual time=0.153..0.153 rows=952 loops=1)
Index Cond: (project_id = 147)
-> Hash (cost=56.73..56.73 rows=15 width=4) (actual time=0.024..0.024 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Bitmap Heap Scan on pheno_plant t1 (cost=4.41..56.73 rows=15 width=4) (actual time=0.017..0.021 rows=2 loops=1)
Recheck Cond: (stock_id = 44275)
Heap Blocks: exact=2
-> Bitmap Index Scan on pheno_plant_stock (cost=0.00..4.41 rows=15 width=0) (actual time=0.012..0.012 rows=2 loops=1)
Index Cond: (stock_id = 44275)
-> Index Scan using pheno_plantprop_multi on pheno_plantprop t3 (cost=0.42..7.15 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
Index Cond: ((plant_id = proj.plant_id) AND (type_id = 3988))
Filter: ((value)::text = '1'::text)
-> Index Scan using pheno_plantprop_multi on pheno_plantprop t2 (cost=0.42..2.22 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3986))
Filter: ((value)::text = '1026'::text)
-> Index Scan using pheno_plantprop_multi on pheno_plantprop t4 (cost=0.42..2.22 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3989))
Filter: ((value)::text = 'Sutherland, Canada'::text)
-> Index Scan using pheno_measurements_multi on pheno_measurements t5 (cost=0.43..6.12 rows=1 width=8) (actual time=0.043..0.045 rows=1 loops=1)
Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3990))
Filter: ("substring"((value)::text, 1, 4) = '2020'::text)
Planning time: 7.019 ms
Execution time: 0.797 ms
This is a rough PR focused on optimizing the check during data upload that the plant record described for a given cell exists or not. This optimization is required to get some large-scale phenotypic data in. However, it is a bit rough and not ideal as we have moved development over to Tripal 4 and this is more a stop-gap measure to get these data in before the Tripal 4 version of this module is done.
Description
This PR optimizes the changed query in the following ways:
Note
In addition to these changes one should add the following indices:
We are not doing this in the module code for Tripal 3 but indices will be made automatically in the Tripal 4 version of this module.
These indices are in use and fully tested on KnowPulse.
Testing
I do not have a test environment set up so this code has not yet been tested 🙈 Can you check this on a clone for me @reynoldtan and add the variable get/set as mentioned?
I did test the new query and indices on KnowPulse using postgresql analyze explain. The original query took ~700 ms and the new one takes 0.797ms on the same dataset and with the same parameters.
The specific query with parameters used to test was:
And the query builder response was