Optimize query to check that plant_id exists based on plant properties

This is a rough PR focused on optimizing the check during data upload that the plant record described for a given cell exists or not. This optimization is required to get some large-scale phenotypic data in. However, it is a bit rough and not ideal as we have moved development over to Tripal 4 and this is more a stop-gap measure to get these data in before the Tripal 4 version of this module is done.

Description

This PR optimizes the changed query in the following ways:

switches the project sub-query to a full join for better use of pheno_plant_project_project index.
breaks up the row identifier (previously $plot parameter) into its components as php is much faster at this than postgresql
switches row identifier plant properties to individual statements in the where clause so that the pheno_plant_stock, pheno_plantprop_multi and pheno_measurements_multi indices are used.
removes sub-queries looking up cvterms and instead hard-codes them 🙈 This should be changed to using variable_set at the beginning of the upload job (use same queries to look them up in chado for each job) and then variable_get here.

Note

In addition to these changes one should add the following indices:

CREATE INDEX pheno_plant_stock ON pheno_plant (stock_id);
CREATE INDEX pheno_plant_project_project ON pheno_plant_project (project_id);
CREATE INDEX pheno_plantprop_multi ON pheno_plantprop (plant_id, type_id);
CREATE INDEX pheno_measurements_multi ON pheno_measurements (plant_id, type_id, unit_id);

We are not doing this in the module code for Tripal 3 but indices will be made automatically in the Tripal 4 version of this module.

These indices are in use and fully tested on KnowPulse.

Testing

I do not have a test environment set up so this code has not yet been tested 🙈 Can you check this on a clone for me @reynoldtan and add the variable get/set as mentioned?

I did test the new query and indices on KnowPulse using postgresql analyze explain. The original query took ~700 ms and the new one takes 0.797ms on the same dataset and with the same parameters.

The specific query with parameters used to test was:

EXPLAIN ANALYZE SELECT                                                                                                          
      t1.plant_id
    FROM pheno_plant AS t1
      INNER JOIN pheno_plant_project AS proj USING (plant_id)
      INNER JOIN pheno_plantprop AS t2 USING(plant_id)
      INNER JOIN pheno_plantprop AS t3 USING(plant_id)
      INNER JOIN pheno_plantprop AS t4 USING(plant_id)
      INNER JOIN pheno_measurements AS t5 USING(plant_id)
    WHERE
      t1.stock_id = 44275 
      AND proj.project_id = 147
      AND t2.type_id = 3986
      AND t3.type_id = 3988
      AND t4.type_id = 3989
      AND t5.type_id = 3990
      AND t2.value = '1026' AND t3.value = '1' AND t4.value = 'Sutherland, Canada' AND SUBSTRING(t5.value, 1, 4) = '2020';

And the query builder response was

                                                                              QUERY PLAN                                                                               
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=78.09..623.75 rows=1 width=4) (actual time=0.316..0.686 rows=1 loops=1)
   Join Filter: (t1.plant_id = t5.plant_id)
   ->  Nested Loop  (cost=77.66..617.62 rows=1 width=32) (actual time=0.271..0.639 rows=1 loops=1)
         Join Filter: (t1.plant_id = t4.plant_id)
         ->  Nested Loop  (cost=77.24..615.38 rows=1 width=24) (actual time=0.264..0.631 rows=1 loops=1)
               Join Filter: (t1.plant_id = t2.plant_id)
               ->  Nested Loop  (cost=76.82..613.14 rows=1 width=16) (actual time=0.256..0.623 rows=1 loops=1)
                     Join Filter: (t1.plant_id = t3.plant_id)
                     ->  Hash Join  (cost=76.40..605.98 rows=1 width=8) (actual time=0.234..0.600 rows=1 loops=1)
                           Hash Cond: (proj.plant_id = t1.plant_id)
                           ->  Bitmap Heap Scan on pheno_plant_project proj  (cost=19.48..545.57 rows=927 width=4) (actual time=0.165..0.387 rows=952 loops=1)
                                 Recheck Cond: (project_id = 147)
                                 Heap Blocks: exact=7
                                 ->  Bitmap Index Scan on pheno_plant_project_project  (cost=0.00..19.25 rows=927 width=0) (actual time=0.153..0.153 rows=952 loops=1)
                                       Index Cond: (project_id = 147)
                           ->  Hash  (cost=56.73..56.73 rows=15 width=4) (actual time=0.024..0.024 rows=2 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Bitmap Heap Scan on pheno_plant t1  (cost=4.41..56.73 rows=15 width=4) (actual time=0.017..0.021 rows=2 loops=1)
                                       Recheck Cond: (stock_id = 44275)
                                       Heap Blocks: exact=2
                                       ->  Bitmap Index Scan on pheno_plant_stock  (cost=0.00..4.41 rows=15 width=0) (actual time=0.012..0.012 rows=2 loops=1)
                                             Index Cond: (stock_id = 44275)
                     ->  Index Scan using pheno_plantprop_multi on pheno_plantprop t3  (cost=0.42..7.15 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
                           Index Cond: ((plant_id = proj.plant_id) AND (type_id = 3988))
                           Filter: ((value)::text = '1'::text)
               ->  Index Scan using pheno_plantprop_multi on pheno_plantprop t2  (cost=0.42..2.22 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                     Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3986))
                     Filter: ((value)::text = '1026'::text)
         ->  Index Scan using pheno_plantprop_multi on pheno_plantprop t4  (cost=0.42..2.22 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3989))
               Filter: ((value)::text = 'Sutherland, Canada'::text)
   ->  Index Scan using pheno_measurements_multi on pheno_measurements t5  (cost=0.43..6.12 rows=1 width=8) (actual time=0.043..0.045 rows=1 loops=1)
         Index Cond: ((plant_id = t3.plant_id) AND (type_id = 3990))
         Filter: ("substring"((value)::text, 1, 4) = '2020'::text)
 Planning time: 7.019 ms
 Execution time: 0.797 ms

UofS-Pulse-Binfo / rawphenotypes