cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0
630 stars 481 forks source link

Optimize study view clinical data search with denormalized view #10812

Closed alisman closed 2 months ago

alisman commented 4 months ago

The main bottlenecks for clinical data tab are

  1. live JOINs
  2. search

On genie dataset (190k samples), a search takes ~30s. These optimizations bring that down to 6s.

This PR addresses these bottlenecks by

  1. Adding a new table (spec below) which precomputes joins.
  2. Adding indexes (spec below) and using fulltext index on ATTR_VALUE and corresponding
CREATE TABLE clinical_data_search_mv (
                                         ATTR_VALUE varchar(255),
                                         PATIENT_INTERNAL_ID varchar(255),
                                         SAMPLE_INTERNAL_ID varchar(255),
                                         PATIENT_STABLE_ID varchar(255),
                                         SAMPLE_STABLE_ID varchar(255),
                                         CANCER_STUDY_IDENTIFIER varchar(255)
);

#populate table with data
INSERT INTO clinical_data_search_mv (
    ATTR_VALUE, PATIENT_INTERNAL_ID, SAMPLE_INTERNAL_ID, PATIENT_STABLE_ID, SAMPLE_STABLE_ID, CANCER_STUDY_IDENTIFIER
)
    (SELECT DISTINCT cp.ATTR_VALUE           as ATTR_VALUE,
                     p.INTERNAL_ID           as PATIENT_INTERNAL_ID,
                     s.INTERNAL_ID           as SAMPLE_INTERNAL_ID,
                     p.STABLE_ID             as PATIENT_STABLE_ID,
                     s.STABLE_ID             as SAMPLE_STABLE_ID,
                     CANCER_STUDY_IDENTIFIER as CANCER_STUDY_IDENTIFIER
     FROM clinical_patient cp
              JOIN patient p on cp.INTERNAL_ID = p.INTERNAL_ID
              JOIN sample s on s.PATIENT_ID = p.INTERNAL_ID
              JOIN cancer_study cs on p.CANCER_STUDY_ID = cs.CANCER_STUDY_ID
     WHERE NOT REGEXP_LIKE(cp.ATTR_VALUE, '^-?[0-9.]+$')
     UNION
     SELECT DISTINCT cs.ATTR_VALUE             as ATTR_VALUE,
                     p.INTERNAL_ID             as PATIENT_INTERNAL_ID,
                     s.INTERNAL_ID             as SAMPLE_INTERNAL_ID,
                     p.STABLE_ID               as PATIENT_INTERNAL_ID,
                     s.STABLE_ID               as SAMPLE_STABLE_ID,
                     c.CANCER_STUDY_IDENTIFIER as CANCER_STUDY_IDENTIFIER
     FROM clinical_sample cs
              LEFT JOIN sample s on cs.INTERNAL_ID = s.INTERNAL_ID
              LEFT JOIN patient p on s.PATIENT_ID = p.INTERNAL_ID
              JOIN cancer_study c on p.CANCER_STUDY_ID = c.CANCER_STUDY_ID
     WHERE NOT REGEXP_LIKE(cs.ATTR_VALUE, '^-?[0-9.]+$')

    );

#create necessary indexes
create fulltext index clinical_data_search_mv_ATTR_VALUE_index
    on clinical_data_search_mv (ATTR_VALUE);

create index clinical_data_search_mv_CANCER_STUDY_IDENTIFIER_index
    on clinical_data_search_mv (CANCER_STUDY_IDENTIFIER);
alisman commented 4 months ago

@inodb @sheridancbio the strategy here will require that a new table be populated whenever data is updated. what is the right way to script this so it always happens? likewise, what is protocol for doing schema updates?

inodb commented 4 months ago

@alisman after this you could run updateDenormalizedClinicalDataViews(): https://github.com/cBioPortal/cbioportal-core/blob/main/src/main/java/org/mskcc/cbio/portal/scripts/ImportClinicalData.java#L170