cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0
662 stars 527 forks source link

Problem with character set encoding in MySQL 8 and index on clinical_event_data.VALUE field #10411

Open pvannierop opened 1 year ago

pvannierop commented 1 year ago

Problem

When upgrading to MySQL 8 there is an error thrown by the schema provisioning step:

Running statements for version: 2.13.1
    Executing statement: ALTER TABLE `clinical_event_data` MODIFY COLUMN `VALUE` varchar(3000) NOT NULL;
    Executing statement: CREATE INDEX idx_clinical_event_key ON clinical_event_data (`KEY`);
    Executing statement: CREATE INDEX idx_clinical_event_value ON clinical_event_data (`VALUE`);
(1071, 'Specified key was too long; max key length is 3072 bytes')

Cause

There is a byte length limit on keys/indexes. The default character encoding in MySQL 5 (latin1) uses 3 bytes for a character. The default character encoding in MySQL 8 (utf8mb4) uses 4 bytes for a character. The index on the _clinical_eventdata.VALUE column of type VARCHAR(5000) (!) will hit the max byte length for the key when defined in utf8mb4 encoding.

Temporary fix

Adding --character-set-server=latin1 and _--collation-server=latin1_swedishci to the startup parameters of MySQL 8 will switch back to latin thereby suppressing the error.

Long term fix

The fundamental problem is that there is an index on the string _clinical_eventdata.VALUE column that is VARCHAR(5000). AFAIKS This index is used in a very strange way in TreatmentMapper.xml:

 <select id="getAllSamples" resultType="org.cbioportal.model.ClinicalEventSample">
        SELECT
            sample.STABLE_ID as sampleId,
            patient.STABLE_ID as patientId,
            cancer_study.CANCER_STUDY_IDENTIFIER as studyId,
            clinical_event.START_DATE as timeTaken
        FROM
            clinical_event
            INNER JOIN clinical_event_data ON clinical_event.CLINICAL_EVENT_ID = clinical_event_data.CLINICAL_EVENT_ID
            INNER JOIN patient ON clinical_event.PATIENT_ID = patient.INTERNAL_ID
            INNER JOIN sample ON clinical_event_data.VALUE = sample.STABLE_ID
            INNER JOIN cancer_study ON patient.CANCER_STUDY_ID = cancer_study.CANCER_STUDY_ID
        <include refid="where"/>
            AND clinical_event_data.KEY = 'SAMPLE_ID'
            AND (clinical_event.EVENT_TYPE LIKE 'Sample Acquisition' OR clinical_event.EVENT_TYPE LIKE 'SPECIMEN')
    </select>

Notice that there is a join clinical_event_data.VALUE = sample.STABLE_ID when clinical_event_data.KEY = 'SAMPLE_ID'. Obviously, there is a need to connect a sample (only one sample allowed when defined this way) to a clinical event.

I think this is incorrect database design. This _clinicalevent --> sample relationship should be captured in a separate linking _clinical_event_to_sample_ table.

This design will capture the event to sample relationship in a recognizable way and remove the need for the index on the VALUE column. Also, it will allow for one or more samples to be connected to a clinical event, which makes more sense IMO.

sheridancbio commented 1 year ago

Just referencing #4345 where there were explorations and discussion of this topic previously Other related issues/PRs:

1801

5765

sheridancbio commented 1 year ago

This topic is slightly different because it relates to a different table. One approach we took back then was reducing the indexing to be based not on the full field but only to a limited prefix of the full field. That might be another approach with this table.

But I think a fundamental fix of this problem would involve (as a project) making specifications about what database system configuration options we want to support for deployers. Do we wish to be friendly to non-mysql RDBMS? Do we wish to be friendly to deployments using character sets other than latin1? Do we wish to function well without foreign key constraints? Based on those decisions we could restrict our indexes and foreign keys so that we don't exceed the bit limit for any configuration we want to be compatible with. At MSK we chose to make our deployment have all text fields use a utf-8 encoding and all tables use the INNODB engine (at last at a point in the past that was the design). It gets tricky creating foreign keys when using a mixture of engines. There was a requirement that a foreign key must connect two tables which the same engine. So for compatibility we went for uniformity. We also had non-ascii characters in some of our tables at the time, so utf-8 seemed wise.

sheridancbio commented 1 year ago

For deciding on the next steps to address this problem, I would recommend against committing to a latin1 encoding. There is currently an index in the mutation_event table which includes field TUMOR_SEQ_ALLELE of type "text". We have seen values (occasionally for very long insertions) in this field with over 5000 nucleotides (characters). I think sometimes we will have a value in a field which is very long, but we will also want to compute an index which includes the field in some way for efficient search/lookup/uniqueness checks. The approach we took for TUMOR_SEQ_ALLELE was to limit the indexing to only the first X (= 240 in this case ) characters of any value in the field. I suggest a similar approach for clinical_event_value. If we limit the indexing to a (large) prefix of the possible field value, we could still support utf-8 encoding.