Open pvannierop opened 1 year ago
Just referencing #4345 where there were explorations and discussion of this topic previously Other related issues/PRs:
This topic is slightly different because it relates to a different table. One approach we took back then was reducing the indexing to be based not on the full field but only to a limited prefix of the full field. That might be another approach with this table.
But I think a fundamental fix of this problem would involve (as a project) making specifications about what database system configuration options we want to support for deployers. Do we wish to be friendly to non-mysql RDBMS? Do we wish to be friendly to deployments using character sets other than latin1? Do we wish to function well without foreign key constraints? Based on those decisions we could restrict our indexes and foreign keys so that we don't exceed the bit limit for any configuration we want to be compatible with. At MSK we chose to make our deployment have all text fields use a utf-8 encoding and all tables use the INNODB engine (at last at a point in the past that was the design). It gets tricky creating foreign keys when using a mixture of engines. There was a requirement that a foreign key must connect two tables which the same engine. So for compatibility we went for uniformity. We also had non-ascii characters in some of our tables at the time, so utf-8 seemed wise.
For deciding on the next steps to address this problem, I would recommend against committing to a latin1 encoding. There is currently an index in the mutation_event table which includes field TUMOR_SEQ_ALLELE of type "text". We have seen values (occasionally for very long insertions) in this field with over 5000 nucleotides (characters). I think sometimes we will have a value in a field which is very long, but we will also want to compute an index which includes the field in some way for efficient search/lookup/uniqueness checks. The approach we took for TUMOR_SEQ_ALLELE was to limit the indexing to only the first X (= 240 in this case ) characters of any value in the field. I suggest a similar approach for clinical_event_value. If we limit the indexing to a (large) prefix of the possible field value, we could still support utf-8 encoding.
Problem
When upgrading to MySQL 8 there is an error thrown by the schema provisioning step:
Cause
There is a byte length limit on keys/indexes. The default character encoding in MySQL 5 (
latin1
) uses 3 bytes for a character. The default character encoding in MySQL 8 (utf8mb4
) uses 4 bytes for a character. The index on the _clinical_eventdata.VALUE column of type VARCHAR(5000) (!) will hit the max byte length for the key when defined inutf8mb4
encoding.Temporary fix
Adding --character-set-server=latin1 and _--collation-server=latin1_swedishci to the startup parameters of MySQL 8 will switch back to
latin
thereby suppressing the error.Long term fix
The fundamental problem is that there is an index on the string _clinical_eventdata.VALUE column that is
VARCHAR(5000)
. AFAIKS This index is used in a very strange way in TreatmentMapper.xml:Notice that there is a join
clinical_event_data.VALUE = sample.STABLE_ID
whenclinical_event_data.KEY = 'SAMPLE_ID'
. Obviously, there is a need to connect a sample (only one sample allowed when defined this way) to a clinical event.I think this is incorrect database design. This _clinicalevent --> sample relationship should be captured in a separate linking
_clinical_event_to_sample_
table.This design will capture the event to sample relationship in a recognizable way and remove the need for the index on the VALUE column. Also, it will allow for one or more samples to be connected to a clinical event, which makes more sense IMO.