Database schema is incompatible with utf8, the default in MySQL 8

fedde-s commented 6 years ago

As found out in #2478, the mutation_event table contains a unique key too large to be accepted by MySQL when using the UTF-8 character set.

Since this is the default encoding in the current release of MySQL, I have documented a recommendation to stick to MySQL 5.7 in #4333, along with a hint at why in case users do want to tweak the configuration.

However, I feel that we should adjust the schema to support UTF-8 and MySQL 8.x out of the box.

jjgao commented 6 years ago

Thanks, @fedde-s.

@khzhu you mentioned that you are using MySQL8. Would you mind sending your changes as a pull request?

cc'ing @n1zea144 and @sheridancbio

khzhu commented 6 years ago

@jjgao, @fedde-s , no problems, I will create pull request to address this. Thank you!

jjgao commented 6 years ago

Thank you, @khzhu

sheridancbio commented 6 years ago

Just catching up with this issue again. It sounds like we are moving towards reducing the tumor_seq_allele field width to 255 characters (which is something we did in our production databases, and diverges from the schema specified in cgds.sql). I think this is probably fine, but I think the reason the length of this field was originally raised above 255 was to capture some very long insertion / deletion events that we were getting through sequencing. I think @angelicaochoa made these adjustments in response to some missing or truncated mutation events we were processing.

I guess in conjunction with this reduction to 255 we should ask the question : what do we do when a mutation event is imported that has an insertion of more than 255 nucleotides? We could make this part of the study validator ... that no mutation event contains an indel with more than this limit of nucleotides reported. Maybe an insert above a certain length gets converted into a structural variant? Or we can store very long insertions in a separate table perhaps? We probably should not be silently truncating or discarding such events. This discussion should probably be in a separate issue. cc: @n1zea144

fedde-s commented 6 years ago

I briefly considered moving the field to a separate two-column table with a numeric primary key. That way the composite key could include the (much smaller) foreign key, and uniqueness of the string itself could be ensured by a single-column constraint, if that makes a difference. Your ideas about structural variants and involving the validator sound promising too though!

fedde-s commented 6 years ago

The quickest fix to write would presumably be to make the validator (and documentation in File-Formats.md) disallow studies that contain 256+-character mutations, pushing the decision of whether to represent them in some different way or discard them to the data curators.

jjgao commented 6 years ago

What would happen if we remove the unique key constraint?

khzhu commented 6 years ago

@jjgao @fedde-s @sheridancbio , I think we should keep the unique key constraint because I have a solution for this. Let me explain in detail.

The character limit depends on the character set you use, If you use latin1 then the largest column you can index is varchar(767=255*3+2), but if you use utf8 then the limit becomes varchar(255). By default, mysql 8 uses utf8.

mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_system | utf8  |
+----------------------+-------+
1 row in set (0.00 sec)

There is also a separate 3072 byte limit per index, so you can include multiple columns (each 255 or smaller) up to 3072 total bytes. But a column can be no longer than 255 bytes if you use utf8.

Since, the mutation_event violets this rule, we got an "Specified key was too long; max key length is 3072 bytes" when running the cgds.sql script.

mysql> CREATE TABLE `mutation_event` (   `MUTATION_EVENT_ID` int(255) NOT NULL auto_increment,   `ENTREZ_GENE_ID` int(11) NOT NULL,   `CHR` varchar(5),   `START_POSITION` bigint(20),   `END_POSITION` bigint(20),   `REFERENCE_ALLELE` varchar(400),   `TUMOR_SEQ_ALLELE` varchar(400),   `PROTEIN_CHANGE` varchar(255),   `MUTATION_TYPE` varchar(255) COMMENT 'e.g. Missense, Nonsence, etc.',   `FUNCTIONAL_IMPACT_SCORE` varchar(50) COMMENT 'Result from OMA/XVAR.',   `FIS_VALUE` float,   `LINK_XVAR` varchar(500) COMMENT 'Link to OMA/XVAR Landing Page for the specific mutation.',   `LINK_PDB` varchar(500),   `LINK_MSA` varchar(500),   `NCBI_BUILD` varchar(10),   `STRAND` varchar(2),   `VARIANT_TYPE` varchar(15),   `DB_SNP_RS` varchar(25),   `DB_SNP_VAL_STATUS` varchar(255),   `ONCOTATOR_DBSNP_RS` varchar(255),   `ONCOTATOR_REFSEQ_MRNA_ID` varchar(64),   `ONCOTATOR_CODON_CHANGE` varchar(255),   `ONCOTATOR_UNIPROT_ENTRY_NAME` varchar(64),   `ONCOTATOR_UNIPROT_ACCESSION` varchar(64),   `ONCOTATOR_PROTEIN_POS_START` int(11),   `ONCOTATOR_PROTEIN_POS_END` int(11),   `CANONICAL_TRANSCRIPT` boolean,   `KEYWORD` varchar(50) DEFAULT NULL COMMENT 'e.g. truncating, V200 Missense, E338del, ',   KEY (`KEYWORD`),   PRIMARY KEY (`MUTATION_EVENT_ID`),   UNIQUE (`CHR`, `START_POSITION`, `END_POSITION`, `TUMOR_SEQ_ALLELE`, `ENTREZ_GENE_ID`, `PROTEIN_CHANGE`, `MUTATION_TYPE`),   FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`) );
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes

So, the solution is to overwrite latin1 character set for the mutation_event table. (I could not really think of any reason to have utf8 to this table.) This will allow us to have 767 bytes per column, including multiple columns (each 767 bytes or smaller) up to a total of 3072 bytes per index.

kelseyzhu@tpugh28 cbioportal (master) $ git diff db-scripts/src/main/resources/cgds.sql
diff --git a/db-scripts/src/main/resources/cgds.sql b/db-scripts/src/main/resources/cgds.sql
index 97160821f..7dd1b7d75 100644
--- a/db-scripts/src/main/resources/cgds.sql
+++ b/db-scripts/src/main/resources/cgds.sql
@@ -425,7 +425,7 @@ CREATE TABLE `mutation_event` (
   PRIMARY KEY (`MUTATION_EVENT_ID`),
   UNIQUE (`CHR`, `START_POSITION`, `END_POSITION`, `TUMOR_SEQ_ALLELE`, `ENTREZ_GENE_ID`, `PROTEIN_CHANGE`, `MUTATION_TYPE`),
   FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`)
-) COMMENT='Mutation Data';
+) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;

Hope this makes sense. The pull request should be ready on Monday for your review. Thank you!

jjgao commented 6 years ago

@khzhu Thank you! Very well explained! I don't see a reason to use utf8 for this table either. Could you create a pull request for the database schema change with sql migration code?

khzhu commented 6 years ago

@jjgao ,thank you! I will work on this next week.

fedde-s commented 6 years ago

Nice! That way the institute-specific text fields (such as clinical attributes and their string values, or study names and descriptions, or even cancer type names) can use UTF-8, while this particular column (which I believe should hold a variation on the ASCII-only HGVS protein change notation extended with a _SPLICE notation that used to be produced by MSK's pipeline) can keep working as it used to.

sheridancbio commented 6 years ago

I remember that we debated whether to "take control" of the engine and character set, specifying it in the cgds.sql schema or not. I think we decided that we would give guidance but let the end users have freedom to specify what character set and engine configuration that they wished to select. Some users currently have a mix of latin1 and utf8 character sets, depending on when they migrated their databases and what version of mysql they were running at the time.

If we want to take control over the character set (building it in to the cgds scripts explicitly) me might want to consider whether we want to specify it for all tables. Also consider that tables have a default character set, but each field in a table can have its own character set. So migrating from one character set to another (in a smart way) may involve converting the character set for all fields and also setting a default character set for the table. This could get messy if we want the migration script to work correctly for all implementations (such as when the installation has a mixture of character sets). Handling/checking conversion of utf8 to latin1 may also be something to consider .. since a utf8 value may not be convertible into latin1. I think there are some built in conversion policy choices available in mysql.

jjgao commented 6 years ago

@sheridancbio without making it too complicated, I think it's probably ok to switch to latin1 for mutation_event. I don't think non-ASCII characters would be used in this table.

Let's not touch other tables for now?

pieterlukasse commented 6 years ago

@jjgao I agree to switch to latin1, only for this table.

fedde-s commented 6 years ago

As far as I'm aware, the text fields in this table tend to follow restricted ASCII-only schemas on which the validator, loader, web API or frontend depend? Although I'm particularly unsure about the KEYWORDS field.

khzhu commented 6 years ago

I've done my testing on importer scripts, web APIs (with some minor adjustments, mysql 8 requires a quote around fields that are the same as those mysql reserved keywords such as "rank" or "groups"), and frontend react codes, are not affected by this change. Only thing I noticed was H2 database used by MyBatis persistence test suites not working with the new MySql 8 JDBC driver, so is the Maven SQL plug-in not happy with the change. Need to investigate more on those issues.

sheridancbio commented 6 years ago

If there is consensus that we want to give an explicit 'latin1' character set for particular fields in mutation_event, then I don't want to stand in the way. But there are a few things to keep in mind:

I think there is a fundamental unaddressed data representation issue here. What is our definition of a mutation? How many nucleotides can be deleted or inserted? I remember searching MAF files and seeing entries that had thousands of inserted or deleted nucleotides listed in the alleles. With a very long deletion or insertion, the nucleotide string is just not going to fit into this table. I think originally the field we are discussing had length 255, but was lengthened (by @angelicaochoa I believe) to allow us to capture longer indels. If the length had remained 255 we would not have a problem ... and one possible solution to the current problem without specifying a character set would be to reduce the field length to 255. When I did that for our production database (which is utf8 innodb) I checked whether we were "losing" any long indels and we were not. If we answer the question "What is the longest length insert or delete that we will accept as a mutation event?" maybe the answer is 255 or less ... and if it is more that 255 then we should be doing some thinking about what the actual limit is. If we want to be able to represent an insertion of 10000 NT then we will need another mechanism for doing that anyway --- and solving that problem would probably also solve this problem. If the answer is something between 255 and 768 then I guess the latin1 encoding could be the right way to go -- but then at least then we would know to code up the validator to reject tumor_seq_allele entries that have 800 NT.
if we go forward with specifying latin1, we should be careful. In mysql there is both a way to specify the default character set for a table, and also a way to specify the character set for particular fields. There can be a mix of character sets in the table ... some fields having latin1 and some fields having utf8. So be sure to know whether we are taking control over the character set for particular fields, and also whether we are taking control over the default character set for the table. This will affect the future evolution of the mutation event table .. because newly added fields will receive the default character set for the table if we don't explicitly define the character set when adding new fields.
migration may be tricky. when writing the migration code in migration.sql, the code will need to work on a variety of databases ... some where the end user's database already has these fields as latin1, and some where the fields are utf8. There are conversion functions that convert fields between latin1 and utf8, but they may be fragile --- particularly when converting from utf8 to latin1 (because not every utf8 character is representable in latin1). Maybe you can argue that "utf8 representation of these fields was never supported in the first place ... because the index component bit limit would have prevented it" .. but I think we gave installation guidance that discussed these issues and suggested reducing the field length to allow using utf8 character set encoding.
we wanted to keep things simple here at mskcc, so we chose a policy of "all tables will be innodb engine and all text fields will be utf8". It's not too big of a deal for us to give up on that simplicity and allow some fields to be forced to latin1 ... but I wonder if there might be a functional reason why someone wanted to have utf8 characters in their allele strings? By forcing latin1 we are preventing that.

Again .. I'm not trying to obstruct progress. Just trying to share the considerations that occur to me.

n1zea144 commented 6 years ago

Sorry I'm late to the party. I think Rob did a good job describing our thought process and considerations when we last visited this topic here at MSK.

I think changing the definition of this table to address this issue is a bit hackish and the real solutions are along the lines of what Fedde and Rob have suggested - that is to address the case when we have larger mutation events.

In general, I'd prefer not to take control (or selective control) of the engine/charset and let the local groups decide what makes sense their installation. I think providing documentation which describes this issue and how to resolve it would suffice - in fact, this user in 2478 figured out the problem on their own. I'd rather reduce the length of the varchar fields in mutation_event table than specify the charset.

My two cents....

jjgao commented 6 years ago

Thanks, @sheridancbio and @n1zea144 .

@khzhu @fedde-s @priti88: would you please check what's the longest length of TUMOR_SEQ_ALLELE in your local databases? I am thinking maybe the simplest solution is reducing it to 255.

khzhu commented 6 years ago

Hi @jjgao , the longest one we have is 144 bytes.

mysql> select max(length(TUMOR_SEQ_ALLELE)) from mutation_event;
+-------------------------------+
| max(length(TUMOR_SEQ_ALLELE)) |
+-------------------------------+
|                           144 |
+-------------------------------+
1 row in set (0.03 sec)

pieterlukasse commented 6 years ago

@jjgao we will have to run this by each of the installations, some of which we don't have direct access to. Will take some time... But if it was increased to 400 by @angelicaochoa (https://github.com/cBioPortal/cbioportal/commit/8bff1f3bfeb8612b80712b44e52eb45adfcfe6ce) , then there must be some cases on your side that are > 255, right?

pieterlukasse commented 6 years ago

@jjgao got a reply from one of our clients: max length is 400....

sheridancbio commented 6 years ago

@pieterlukasse We think that the "long" indel cases were from a particular dataset that was evaluated in our portal but not deployed as a regular study. That study is not present in our production databases (afaik).

Separately, if your client had a TUMOR_SEQ_ALLELE field with 400 NT in it, it seems very likely that they had other mutation events which were longer and were either truncated (perhaps misleadingly) to fit into the 400 character max field size, or were dropped because they could not be properly imported into the mutation event table. Is there a way that we could find out the maximum size indel that was present in their source data (their MAF files)?

khzhu commented 6 years ago

Hi @jjgao @n1zea144 @sheridancbio @pieterlukasse @fedde-s I got mysql8 support code ready for your review. maven SQL plug0in issues solved, all unit tests passed. Attached screen shot shows you the fields I adjusted to limit the unique key to 3072 bytes. With a few changes to the code base, we can upgrade from mySQL 5.7 to mySQL 8 to empower our portal with a faster and more reliable database. At PM, We have been using mySQL 8 for almost two months no noticeable issues reported.

jjgao commented 6 years ago

Thank you, @khzhu.

Everyone: do we all agree that we should shorten REFERENCE_ALLELE and TUMOR_SEQ_ALLELE to 255 as the default? For portal instances with longer alleles, they would have to manually change that and change the to ascii.

The migration code can probably smart: only migrate when the max is less than 255?

khzhu commented 6 years ago

please note MUTATION_TYPE also downsized from 255 to 64 to keep the unique key not exceed 3072 bytes limit.

sheridancbio commented 6 years ago

@khzhu about MUTATION_TYPE ... I think the restriction was that any single element of the key cannot exceed 3072 bits ... not that the total of all elements must be below 3072 bits. So reducing MUTATION_TYPE may not be needed. (although maybe it would be good to reduce it anyway) I'll look for you in the production databases and see what the longest string length is in our mutation_event table for this string.

Additionally, I think that maybe the LINK_XVAR, LINK_PDB, LINK_MSA fields can be completely dropped ... these fields were for a legacy api that I think (not positive) has been replaced by calls to genome-nexus. (cc: @onursumer ?)

sheridancbio commented 6 years ago

The longest entry in field MUTATION_TYPE in the primary production database here is "splice_acceptor_variant&splice_donor_variant&intron_variant" ... 59 characters. entrez_gene_id: 26168, chr: 17, start_position: 7470286, reference_allele: A, tumor_seq_allele: C, db_snp_rs: rs111576270;COSM4429607

khzhu commented 6 years ago

Hi @sheridancbio , the total of all elements of unique key must be below 3072 bytes, so we have to either downsize MUTATION_TYPE or PROTEIN_CHANGE from 255 to 64. MUTATION_TYPE probably not needs that big. Please let me know, thank you!

khzhu commented 6 years ago

@fedde-s , can you please let us know the longest MUTATION_TYPE you have?

fedde-s commented 6 years ago

@fedde-s , can you please let us know the longest MUTATION_TYPE you have?

I'm currently on vacation. Pieter might be able to, but even if he can it will take time as we maintain many instances, and not all of them can be readily inspected from our side. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.

onursumer commented 6 years ago

@sheridancbio It should be safe to drop LINK_XVAR, LINK_PDB, and LINK_MSA. We are getting the mutation assessor data via Genome Nexus. We just need to make sure that we don't break the current cBioPortal API.

sheridancbio commented 6 years ago

@khzhu I ran a quick test and you are correct about the limit on overall (aggregate) length of the unique key. I'm not sure how we are adding up to 3072 bytes however. I see from this webpage: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html that utf8 can be represented in either a 3byte per character of a 4byte per character encoding. But the same webpage says that if you use a 4 byte encoding then the maximum length of a varchar field that is part of an index would be 191 instead of 255, so using a 4 byte encoding would not allow for TUMOR_SEQ_ALLELE to be varchar(255). If you are using a 3 byte encoding then I think the math looks something like this: UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE), 15 bytes : CHR varchar(5) 8 bytes : START_POSITION bigint(20), 8 bytes : END_POSITION bigint(20), 765 bytes : TUMOR_SEQ_ALLELE varchar(255), 4 bytes : ENTREZ_GENE_ID int(11), 765 bytes : PROTEIN_CHANGE varchar(255), 765 bytes : MUTATION_TYPE varchar(255)

total size : 2330 bytes

khzhu commented 6 years ago

@sheridancbio , by default MySQL uses utf8mb4 to support 4 bytes UTF8 characters/emoji, so (5(chr)+255(TUMOR_SEQ_ALLELE)+255(PROTEIN_CHANGE)+255(MUTATION_TYPE) )*4 = 3080 already > 3072.

mysql> SHOW VARIABLES LIKE '%utf8%';
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
1 row in set (0.00 sec)

khzhu commented 6 years ago

some interesting stats to share.

jjgao commented 6 years ago

Since we have so much trouble with this unique key: UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE), do we really need it?

I can see two reasons: 1) data integrity; 2) performance.

for # 1, I think we can ensure it in the import script. And I think we are doing that.

For # 2, I don't think we are pulling data based on this key. Are we?

khzhu commented 6 years ago

That is a good question. I personally trust more on the unique key to provide the uniqueness for a set of columns. Besides the unique index also improves an estimate of join cardinality when you have a multi-column.

jjgao commented 6 years ago

@khzhu I understand your first point but the uniqueness has to be implemented in the import script. The DB constraint is just to enforce it (throw an error if the constraint is violated) -- maybe a database validation script can do the trick? And I don't think it'll have a big impact in term of features even if they are not unique.

I don't really understand the second point. We are not using this key for joining other tables if that's what you mean.

I agree that if the unique key costs nothing, it does not hurt to have it, but looking at the issue it has created and all the discussions and time we spent :)... So the real question is whether there is a justifiable benefit to keep it?

Maybe we could just try removing it and look at the impact?

@n1zea144 @sheridancbio @pieterlukasse @fedde-s: thoughts?

khzhu commented 6 years ago

Hi @jjgao, when you run explain query or query optimizer, the relation cardinality will be looked up to generate underlying SQL join query reports. Those reports often show you how to optimize or improve your queries.

jjgao commented 6 years ago

I still think we should try to remove UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE). We could try it on one of our production databases and see if there is any impact, and then make it into the migrations sql.

Anyone disagree?

sheridancbio commented 6 years ago

I do not disagree. (and I agree ... at least I think I would prefer this option rather than trying to manage the character encoding in the migration scripts). We could maybe also add code to the DaoMutation class (and/or the bulk loader) to check for collisions on these index fields before records get imported. We might speed that up by using the existing key for ENTREZ_GENE_ID, subsetting the existing table first on mutations for just for the involved genes, then scanning the records against a constructed hash table for the records to be imported (hashing a string concatenation of the fields which we currently have in the unique key). If we want to make this even faster, we could add another index, such as KEY(ENTREZ_GENE_ID, PROTEIN_CHANGE) and subset the existing table records using that.

jjgao commented 6 years ago

@sheridancbio Thanks! Could you add this to your backlog? Adding keys is also a good idea.

sheridancbio commented 5 years ago

@jjgao @khzhu This issue has resurfaced here at MSKCC. We are now receiving (from our automated dmp sequencing pipeline feed) indels of length 1023. So we see cases where REFERENCE_ALLELE and TUMOR_SEQ_ALLELE1 in our imported mutation file (data_mutations_extended.txt) are over the limit of varchar(400) {the official cgds.sql field length} not to mention the varchar(255) limit that we have been using for our production databases. In order to store the full information for these longer events, we think a more fundamental change is needed in the database schema.

Talking it over, @angelicaochoa had the idea proposed by @fedde-s earlier in this thread (https://github.com/cBioPortal/cbioportal/issues/4345#issuecomment-395793231) .. that we should have a separate table holding allele strings. Those allele strings would be of mysql type "text" so there would be no upper limit on length, and there would be a simple numeric primary key. The allele strings would be referenced in the tables mutation and mutation_event by the numeric primary key and the allele table would hold unique strings.

This would add a small performance overhead for the additional join, but we believe that should be minor.

Doing this would eliminate the problem of having an upper limit on allele lengths. We would like to develop a new PR along these lines, which would incorporate any other changes from #4396 from @khzhu in order to make the database installable under mysql release >= 8.0 with the default schema in cgds.sql, without requiring us to take control over the character encoding for tables/fields.

All comments welcome. We will put the effort of developing this approach in the backlog for the pipelines scrum team, but want to see if there is general approval for this approach. We can discuss pros / cons / alternatives here too.

cc: @n1zea144 @averyniceday

jjgao commented 5 years ago

@sheridancbio @n1zea144 I still think we should just remove the UNIQUE key instead of complicating the schema.

sheridancbio commented 5 years ago

@jjgao Do you think we should truncate the reported Reference_Allele values in data_mutation_extended.txt to fit inside the 255 character column? Or should we filter out deletions where the Reference_Allele is over 255 characters during import? We will not be able to store these mutations in the current table.

jjgao commented 5 years ago

@sheridancbio that wasn't what I meant. I thought if we remove the UNIQUE constraint, we will solve the issue documented here and then we could use as many characters as we want. If I did not misunderstand, the problem is that the UNIQUE key is too large to be accepted by MySQL. So we could do:

delete the line UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE) https://github.com/cBioPortal/cbioportal/blob/master/db-scripts/src/main/resources/cgds.sql#L424
Increase TUMOR_SEQ_ALLELE to varchar(5000)

sheridancbio commented 5 years ago

@jjgao I think I understand. You would prefer to keep the allele values for long indels inside the mutation and mutation_event tables rather than externalizing them to a separate table.

If we did that, we may run into the maximum record size limit ... see https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html in the paragraph starting with "The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes...". Because UTF8 characters are multi-byte, the total number of characters that can fit in a single record may be closer to 21845 or 16348. The mutation table has several fields that can hold allele strings from the MAF file: REFERENCE_ALLELE, TUMOR_SEQ_ALLELE1, TUMOR_SEQ_ALLELE2, MATCH_NORM_SEQ_ALLELE1, MATCH_NORM_SEQ_ALLELE2, TUMOR_VALIDATION_ALLELE1, TUMOR_VALIDATION_ALLELE2, MATCH_NORM_VALIDATION_ALLELE1, MATCH_NORM_VALIDATION_ALLELE2. If all of these fields held varchar(5000) we would run into the maximum record size problem. Also, we would again have a problem when we saw a deletion of 6000nt in the MAF. @angelicaochoa has done some testing with a field type of "text" for allele strings. I'll test whether text fields have the same maximum record size problem. If not, maybe using the "text" type for these fields and removing the UNIQUE key would be a good compromise.

sheridancbio commented 5 years ago

I have confirmed that text fields avoid the maximum row/record size issue. I was able to create a table with three text fields and put 20000 utf-8 encoded characters into the first two fields and 30000 utf-8 encoded characters into the third field.

When trying to create a table with three varchar fields of length 10000, I got this error: mysql> create table test (f1 varchar(10000), f2 varchar(10000), f3 varchar(10000)) default charset=utf8; ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs

jjgao commented 5 years ago

@sheridancbio Thanks for the information and testing it out -- mysql is so complex. I think it's ok to switch to TEXT as long as the performance is not affected.

sheridancbio commented 5 years ago

From they mysql documentation (see https://dev.mysql.com/doc/refman/8.0/en/blob.html) the one performance hit to worry about is when querying a temporary table which contains TEXT or BLOB elements. I think this would be when you do something like "select ... from (select ... from mutation join sample on ... where ...)" The parenthesized sub expression would be the temporary table. I'm guessing that our main persistence-mybatis code does not do this, but I have not yet scanned it. cc: @angelicaochoa @jjgao

cBioPortal / cbioportal

Database schema is incompatible with utf8, the default in MySQL 8 #4345