Closed fedde-s closed 4 years ago
Thanks, @fedde-s.
@khzhu you mentioned that you are using MySQL8. Would you mind sending your changes as a pull request?
cc'ing @n1zea144 and @sheridancbio
@jjgao, @fedde-s , no problems, I will create pull request to address this. Thank you!
Thank you, @khzhu
Just catching up with this issue again. It sounds like we are moving towards reducing the tumor_seq_allele field width to 255 characters (which is something we did in our production databases, and diverges from the schema specified in cgds.sql). I think this is probably fine, but I think the reason the length of this field was originally raised above 255 was to capture some very long insertion / deletion events that we were getting through sequencing. I think @angelicaochoa made these adjustments in response to some missing or truncated mutation events we were processing.
I guess in conjunction with this reduction to 255 we should ask the question : what do we do when a mutation event is imported that has an insertion of more than 255 nucleotides? We could make this part of the study validator ... that no mutation event contains an indel with more than this limit of nucleotides reported. Maybe an insert above a certain length gets converted into a structural variant? Or we can store very long insertions in a separate table perhaps? We probably should not be silently truncating or discarding such events. This discussion should probably be in a separate issue. cc: @n1zea144
I briefly considered moving the field to a separate two-column table with a numeric primary key. That way the composite key could include the (much smaller) foreign key, and uniqueness of the string itself could be ensured by a single-column constraint, if that makes a difference. Your ideas about structural variants and involving the validator sound promising too though!
The quickest fix to write would presumably be to make the validator (and documentation in File-Formats.md
) disallow studies that contain 256+-character mutations, pushing the decision of whether to represent them in some different way or discard them to the data curators.
What would happen if we remove the unique key constraint?
@jjgao @fedde-s @sheridancbio , I think we should keep the unique key constraint because I have a solution for this. Let me explain in detail.
The character limit depends on the character set you use, If you use latin1 then the largest column you can index is varchar(767=255*3+2), but if you use utf8 then the limit becomes varchar(255). By default, mysql 8 uses utf8.
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_system | utf8 |
+----------------------+-------+
1 row in set (0.00 sec)
There is also a separate 3072 byte limit per index, so you can include multiple columns (each 255 or smaller) up to 3072 total bytes. But a column can be no longer than 255 bytes if you use utf8.
Since, the mutation_event violets this rule, we got an "Specified key was too long; max key length is 3072 bytes" when running the cgds.sql script.
mysql> CREATE TABLE `mutation_event` ( `MUTATION_EVENT_ID` int(255) NOT NULL auto_increment, `ENTREZ_GENE_ID` int(11) NOT NULL, `CHR` varchar(5), `START_POSITION` bigint(20), `END_POSITION` bigint(20), `REFERENCE_ALLELE` varchar(400), `TUMOR_SEQ_ALLELE` varchar(400), `PROTEIN_CHANGE` varchar(255), `MUTATION_TYPE` varchar(255) COMMENT 'e.g. Missense, Nonsence, etc.', `FUNCTIONAL_IMPACT_SCORE` varchar(50) COMMENT 'Result from OMA/XVAR.', `FIS_VALUE` float, `LINK_XVAR` varchar(500) COMMENT 'Link to OMA/XVAR Landing Page for the specific mutation.', `LINK_PDB` varchar(500), `LINK_MSA` varchar(500), `NCBI_BUILD` varchar(10), `STRAND` varchar(2), `VARIANT_TYPE` varchar(15), `DB_SNP_RS` varchar(25), `DB_SNP_VAL_STATUS` varchar(255), `ONCOTATOR_DBSNP_RS` varchar(255), `ONCOTATOR_REFSEQ_MRNA_ID` varchar(64), `ONCOTATOR_CODON_CHANGE` varchar(255), `ONCOTATOR_UNIPROT_ENTRY_NAME` varchar(64), `ONCOTATOR_UNIPROT_ACCESSION` varchar(64), `ONCOTATOR_PROTEIN_POS_START` int(11), `ONCOTATOR_PROTEIN_POS_END` int(11), `CANONICAL_TRANSCRIPT` boolean, `KEYWORD` varchar(50) DEFAULT NULL COMMENT 'e.g. truncating, V200 Missense, E338del, ', KEY (`KEYWORD`), PRIMARY KEY (`MUTATION_EVENT_ID`), UNIQUE (`CHR`, `START_POSITION`, `END_POSITION`, `TUMOR_SEQ_ALLELE`, `ENTREZ_GENE_ID`, `PROTEIN_CHANGE`, `MUTATION_TYPE`), FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`) );
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
So, the solution is to overwrite latin1 character set for the mutation_event table. (I could not really think of any reason to have utf8 to this table.) This will allow us to have 767 bytes per column, including multiple columns (each 767 bytes or smaller) up to a total of 3072 bytes per index.
kelseyzhu@tpugh28 cbioportal (master) $ git diff db-scripts/src/main/resources/cgds.sql
diff --git a/db-scripts/src/main/resources/cgds.sql b/db-scripts/src/main/resources/cgds.sql
index 97160821f..7dd1b7d75 100644
--- a/db-scripts/src/main/resources/cgds.sql
+++ b/db-scripts/src/main/resources/cgds.sql
@@ -425,7 +425,7 @@ CREATE TABLE `mutation_event` (
PRIMARY KEY (`MUTATION_EVENT_ID`),
UNIQUE (`CHR`, `START_POSITION`, `END_POSITION`, `TUMOR_SEQ_ALLELE`, `ENTREZ_GENE_ID`, `PROTEIN_CHANGE`, `MUTATION_TYPE`),
FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`)
-) COMMENT='Mutation Data';
+) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
Hope this makes sense. The pull request should be ready on Monday for your review. Thank you!
@khzhu Thank you! Very well explained! I don't see a reason to use utf8 for this table either. Could you create a pull request for the database schema change with sql migration code?
@jjgao ,thank you! I will work on this next week.
Nice! That way the institute-specific text fields (such as clinical attributes and their string values, or study names and descriptions, or even cancer type names) can use UTF-8, while this particular column (which I believe should hold a variation on the ASCII-only HGVS protein change notation extended with a _SPLICE
notation that used to be produced by MSK's pipeline) can keep working as it used to.
I remember that we debated whether to "take control" of the engine and character set, specifying it in the cgds.sql schema or not. I think we decided that we would give guidance but let the end users have freedom to specify what character set and engine configuration that they wished to select. Some users currently have a mix of latin1 and utf8 character sets, depending on when they migrated their databases and what version of mysql they were running at the time.
If we want to take control over the character set (building it in to the cgds scripts explicitly) me might want to consider whether we want to specify it for all tables. Also consider that tables have a default character set, but each field in a table can have its own character set. So migrating from one character set to another (in a smart way) may involve converting the character set for all fields and also setting a default character set for the table. This could get messy if we want the migration script to work correctly for all implementations (such as when the installation has a mixture of character sets). Handling/checking conversion of utf8 to latin1 may also be something to consider .. since a utf8 value may not be convertible into latin1. I think there are some built in conversion policy choices available in mysql.
@sheridancbio without making it too complicated, I think it's probably ok to switch to latin1 for mutation_event
. I don't think non-ASCII characters would be used in this table.
Let's not touch other tables for now?
@jjgao I agree to switch to latin1, only for this table.
As far as I'm aware, the text fields in this table tend to follow restricted ASCII-only schemas on which the validator, loader, web API or frontend depend? Although I'm particularly unsure about the KEYWORDS
field.
I've done my testing on importer scripts, web APIs (with some minor adjustments, mysql 8 requires a quote around fields that are the same as those mysql reserved keywords such as "rank" or "groups"), and frontend react codes, are not affected by this change. Only thing I noticed was H2 database used by MyBatis persistence test suites not working with the new MySql 8 JDBC driver, so is the Maven SQL plug-in not happy with the change. Need to investigate more on those issues.
If there is consensus that we want to give an explicit 'latin1' character set for particular fields in mutation_event, then I don't want to stand in the way. But there are a few things to keep in mind:
Again .. I'm not trying to obstruct progress. Just trying to share the considerations that occur to me.
Sorry I'm late to the party. I think Rob did a good job describing our thought process and considerations when we last visited this topic here at MSK.
I think changing the definition of this table to address this issue is a bit hackish and the real solutions are along the lines of what Fedde and Rob have suggested - that is to address the case when we have larger mutation events.
In general, I'd prefer not to take control (or selective control) of the engine/charset and let the local groups decide what makes sense their installation. I think providing documentation which describes this issue and how to resolve it would suffice - in fact, this user in 2478 figured out the problem on their own. I'd rather reduce the length of the varchar fields in mutation_event table than specify the charset.
My two cents....
Thanks, @sheridancbio and @n1zea144 .
@khzhu @fedde-s @priti88: would you please check what's the longest length of TUMOR_SEQ_ALLELE
in your local databases? I am thinking maybe the simplest solution is reducing it to 255.
Hi @jjgao , the longest one we have is 144 bytes.
mysql> select max(length(TUMOR_SEQ_ALLELE)) from mutation_event;
+-------------------------------+
| max(length(TUMOR_SEQ_ALLELE)) |
+-------------------------------+
| 144 |
+-------------------------------+
1 row in set (0.03 sec)
@jjgao we will have to run this by each of the installations, some of which we don't have direct access to. Will take some time... But if it was increased to 400 by @angelicaochoa (https://github.com/cBioPortal/cbioportal/commit/8bff1f3bfeb8612b80712b44e52eb45adfcfe6ce) , then there must be some cases on your side that are > 255, right?
@jjgao got a reply from one of our clients: max length is 400....
@pieterlukasse We think that the "long" indel cases were from a particular dataset that was evaluated in our portal but not deployed as a regular study. That study is not present in our production databases (afaik).
Separately, if your client had a TUMOR_SEQ_ALLELE field with 400 NT in it, it seems very likely that they had other mutation events which were longer and were either truncated (perhaps misleadingly) to fit into the 400 character max field size, or were dropped because they could not be properly imported into the mutation event table. Is there a way that we could find out the maximum size indel that was present in their source data (their MAF files)?
Hi @jjgao @n1zea144 @sheridancbio @pieterlukasse @fedde-s I got mysql8 support code ready for your review. maven SQL plug0in issues solved, all unit tests passed. Attached screen shot shows you the fields I adjusted to limit the unique key to 3072 bytes. With a few changes to the code base, we can upgrade from mySQL 5.7 to mySQL 8 to empower our portal with a faster and more reliable database. At PM, We have been using mySQL 8 for almost two months no noticeable issues reported.
Thank you, @khzhu.
Everyone: do we all agree that we should shorten REFERENCE_ALLELE
and TUMOR_SEQ_ALLELE
to 255 as the default? For portal instances with longer alleles, they would have to manually change that and change the to ascii.
The migration code can probably smart: only migrate when the max is less than 255?
please note MUTATION_TYPE also downsized from 255 to 64 to keep the unique key not exceed 3072 bytes limit.
@khzhu about MUTATION_TYPE ... I think the restriction was that any single element of the key cannot exceed 3072 bits ... not that the total of all elements must be below 3072 bits. So reducing MUTATION_TYPE may not be needed. (although maybe it would be good to reduce it anyway) I'll look for you in the production databases and see what the longest string length is in our mutation_event table for this string.
Additionally, I think that maybe the LINK_XVAR, LINK_PDB, LINK_MSA fields can be completely dropped ... these fields were for a legacy api that I think (not positive) has been replaced by calls to genome-nexus. (cc: @onursumer ?)
The longest entry in field MUTATION_TYPE in the primary production database here is "splice_acceptor_variant&splice_donor_variant&intron_variant" ... 59 characters. entrez_gene_id: 26168, chr: 17, start_position: 7470286, reference_allele: A, tumor_seq_allele: C, db_snp_rs: rs111576270;COSM4429607
Hi @sheridancbio , the total of all elements of unique key must be below 3072 bytes, so we have to either downsize MUTATION_TYPE or PROTEIN_CHANGE from 255 to 64. MUTATION_TYPE probably not needs that big. Please let me know, thank you!
@fedde-s , can you please let us know the longest MUTATION_TYPE you have?
@fedde-s , can you please let us know the longest MUTATION_TYPE you have?
I'm currently on vacation. Pieter might be able to, but even if he can it will take time as we maintain many instances, and not all of them can be readily inspected from our side. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
@sheridancbio It should be safe to drop LINK_XVAR
, LINK_PDB
, and LINK_MSA
. We are getting the mutation assessor data via Genome Nexus. We just need to make sure that we don't break the current cBioPortal API.
@khzhu I ran a quick test and you are correct about the limit on overall (aggregate) length of the unique key.
I'm not sure how we are adding up to 3072 bytes however. I see from this webpage: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html
that utf8 can be represented in either a 3byte per character of a 4byte per character encoding. But the same webpage says that if you use a 4 byte encoding then the maximum length of a varchar field that is part of an index would be 191 instead of 255, so using a 4 byte encoding would not allow for TUMOR_SEQ_ALLELE to be varchar(255). If you are using a 3 byte encoding then I think the math looks something like this:
UNIQUE (CHR
, START_POSITION
, END_POSITION
, TUMOR_SEQ_ALLELE
, ENTREZ_GENE_ID
, PROTEIN_CHANGE
, MUTATION_TYPE
),
15 bytes : CHR
varchar(5)
8 bytes : START_POSITION
bigint(20),
8 bytes : END_POSITION
bigint(20),
765 bytes : TUMOR_SEQ_ALLELE
varchar(255),
4 bytes : ENTREZ_GENE_ID
int(11),
765 bytes : PROTEIN_CHANGE
varchar(255),
765 bytes : MUTATION_TYPE
varchar(255)
total size : 2330 bytes
@sheridancbio , by default MySQL uses utf8mb4 to support 4 bytes UTF8 characters/emoji, so (5(chr)+255(TUMOR_SEQ_ALLELE)+255(PROTEIN_CHANGE)+255(MUTATION_TYPE) )*4 = 3080 already > 3072.
mysql> SHOW VARIABLES LIKE '%utf8%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
1 row in set (0.00 sec)
some interesting stats to share.
Since we have so much trouble with this unique key: UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE)
, do we really need it?
I can see two reasons: 1) data integrity; 2) performance.
for # 1, I think we can ensure it in the import script. And I think we are doing that.
For # 2, I don't think we are pulling data based on this key. Are we?
That is a good question. I personally trust more on the unique key to provide the uniqueness for a set of columns. Besides the unique index also improves an estimate of join cardinality when you have a multi-column.
@khzhu I understand your first point but the uniqueness has to be implemented in the import script. The DB constraint is just to enforce it (throw an error if the constraint is violated) -- maybe a database validation script can do the trick? And I don't think it'll have a big impact in term of features even if they are not unique.
I don't really understand the second point. We are not using this key for joining other tables if that's what you mean.
I agree that if the unique key costs nothing, it does not hurt to have it, but looking at the issue it has created and all the discussions and time we spent :)... So the real question is whether there is a justifiable benefit to keep it?
Maybe we could just try removing it and look at the impact?
@n1zea144 @sheridancbio @pieterlukasse @fedde-s: thoughts?
Hi @jjgao, when you run explain query or query optimizer, the relation cardinality will be looked up to generate underlying SQL join query reports. Those reports often show you how to optimize or improve your queries.
I still think we should try to remove UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE)
. We could try it on one of our production databases and see if there is any impact, and then make it into the migrations sql.
Anyone disagree?
I do not disagree. (and I agree ... at least I think I would prefer this option rather than trying to manage the character encoding in the migration scripts). We could maybe also add code to the DaoMutation class (and/or the bulk loader) to check for collisions on these index fields before records get imported. We might speed that up by using the existing key for ENTREZ_GENE_ID, subsetting the existing table first on mutations for just for the involved genes, then scanning the records against a constructed hash table for the records to be imported (hashing a string concatenation of the fields which we currently have in the unique key). If we want to make this even faster, we could add another index, such as KEY(ENTREZ_GENE_ID, PROTEIN_CHANGE) and subset the existing table records using that.
@sheridancbio Thanks! Could you add this to your backlog? Adding keys is also a good idea.
@jjgao @khzhu This issue has resurfaced here at MSKCC. We are now receiving (from our automated dmp sequencing pipeline feed) indels of length 1023. So we see cases where REFERENCE_ALLELE and TUMOR_SEQ_ALLELE1 in our imported mutation file (data_mutations_extended.txt) are over the limit of varchar(400) {the official cgds.sql field length} not to mention the varchar(255) limit that we have been using for our production databases. In order to store the full information for these longer events, we think a more fundamental change is needed in the database schema.
Talking it over, @angelicaochoa had the idea proposed by @fedde-s earlier in this thread (https://github.com/cBioPortal/cbioportal/issues/4345#issuecomment-395793231) .. that we should have a separate table holding allele strings. Those allele strings would be of mysql type "text" so there would be no upper limit on length, and there would be a simple numeric primary key. The allele strings would be referenced in the tables mutation
and mutation_event
by the numeric primary key and the allele table would hold unique strings.
This would add a small performance overhead for the additional join, but we believe that should be minor.
Doing this would eliminate the problem of having an upper limit on allele lengths. We would like to develop a new PR along these lines, which would incorporate any other changes from #4396 from @khzhu in order to make the database installable under mysql release >= 8.0 with the default schema in cgds.sql, without requiring us to take control over the character encoding for tables/fields.
All comments welcome. We will put the effort of developing this approach in the backlog for the pipelines scrum team, but want to see if there is general approval for this approach. We can discuss pros / cons / alternatives here too.
cc: @n1zea144 @averyniceday
@sheridancbio @n1zea144 I still think we should just remove the UNIQUE
key instead of complicating the schema.
@jjgao Do you think we should truncate the reported Reference_Allele values in data_mutation_extended.txt to fit inside the 255 character column? Or should we filter out deletions where the Reference_Allele is over 255 characters during import? We will not be able to store these mutations in the current table.
@sheridancbio that wasn't what I meant. I thought if we remove the UNIQUE
constraint, we will solve the issue documented here and then we could use as many characters as we want. If I did not misunderstand, the problem is that the UNIQUE
key is too large to be accepted by MySQL. So we could do:
UNIQUE (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE, ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE)
https://github.com/cBioPortal/cbioportal/blob/master/db-scripts/src/main/resources/cgds.sql#L424TUMOR_SEQ_ALLELE
to varchar(5000)
@jjgao I think I understand. You would prefer to keep the allele values for long indels inside the mutation and mutation_event tables rather than externalizing them to a separate table.
If we did that, we may run into the maximum record size limit ... see https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html in the paragraph starting with "The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes...". Because UTF8 characters are multi-byte, the total number of characters that can fit in a single record may be closer to 21845 or 16348. The mutation table has several fields that can hold allele strings from the MAF file: REFERENCE_ALLELE, TUMOR_SEQ_ALLELE1, TUMOR_SEQ_ALLELE2, MATCH_NORM_SEQ_ALLELE1, MATCH_NORM_SEQ_ALLELE2, TUMOR_VALIDATION_ALLELE1, TUMOR_VALIDATION_ALLELE2, MATCH_NORM_VALIDATION_ALLELE1, MATCH_NORM_VALIDATION_ALLELE2. If all of these fields held varchar(5000) we would run into the maximum record size problem. Also, we would again have a problem when we saw a deletion of 6000nt in the MAF. @angelicaochoa has done some testing with a field type of "text" for allele strings. I'll test whether text fields have the same maximum record size problem. If not, maybe using the "text" type for these fields and removing the UNIQUE key would be a good compromise.
I have confirmed that text fields avoid the maximum row/record size issue. I was able to create a table with three text fields and put 20000 utf-8 encoded characters into the first two fields and 30000 utf-8 encoded characters into the third field.
When trying to create a table with three varchar fields of length 10000, I got this error: mysql> create table test (f1 varchar(10000), f2 varchar(10000), f3 varchar(10000)) default charset=utf8; ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs
@sheridancbio Thanks for the information and testing it out -- mysql is so complex. I think it's ok to switch to TEXT as long as the performance is not affected.
From they mysql documentation (see https://dev.mysql.com/doc/refman/8.0/en/blob.html) the one performance hit to worry about is when querying a temporary table which contains TEXT or BLOB elements. I think this would be when you do something like "select ... from (select ... from mutation join sample on ... where ...)" The parenthesized sub expression would be the temporary table. I'm guessing that our main persistence-mybatis code does not do this, but I have not yet scanned it. cc: @angelicaochoa @jjgao
As found out in #2478, the
mutation_event
table contains a unique key too large to be accepted by MySQL when using the UTF-8 character set.Since this is the default encoding in the current release of MySQL, I have documented a recommendation to stick to MySQL 5.7 in #4333, along with a hint at why in case users do want to tweak the configuration.
However, I feel that we should adjust the schema to support UTF-8 and MySQL 8.x out of the box.