Closed nayib-jose-gloria closed 11 months ago
What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.
Have other options been considered...
(<Ensembl ID>)
to the duplicate cases)It's odd that we've never hit this case before?
@atarashansky @dsadgat - can you respond to Jason's questions above?
@brianraymor @jahilton
What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.
This decision predates me. But looking at the explorer code, it seems that if the index isn't unique, then the gene fetch might fail as the backend could get a multiple-column array (if a gene with duplicate names was selected) whereas it expects 1 column (corresponding to 1 gene). I can't tell just based on inspection whether or not this will actually break the response.
What I can say, though, is that if there are duplicate values in the index, the behavior will always be ambiguous. If a user selects "GENE_A" and there are three such genes, which expressions is the application meant to show the user?
@brianraymor @jahilton
What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.
This decision predates me. But looking at the explorer code, it seems that if the index isn't unique, then the gene fetch might fail as the backend could get a multiple-column array (if a gene with duplicate names was selected) whereas it expects 1 column (corresponding to 1 gene). I can't tell just based on inspection whether or not this will actually break the response.
What I can say, though, is that if there are duplicate values in the index, the behavior will always be ambiguous. If a user selects "GENE_A" and there are three such genes, which expressions is the application meant to show the user?
H5ADs have Ensembl ID as the index (which should be unique) and CXG conversion changes the index to feature name and drops Ensembl ID to a column. Do you know why that choice was made? If it was for human readability, perhaps we could offer both on the frontend and query on the Ensembl ID (i.e. "Gene Symbol [Ensembl ID]")
It was for human readability. I think that @maniarathi may be familiar with that design. Also see comment.
Human readability is correct -- it powers the autocomplete feature for Quick Search Genes and is the index that is used to get gene information for Gene Sets and Differential Expression. feature_id
is not used anywhere in the Explorer tool so is dropped in the CXG conversion. There wasn't any native support for converting the index of a dataframe to a TileDB Array at the time (may have changed today) so we needed to add explicit support to convert an index to an additional column of the dataframe while keeping the constraint that Dataframe indexes have which are that it is unique.
Reviving this thread, since it seems that we may want to determine an approach that DOES allow for duplicate feature names without invoking the CXG conversion failure, close this ticket, and open a new one for the agreed implementation. I believe we want to include this as part of 4.0.0 schema validation work, but let me know if that's not the case.
Also re-tagged as data-viz since it concerns cxg conversion.
Origin failure case that inspired this ticket: https://czi-sci.slack.com/archives/C024HCSH9PT/p1690415168797619
@signechambers1 @brianraymor
From the product perspective, I'd like to maintain the human readability of Gene Name in the front end of explorer. @dsadgat I think we need dev to figure out the right approach, can we include this in the schema 4.0 data viz work?
After some discussion in this thread, I'm going to close out this ticket since it seems we almost certainly don't want to go down the path of having the CLI validator fail. I've opened another ticket to reflect our current understanding of the problem.
Example dataset that should fail validation:
Causes CXG conversion Failure if var.feature_names are not unique (see comments for discussion).
We should update the CLI validator to raise this issue, and make it clear to the user the cause of the failure.