cellxgene-schema must validate that values in var.feature_name must be unique

nayib-jose-gloria commented 1 year ago

Example dataset that should fail validation:

https://s3.console.aws.amazon.com/s3/object/kuni-tmp?region=us-west-2&prefix=cxg_revised_FeatureNameDuplication_1cell.h5ad

Causes CXG conversion Failure if var.feature_names are not unique (see comments for discussion).

We should update the CLI validator to raise this issue, and make it clear to the user the cause of the failure.

jahilton commented 1 year ago

What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.

Have other options been considered...

Avoid duplicate feature_names in gene files (e.g. append (<Ensembl ID>) to the duplicate cases)
Adjust Explorer (or whatever the breaking component is) to allow these

brianraymor commented 1 year ago

It's odd that we've never hit this case before?

@atarashansky @dsadgat - can you respond to Jason's questions above?

atarashansky commented 1 year ago

@brianraymor @jahilton

What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.

This decision predates me. But looking at the explorer code, it seems that if the index isn't unique, then the gene fetch might fail as the backend could get a multiple-column array (if a gene with duplicate names was selected) whereas it expects 1 column (corresponding to 1 gene). I can't tell just based on inspection whether or not this will actually break the response.

What I can say, though, is that if there are duplicate values in the index, the behavior will always be ambiguous. If a user selects "GENE_A" and there are three such genes, which expressions is the application meant to show the user?

nayib-jose-gloria commented 1 year ago

@brianraymor @jahilton

What is the reason for feature_names to be unique? One reason we require Ensembl IDs is that we know gene symbols are not unique, so we should have anticipated this.

This decision predates me. But looking at the explorer code, it seems that if the index isn't unique, then the gene fetch might fail as the backend could get a multiple-column array (if a gene with duplicate names was selected) whereas it expects 1 column (corresponding to 1 gene). I can't tell just based on inspection whether or not this will actually break the response.

What I can say, though, is that if there are duplicate values in the index, the behavior will always be ambiguous. If a user selects "GENE_A" and there are three such genes, which expressions is the application meant to show the user?

H5ADs have Ensembl ID as the index (which should be unique) and CXG conversion changes the index to feature name and drops Ensembl ID to a column. Do you know why that choice was made? If it was for human readability, perhaps we could offer both on the frontend and query on the Ensembl ID (i.e. "Gene Symbol [Ensembl ID]")

brianraymor commented 1 year ago

It was for human readability. I think that @maniarathi may be familiar with that design. Also see comment.

maniarathi commented 1 year ago

Human readability is correct -- it powers the autocomplete feature for Quick Search Genes and is the index that is used to get gene information for Gene Sets and Differential Expression. feature_id is not used anywhere in the Explorer tool so is dropped in the CXG conversion. There wasn't any native support for converting the index of a dataframe to a TileDB Array at the time (may have changed today) so we needed to add explicit support to convert an index to an additional column of the dataframe while keeping the constraint that Dataframe indexes have which are that it is unique.

nayib-jose-gloria commented 1 year ago

Reviving this thread, since it seems that we may want to determine an approach that DOES allow for duplicate feature names without invoking the CXG conversion failure, close this ticket, and open a new one for the agreed implementation. I believe we want to include this as part of 4.0.0 schema validation work, but let me know if that's not the case.

Also re-tagged as data-viz since it concerns cxg conversion.

Origin failure case that inspired this ticket: https://czi-sci.slack.com/archives/C024HCSH9PT/p1690415168797619

@signechambers1 @brianraymor

signechambers1 commented 1 year ago

From the product perspective, I'd like to maintain the human readability of Gene Name in the front end of explorer. @dsadgat I think we need dev to figure out the right approach, can we include this in the schema 4.0 data viz work?

joyceyan commented 11 months ago

After some discussion in this thread, I'm going to close out this ticket since it seems we almost certainly don't want to go down the path of having the CLI validator fail. I've opened another ticket to reflect our current understanding of the problem.

chanzuckerberg / single-cell-curation

cellxgene-schema must validate that values in var.feature_name must be unique #584