Closed brianraymor closed 1 year ago
Caught another issue. uns[sex_colors] is a list of 2 values, but there are 3 unique values in obs[sex]. The dataset in the video also has uns[ever_smoker_colors] less than the length of obs[ever_smoker].unique() Trying to color by sex or ever_smoker results in failures.
~We found another instance of this issue with cell_type
within this dataset https://cellxgene.cziscience.com/e/3bbb6cf9-72b9-41be-b568-656de6eb18b5.cxg/~
[since revised] See #single-cell-data-wrangling.
@signechambers1 to document the full set of Explorer requirements by Q2 which is required for inclusion in schema 4.0.
@brianraymor here are the H5AD criteria for _colors functionality for explorer:
Thank you @atarashansky for the help. @brianraymor let us know if you have suggestions or questions.
CC: @jahilton @jychien - can you please review based on your experience in debugging _colors?
Here's the relevant text which was also linked in the top-level summary comment for this issue as a reference:
The metadata fields below are optional. They aren't needed for integration, and cellxgene can display the data fine without them, but if they are included cellxgene will do something with them. This allows submitters to fine-tune how their datasets are presented, which is a common request.
Field name | Description |
color_map | Submitters can include a field called "{field}_colors" for any other categorical integer metadata field. The value must be an array of one of
fourcolor specifications:
|
There is definitely a length requirement where you need to have as many colors as you have categories (comment above)... I believe len(adata.uns["{category}_colors"]) >= len(adata.obs["{category}"].unique())
If any uns color entry corresponds to a non-existent category, validation MUST fast-fail
Currently, this case is ignored & passes validation seemingly without any downstream issues. We do have a block of qa code that checks for this as 'best practice' so I don't have any concern enforcing it in the validator, but just wanted to note the change in behavior. (clean-up of existing datasets would be incredibly straightforward during a migration)
Thank you all! Added the length requirement to my original comment.
@atarashansky can you determine if multiple color formats will work, whether an array or list is expected by explorer, and if colors work for categorical and categorical integer fields?
If possible, please write requirements in text and not in code fragments.
I called this out earlier. Isn't this a missing constraint?
Submitters can include a field called "{field}_colors" for any other categorical integer metadata field.
Updated requirements in plain text with outstanding questions:
Fields for which colors are defined MUST be categorical fields or categorical integer fields (@atarshansky to confirm categorical integer fields are supported)
Categorical integer fields are supported so long as they are pandas Categoricals.
Colors MUST NOT be defined for categorical fields that do not exist
I think colors actually can be defined for categorical fields that do not exist - those colors will just be skipped and not added to the cxg group metadata.
The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)
If there are more colors, it will not error. The extra colors will be ignored.
Colors MUST be a list of hex-code colors (@atarshansky to determine if other formats work)
Ah, thank you for correcting me @brianraymor. Here are the actual color requirements. As you indicated, this is validated and will raise an exception so we're good on this front:
The function accepts for the following formats:
- A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
- RGB tuple/list with values ranging from 0.0 to 1.0, as in [0.5, 0.75, 1.0]
- RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
- Hex triplet string, as in "#08c0ff"
Open question: Should the schema require colors to be a list or array? (@atarshansky could you look into this too please?)
Either numpy array or python lists work.
- A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin 3.1.0
?
@atarashansky - should the shape of the _colors ndarray be derived from:
adata.obs['category'].cat.categories
style is referenced in CXG validation. adata.obs['category'].unique()
style is also referenced in this issue based on cellxgene guidance.@bkmartinjr kindly educated me on the differences, because I was too lazy to RTFM this evening:
cat.categories - returns all possible values, even if they are not present unique() - returns the used values
I don't have a horse in this race, but want to document only one approach where the schema matches the implementation. Recommendation?
CC:@jahilton @signechambers1
- A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin
3.1.0
?
I don't even see matplotlib as a dependency in explorer.
I don't even see matplotlib as a dependency in explorer.
By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?
- RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:
The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:
Could we eliminate / RGB range 0-255 / as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:
RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].
@signechambers1 @atarashansky
RE
The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)
If there are more colors, it will not error. The extra colors will be ignored.
I'm reviewing a live example that defines 48 colors for 53 categoricals.
Is the actual requirement - "There must be at least one color?"
Can you link me to this example? If there are 48 colors for 53 categoricals, then that should cause a bug in explorer.
I don't even see matplotlib as a dependency in explorer.
By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?
- RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:
The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:
- { color: lime; } / color keyword /
- { color: rgb(0 255 0); } / RGB range 0-255 /
- { color: rgb(0% 100% 0%); } / RGB range 0%-100% /
- { color: color(sRGB 0 1 0); } / sRGB range 0.0-1.0 /
Could we eliminate / RGB range 0-255 / as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:
RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].
Looks like we pin Matplotlib to 3.6.3: https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/api_server/requirements.txt#L7
Here, it looks like we always cast to 0-255: https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/common/utils/color_conversion_utils.py#L180
I agree we should change it to match matplotlib specifications/preferences.
- Is the len of the _colors array based on unique(), cat.categories, or is is just "MUST be at least 1 color"? This is the 48 colors for 53 categoricals case.
The only requirement is that the len of _colors must be greater than or equal to the number of categories (unique() == cat.categories
).
- Both python lists and numpy arrays are currently allowed which increases the text matrix. I recommend stricter requirements and have only allowed ndarrays.
Why ndarrays
as opposed to pure lists? I would prefer lists. ndarray
does not offer any additional functionality, here.
- Supporting four color formats increases the test matrix. @atarashansky has agreed with the suggestion to eliminate RGB range 0-255 because it does not seem to be a supported format in matplotlib which prefers RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].
Correct.
@atarashansky
The only requirement is that the len of _colors must be greater than or equal to the number of categories (
unique() == cat.categories
).
I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.
Why ndarrays as opposed to pure lists? I would prefer lists. ndarray does not offer any additional functionality, here.
It will be converted to an ndarray
when written. See Saving then reading turns list into numpy.ndarray?.
adata_valid.uns['cell_type_colors'] = ['aqua', 'blueviolet', 'chartreuse']
print(type(adata_valid.uns['cell_type_colors']))
<class 'list'>
adata_valid.write('conversion.h5ad')
adata_valid = sc.read_h5ad('conversion.h5ad')
print(type(adata_valid.uns['cell_type_colors']))
<class 'numpy.ndarray'>
Of the three choices:
Confirming the consensus from the #single-cell-data-wrangling:
Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?
This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies tissue_colors
?
It will be converted to an ndarray when written. See Saving then reading turns list into numpy.ndarray?.
Gotcha, wasn't aware of this.
I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.
You interpreted correctly, and TIL. Then the requirements should be unique() <= len(colors)
.
Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?
At conversion time, Explorer will be updated to assign the custom colors to the corresponding label column's categories.
RE
This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies
tissue_colors
?
This must fail then because tissue_colors
will not be present during validation. It's similar to specifying any other non-existent obs category.
RIght, yes, agreed!
Inquiring minds. What is the scenario for allowing the length of _colors
to be greater than unique()? Is there an expectation that the submitter may be reusing color arrays across multiple categories some of which may be larger than others or ... ?
CC: @jahilton
@brianraymor That would make sense if they just wanted to apply a custom discrete colormap to all categories.
Design
uns
(Dataset Metadata)uns
is a ordered dictionary with astr
key. Curators MAY annotate the following keys and values inuns
:{category}_colors
category
data type column inobs
thatis annotated by the data submitter or curator. The following columns that are annotated by CELLxGENE
Discover MUST NOT be specified as {category}:
numpy.ndarray
. This MUST be a 1-D array of shape(, c)
, wherec
is greater than or equal to thenumber of unique categories in the {category} column as calculated by:
len(anndata.obs.{category}.unique())
The color code at the Nth position in the
ndarray
corresponds to the Nth category of anndata.obs.{category}.unique().For example, if
cell_type_ontology_term_id
includes two unique categories:anndata.obs.cell_type_ontology_term_id.unique()
['CL:0000057', 'CL:0000115']
Categories (2, object): ['CL:0000057', 'CL:0000115']
then
cell-type_ontology_term_id_colors
MUST contain two or more colors such as:['aqua' 'blueviolet']
where
'aqua'
is the color assigned to'CL:0000057'
and'blueviolet'
is the color assigned to'CL:0000115'
.All elements in the
ndarray
MUST use the same color model, limited to:str
. MUST be a case-insensitive CSS4 color name with no spaces such as"aliceblue"
str
. MUST start with"#"
immediately followed by six case-insensitive hexadecimalcharacters as in
"#08c0ff"
Context
_colors
was not included in schema 2.0.0 because it was perceived as an under documented feature of cellxgene desktop and was not present in most submissions in the portal at the time. Also see category colors and RFC: User-defined colors.There was no validation performed by
cellxgene-schema CLI 1.1.0
or in the subsequent cellxgene conversion so support was brittle. The current CXG conversion code limits validation to color format:convert_anndata_category_colors_to_cxg_category_colors
Recently, there was a CXG conversion failure during data ingestion reported in #single-cell-data-wrangling due to the following causes:
_colors
should either be blocked+fail or documented+validated in the future.There are more testing details documented in the Category Colors section of Prepare Data;
To test that you've done this properly, check that for your given category the number of colors match the number of category values and that the second command below results in a mapping from categories to colors.