brianraymor commented 2 years ago

Design

`uns` (Dataset Metadata)

uns is a ordered dictionary with a str key. Curators MAY annotate the following keys and values in uns:

{category}_colors

Key

{category}_colors where {category} MUST be the name of a category data type column in obs that
is annotated by the data submitter or curator. The following columns that are annotated by CELLxGENE
Discover MUST NOT be specified as {category}:

assay
cell_type
development_stage
disease
organism
self_reported_ethnicity
sex
tissue

Annotator Curator

Value

numpy.ndarray. This MUST be a 1-D array of shape (, c), where c is greater than or equal to the
number of unique categories in the {category} column as calculated by:

len(anndata.obs.{category}.unique())

The color code at the Nth position in the ndarray corresponds to the Nth category of anndata.obs.{category}.unique().

For example, if cell_type_ontology_term_id includes two unique categories:

anndata.obs.cell_type_ontology_term_id.unique()

['CL:0000057', 'CL:0000115'] Categories (2, object): ['CL:0000057', 'CL:0000115']

then cell-type_ontology_term_id_colors MUST contain two or more colors such as:

['aqua' 'blueviolet']

where 'aqua' is the color assigned to 'CL:0000057' and 'blueviolet' is the color assigned to
'CL:0000115'.

All elements in the ndarray MUST use the same color model, limited to:

Color Model	Element Format
Named Colors	`str`. MUST be a case-insensitive CSS4 color name with no spaces such as `"aliceblue"`
Hex Triplet	`str`. MUST start with `"#"` immediately followed by six case-insensitive hexadecimal characters as in `"#08c0ff"`

Context

_colors was not included in schema 2.0.0 because it was perceived as an under documented feature of cellxgene desktop and was not present in most submissions in the portal at the time. Also see category colors and RFC: User-defined colors.

There was no validation performed by cellxgene-schema CLI 1.1.0 or in the subsequent cellxgene conversion so support was brittle. The current CXG conversion code limits validation to color format:

convert_anndata_category_colors_to_cxg_category_colors

Recently, there was a CXG conversion failure during data ingestion reported in #single-cell-data-wrangling due to the following causes:

at least one color table existed for a non-existent obs column
at least one color table existed for a non-categorical obs column

_colors should either be blocked+fail or documented+validated in the future.

There are more testing details documented in the Category Colors section of Prepare Data;

To test that you've done this properly, check that for your given category the number of colors match the number of category values and that the second command below results in a mapping from categories to colors.

>>> len(adata.obs[category].cat.categories) == len(adata.uns[f"{category}_colors"])

True

>>> dict(zip(adata.obs[category].cat.categories, adata.uns[f"{category}_colors"]))
{'CD4 T cells': '#1f77b4', 'CD14+ Monocytes': '#ff7f0e', 'B cells': '#2ca02c', 'CD8 T cells': '#d62728', 'NK cells': '#9467bd', 'FCGR3A+ Monocytes': '#8c564b', 'Dendritic cells': '#e377c2', 'Megakaryocytes': '#bcbd22'}

brianraymor commented 2 years ago

Boom!

jahilton commented 2 years ago

Caught another issue. uns[sex_colors] is a list of 2 values, but there are 3 unique values in obs[sex]. The dataset in the video also has uns[ever_smoker_colors] less than the length of obs[ever_smoker].unique() Trying to color by sex or ever_smoker results in failures.

https://user-images.githubusercontent.com/15098808/187527103-ff36488b-05a4-487d-b324-3e566bf9810d.mov

seve commented 1 year ago

~We found another instance of this issue with cell_type within this dataset https://cellxgene.cziscience.com/e/3bbb6cf9-72b9-41be-b568-656de6eb18b5.cxg/~

[since revised] See #single-cell-data-wrangling.

brianraymor commented 1 year ago

@signechambers1 to document the full set of Explorer requirements by Q2 which is required for inclusion in schema 4.0.

signechambers1 commented 1 year ago

@brianraymor here are the H5AD criteria for _colors functionality for explorer:

adata.obs["{category}"] MUST be of type pandas.Categorical
adata.uns["{category}_colors"] MUST be a list of hex-code colors corresponding to each category in adata.obs["{category"].cat.categories
If any uns color entry corresponds to a non-existent category, validation MUST fast-fail
If any uns color entry corresponds to a category that is improperly formatted (i.e. not a pandas.Categorical or does not have the right number of labels), validation MUST fast-fail
len(adata.uns["{category}_colors"]) >= len(adata.obs["{category}"].unique())

Thank you @atarashansky for the help. @brianraymor let us know if you have suggestions or questions.

brianraymor commented 1 year ago

CC: @jahilton @jychien - can you please review based on your experience in debugging _colors?

There is no explanation for how _colors is surfaced in Explorer. You might compare with embeddings which needs an update to the screenshot due to the new Explorer layout.
The new requirements do not match the earlier requirements in schema 1.1.0 - which were also not enforced. For example, the 1.1.0 version supported multiple formats for the values. 1.1.0 also specified "for any other categorical integer metadata field".

Here's the relevant text which was also linked in the top-level summary comment for this issue as a reference:

Presentation Hints

The metadata fields below are optional. They aren't needed for integration, and cellxgene can display the data fine without them, but if they are included cellxgene will do something with them. This allows submitters to fine-tune how their datasets are presented, which is a common request.

Field name

Description

color_map

Submitters can include a field called "{field}_colors" for any other categorical integer metadata field. The value must be an array of one of fourcolor specifications:

CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
RGB tuple/list with values ranging from 0.0 to 1.0, as in [0.5, 0.75, 1.0]
RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
Hex triplet string, as in "#08c0ff"and each string must be a hex color code.

The color code at the nth position in the array corresponds to category n in the metadata field.

jahilton commented 1 year ago

There is definitely a length requirement where you need to have as many colors as you have categories (comment above)... I believe len(adata.uns["{category}_colors"]) >= len(adata.obs["{category}"].unique())

If any uns color entry corresponds to a non-existent category, validation MUST fast-fail

Currently, this case is ignored & passes validation seemingly without any downstream issues. We do have a block of qa code that checks for this as 'best practice' so I don't have any concern enforcing it in the validator, but just wanted to note the change in behavior. (clean-up of existing datasets would be incredibly straightforward during a migration)

jychien commented 1 year ago

Scanpy has documentation for palette for it's plot function, where it points to matplotlib colors as being valid. Not sure what is meant by the wording of the 1.1.0 schema, but my understanding is also that the corresponding column in obs need to be Categorical.
Looks like '(category)_colors' can be an array or list, but unclear if it is implied that the schema will be requiring it to be one and not the other.

signechambers1 commented 1 year ago

Thank you all! Added the length requirement to my original comment.

@atarashansky can you determine if multiple color formats will work, whether an array or list is expected by explorer, and if colors work for categorical and categorical integer fields?

brianraymor commented 1 year ago

If possible, please write requirements in text and not in code fragments.

I called this out earlier. Isn't this a missing constraint?

Submitters can include a field called "{field}_colors" for any other categorical integer metadata field.

signechambers1 commented 1 year ago

Updated requirements in plain text with outstanding questions:

Fields for which colors are defined MUST be categorical fields or categorical integer fields (@atarshansky to confirm categorical integer fields are supported)
Colors MUST NOT be defined for categorical fields that do not exist
The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)
Colors MUST be a list of hex-code colors (@atarshansky to determine if other formats work)
Open question: Should the schema require colors to be a list or array? (@atarshansky could you look into this too please?)

atarashansky commented 1 year ago

Fields for which colors are defined MUST be categorical fields or categorical integer fields (@atarshansky to confirm categorical integer fields are supported)

Categorical integer fields are supported so long as they are pandas Categoricals.

Colors MUST NOT be defined for categorical fields that do not exist

I think colors actually can be defined for categorical fields that do not exist - those colors will just be skipped and not added to the cxg group metadata.

The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)

If there are more colors, it will not error. The extra colors will be ignored.

Colors MUST be a list of hex-code colors (@atarshansky to determine if other formats work)

Ah, thank you for correcting me @brianraymor. Here are the actual color requirements. As you indicated, this is validated and will raise an exception so we're good on this front:

The function accepts for the following formats:
- A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
- RGB tuple/list with values ranging from 0.0 to 1.0, as in [0.5, 0.75, 1.0]
- RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
- Hex triplet string, as in "#08c0ff"

Open question: Should the schema require colors to be a list or array? (@atarshansky could you look into this too please?)

Either numpy array or python lists work.

brianraymor commented 1 year ago

A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html

@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin 3.1.0?

brianraymor commented 1 year ago

@atarashansky - should the shape of the _colors ndarray be derived from:

adata.obs['category'].cat.categories style is referenced in CXG validation.
adata.obs['category'].unique() style is also referenced in this issue based on cellxgene guidance.

@bkmartinjr kindly educated me on the differences, because I was too lazy to RTFM this evening:

cat.categories - returns all possible values, even if they are not present unique() - returns the used values

I don't have a horse in this race, but want to document only one approach where the schema matches the implementation. Recommendation?

CC:@jahilton @signechambers1

atarashansky commented 1 year ago

A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html

@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin 3.1.0?

I don't even see matplotlib as a dependency in explorer.

brianraymor commented 1 year ago

I don't even see matplotlib as a dependency in explorer.

By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?

RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]

I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:

The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:

{ color: lime; } / color keyword /
{ color: rgb(0 255 0); } / RGB range 0-255 /
{ color: rgb(0% 100% 0%); } / RGB range 0%-100% /
{ color: color(sRGB 0 1 0); } / sRGB range 0.0-1.0 /

Could we eliminate / RGB range 0-255 / as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:

RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

brianraymor commented 1 year ago

@signechambers1 @atarashansky

RE

The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)

If there are more colors, it will not error. The extra colors will be ignored.

I'm reviewing a live example that defines 48 colors for 53 categoricals.

Is the actual requirement - "There must be at least one color?"

atarashansky commented 1 year ago

Can you link me to this example? If there are 48 colors for 53 categoricals, then that should cause a bug in explorer.

atarashansky commented 1 year ago

I don't even see matplotlib as a dependency in explorer.

By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?

RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]

I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:

The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:

{ color: lime; } / color keyword /

{ color: rgb(0 255 0); } / RGB range 0-255 /

{ color: rgb(0% 100% 0%); } / RGB range 0%-100% /

{ color: color(sRGB 0 1 0); } / sRGB range 0.0-1.0 /

Could we eliminate / RGB range 0-255 / as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:

RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

Looks like we pin Matplotlib to 3.6.3: https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/api_server/requirements.txt#L7

Here, it looks like we always cast to 0-255: https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/common/utils/color_conversion_utils.py#L180

I agree we should change it to match matplotlib specifications/preferences.

atarashansky commented 1 year ago

Is the len of the _colors array based on unique(), cat.categories, or is is just "MUST be at least 1 color"? This is the 48 colors for 53 categoricals case.

The only requirement is that the len of _colors must be greater than or equal to the number of categories (unique() == cat.categories).

Both python lists and numpy arrays are currently allowed which increases the text matrix. I recommend stricter requirements and have only allowed ndarrays.

Why ndarrays as opposed to pure lists? I would prefer lists. ndarray does not offer any additional functionality, here.

Supporting four color formats increases the test matrix. @atarashansky has agreed with the suggestion to eliminate RGB range 0-255 because it does not seem to be a supported format in matplotlib which prefers RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

Correct.

brianraymor commented 1 year ago

@atarashansky

The only requirement is that the len of _colors must be greater than or equal to the number of categories (unique() == cat.categories).

I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.

brianraymor commented 1 year ago

Why ndarrays as opposed to pure lists? I would prefer lists. ndarray does not offer any additional functionality, here.

It will be converted to an ndarray when written. See Saving then reading turns list into numpy.ndarray?.


adata_valid.uns['cell_type_colors'] = ['aqua', 'blueviolet', 'chartreuse']
print(type(adata_valid.uns['cell_type_colors']))

<class 'list'>

adata_valid.write('conversion.h5ad')
adata_valid = sc.read_h5ad('conversion.h5ad')
print(type(adata_valid.uns['cell_type_colors']))

<class 'numpy.ndarray'>

brianraymor commented 1 year ago

Of the three choices:

Confirming the consensus from the #single-cell-data-wrangling:

Detect the case and automate renaming tissue_ontology_term_id_colors to tissue_colors
Detect the case and automate copying tissue_ontology_term_id_colors to tissue_colors (maintains the submitters original choice) 3. Change Explorer to just use the term id(s) which are present in the original submission (labels are applied later by CELLxGENE during ingestion)

Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?

This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies tissue_colors?

atarashansky commented 1 year ago

It will be converted to an ndarray when written. See Saving then reading turns list into numpy.ndarray?.

Gotcha, wasn't aware of this.

I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.

You interpreted correctly, and TIL. Then the requirements should be unique() <= len(colors).

Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?

At conversion time, Explorer will be updated to assign the custom colors to the corresponding label column's categories.

brianraymor commented 1 year ago

RE

This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies tissue_colors?

This must fail then because tissue_colors will not be present during validation. It's similar to specifying any other non-existent obs category.

atarashansky commented 1 year ago

RIght, yes, agreed!

brianraymor commented 1 year ago

Inquiring minds. What is the scenario for allowing the length of _colors to be greater than unique()? Is there an expectation that the submitter may be reusing color arrays across multiple categories some of which may be larger than others or ... ?

CC: @jahilton

atarashansky commented 1 year ago

@brianraymor That would make sense if they just wanted to apply a custom discrete colormap to all categories.

chanzuckerberg / single-cell-curation

do or do not (there is no try) _colors #216

Design

`uns` (Dataset Metadata)

{category}_colors

Context

Presentation Hints

chanzuckerberg / single-cell-curation

do or do not (there is no try) _colors #216

Design

uns (Dataset Metadata)

{category}_colors

Context

Presentation Hints

`uns` (Dataset Metadata)