chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License
37 stars 23 forks source link

X_{suffix} must be validated #590

Closed brianraymor closed 11 months ago

brianraymor commented 1 year ago

Design

Note: Both the cellxgene-schema CLIand the CXG conversion process validate embeddings with some duplication, but their validation requirements are different. All validation must occur in the CLI to fast fail prior to the submission and upload of the dataset.

  1. Embedding validation in CXG conversion
  2. Embedding validation in cellxgene-schema CLI

See schema 4

Note: @jahilton has requested that obsm embeddings that do not start with X also be validated against the CXG requirements in X{suffix} and generate warnings that include the embedding name. This is not a schema requirement because additional embeddings may be present for other scenarios, but an implementation requirement to provide information to the curators to assess the state of embeddings in the dataset. @joyceyan has noted this in #654.

obsm (Embeddings)

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

X_{suffix}

Key X_{suffix} where {suffix} MUST be at least one character long and MUST NOT contain whitespace characters. The {suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive.

See also default_embedding in uns.
Annotator Curator
Value numpy.ndarray with the following requirements


brianraymor commented 1 year ago

Originally, this was a pending epic in #single-cell because it appeared that both DP and Data-Viz would be responsible for changes in the duplicated validation code. Since this is now completely in the domain of Data-Viz, I'm moving this issue to #single-cell-curation with the rest of the CLI validation updates.

metakuni commented 1 year ago

Moving back to In Progress to reflect @joyceyan 's change on Friday.

jahilton commented 1 year ago

Believe the unsigned integer note is unnecessary as (our understanding of this is that) python isn't aware of signed vs unsigned, they'll just be read as int. If this is in correct, then help us understand so we know what to test for.

QA notes (our testing notebook that contains each example)...

  1. obsm key X_ passes validation but {suffix} MUST be at least one character long To recreate, take an existing .h5ad with embeddings (this is a small valid example), read inn, then adata.obsm['X_'] = adata.obsm['X_umap']

  2. obsm key X_ passes validation but {suffix} MUST be at least one character long I wouldn't think whitespace counted as a character, but maybe that's debatable. adata.obsm['X_ '] = adata.obsm['X_umap']

  3. obsm with any np.nan fails validation but it should pass so long as all are not np.nan coord1 = [4] + [np.nan] * (adata.obsm['X_umap'].shape[0] - 1) coord2 = [-54] + [np.nan] * (adata.obsm['X_umap'].shape[0] - 1) adata.obsm['X_umap'] = np.column_stack((coord1, coord2))

  4. obsm with any or all np.nan reports misleading error about infinity values.

    ERROR: adata.obsm['X_umap'] contains positive infinity or negative infinity values.

  5. embeddings with type string does not fail gracefully. adata.obsm['X_umap'] = adata.obsm['X_umap'].astype('str') Error...

Starting validation... Traceback (most recent call last): File "/opt/anaconda3/envs/cxg4testing/bin/cellxgene-schema", line 8, in sys.exit(schema_cli()) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 760, in invoke return __callback(args, **kwargs) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/cli.py", line 47, in schema_validate isvalid, , _ = validate(h5ad_file, add_labels_file, ignore_labels=ignore_labels, verbose=verbose) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1220, in validate validator.validate_adata(h5ad_path) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1174, in validate_adata self._deep_check() File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1135, in _deep_check self._validate_embedding_dict() File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 833, in _validate_embedding_dict if np.isinf(value).any() or np.isnan(value).any(): TypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

jahilton commented 1 year ago

Confirmed that the behavior noted above now has been corrected.

Still the lingering question about whether unsigned integer should be referenced in the schema and validator. @atarashansky

And these 2 new questions arose during this round of QA...

  1. The h5ad passes validation if one obsm key does not start with X_ as long as one obsm key does We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys of X_harmony & umap to fail validation.

  2. The h5ad does not pass validation if one obsm key contains whitespaces (eg X_ U M A P) This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?

brianraymor commented 1 year ago

@signechambers1 and @atarashansky - can you respond to @jahilton 's questions? I'm happy to make any subsequent edits on your behalf.

atarashansky commented 1 year ago

Confirmed that the behavior noted above now has been corrected.

Still the lingering question about whether unsigned integer should be referenced in the schema and validator. @atarashansky

And these 2 new questions arose during this round of QA...

  1. The h5ad passes validation if one obsm key does not start with X_ as long as one obsm key does We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys of X_harmony & umap to fail validation.
  2. The h5ad does not pass validation if one obsm key contains whitespaces (eg X_ U M A P) This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?

Unsigned integer does not exist in python.

  1. That is also my expectation.
  2. I don't think white spaces should ever be allowed. We could explicitly encode that in the schema and validator. Alternatively, we could have the validator replace whitespaces with underscores and note this behavior in the schema. I think the latter would be a nice gesture to automatically fix cases where submitters reasonably label an embedding as "X_embedding one".
brianraymor commented 1 year ago

Alternatively, we could have the validator replace whitespaces with underscores and note this behavior in the schema. I think the latter would be a nice gesture to automatically fix cases where submitters reasonably label an embedding as "X_embedding one".

I disagree. It's an edge case that should simply fail. If necessary, the curators can have a conversation with the submitter about renames. We try very hard not to make the test matrix more complicated especially for edge cases.

brianraymor commented 1 year ago

RE unsigned integer, Bruce and I landed on positive integers in another context:

All non-zero values MUST be positive integers stored as numpy.float32.

atarashansky commented 1 year ago

RE unsigned integer, Bruce and I landed on positive integers in another context:

All non-zero values MUST be positive integers stored as numpy.float32.

Why are we constraining OBSM embeddings to only contain positive integers?

brianraymor commented 1 year ago

RE white space - it's a slippery slope. I don't know what is allowed by the underlying data types, but often you need to worry about special characters too if you want to be extremely precise.

And I was avoiding BNF in the schema a'la - https://www.cl.cam.ac.uk/~jac22/books/www/book/node166.html

brianraymor commented 1 year ago

Why are we constraining OBSM embeddings to only contain positive integers?

Your original draft suggested unsigned integers (?), so I was responding to that point.

atarashansky commented 1 year ago

Not sure which original draft you're referring to. I think

MUST have a data type of float, integer, or unsigned integer of any precision (8, 16, 32, or 64 bits)

can be

MUST have a data type of float or integer of any precision (8, 16, 32, or 64 bits)

brianraymor commented 1 year ago

@jahilton - did you successfully test 64-bit values?

I ask because the R language does not natively support 64 bit integers. Embeddings are included in the seurat encoding.

jahilton commented 1 year ago

The valid.h5ad that we use as a base for testing has 2 embeddings - one is int64, the other is float64. The first test in the QA notebook tests this file without alteration. It passes validation.

brianraymor commented 1 year ago

And is that an existing test file that has previously been successfully ingested (in an earlier incarnation as a 3.1.0 dataset) and completed RDS conversion? I'm a little skittish about RDS conversions these days.

jahilton commented 1 year ago

RDS conversion is not part of our standard QA But I did just submit valid.h5ad to dev & Seurat conversion was seemingly successful

brianraymor commented 1 year ago

Scratches head. OK then, thanks for humoring me. I downloaded the RDS and will take a look later based on my extremely limited knowledge of Seurat.

brianraymor commented 1 year ago

I was reviewing the CXG specification: [my emphasis]

All TileDB arrays MUST have a uint32 domain, zero-based. All X counts and embedding coordinates SHOULD be coerced to float32, which is ample precision for visualization purposes, and MUST be a numeric type.

and emb group and embedding arrays

A CXG MUST have a group named emb, which will contain all embeddings and MUST have at least one embedding array. Embeddings are encoded as TileDB arrays, of numeric type and shape (nobs, >=2). The arrays MUST be a numeric type, and SHOULD be coerced to float32. Inf and NaN values MAY be used, but the semantics for these values is undefined. The TileDB array name will be assumed to be the embedding name (conventionally, embedding names in CXG are not prefixed with an X as they are in AnnData).

I don't quite understand the subtlety of the "SHOULD be coerced to float32".

Should embedding values in the schema simply be numpy.float32? CC: @bkmartinjr @atolopko-czi

Further, I believe that the requirements must be clear that the data types are numpy data types and not python data types, because the data types are not equivalent:

Overflow Errors

The behaviour of NumPy and Python integer types differs significantly for integer overflows and may confuse users expecting NumPy integers to behave similar to Python’s int. Unlike NumPy, the size of Python’s int is flexible. This means Python integers may expand to accommodate any integer and will not overflow.

Extended Precision

Python’s floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to np.float64. In some unusual situations it may be useful to use floating-point numbers with more precision.

atolopko-czi commented 12 months ago

~Since CXG spec is private and for Explorer's use only, it should be fine to update spec to match what is currently in use (code). "MUST be coerced to Numpy float32" seems reasonable.~ [This is the case for X matrix, not the obs embedding]

An earlier statement in the spec provides some justification: "All X counts and embedding coordinates SHOULD be coerced to float32, which is ample precision for visualization purposes". Perhaps the intention was to downcast from float64 to 32 (but not less than 32), to be space-efficient without incurring unwanted loss of precision. It doesn't appear the CXG transformation logic does this, at any rate--it just passes through the same obsm dtype as the source AnnData obsm.

atarashansky commented 12 months ago

Confirmed that the behavior noted above now has been corrected.

Still the lingering question about whether unsigned integer should be referenced in the schema and validator. @atarashansky

And these 2 new questions arose during this round of QA...

  1. The h5ad passes validation if one obsm key does not start with X_ as long as one obsm key does We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys of X_harmony & umap to fail validation.
  2. The h5ad does not pass validation if one obsm key contains whitespaces (eg X_ U M A P) This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?

Actually, I was thinking more about point (1) - could the input AnnData's include embeddings that are not meant for visualization? If so, then we may want to use "X" prefix as the convention to distinguish between embeddings that will be displayed in explorer from embeddings that are not meant for visualization. If that's the case, then validation should pass as long as there is at least one "X" embedding.

@brianraymor @jahilton What do you think?

brianraymor commented 12 months ago

Actually, I was thinking more about point (1) - could the input AnnData's include embeddings that are not meant for visualization?

Yes.

If so, then we may want to use "X_" prefix as the convention to distinguish between embeddings that will be displayed in explorer from embeddings that are not meant for visualization.

I think that the original point of the prefix. Perhaps the schema should be more clear on that point.

If that's the case, then validation should pass as long as there is at least one "X_" embedding.

What if there are multiple "X_" embeddings and one or more fails validation?

I'm also more generally concerned about all embeddings because all are included in the Seurat conversion.

brianraymor commented 12 months ago

RE the whitespace issue, I'd rewrite to:

X_{suffix} where {suffix} MUST be at least one character long and MUST NOT contain whitespace characters.

atarashansky commented 12 months ago

What if there are multiple "X_" embeddings and one or more fails validation?

Right - validation should pass for all "X_" embeddings and explorer-related criteria should not apply to embeddings without the prefix

brianraymor commented 12 months ago

Perhaps we could separate out requirements for generic embeddings vs CXG (X_) embeddings.

  1. Is the shape of embeddings correct?
  2. Are the values for embeddings correct?

I would be curious if anndata already enforces 1 and 2 on .write or when set. It would be nice if we didn't need to revalidate. @pablo-gar - do you know if anndata performs any validation of obsm embeddings?

Then CXG validator would only need to enforce:

  1. There must be at least one X_{suffix} embedding in obsm.
  2. And the {suffix} for all X_ embeddings must not contain whitespace characters.
brianraymor commented 12 months ago

For example - https://github.com/scverse/anndata/blob/102f35a193915a1b3ca50448c397d103b21571cf/anndata/_core/anndata.py#L1894

MUST have the same number of rows as X is enforced by anndata. If I delete a row in a valid embedding and attempt to set:

adata.obsm['X_umap'] = bad_embedding

then an exception is thrown:

ValueError: Value passed for key 'X_umap' is of incorrect shape. Values of obsm must match dimensions (0,) of parent. Value had shape (1999,) while it should have had (2000,).

There's still a question of whether MUST include at least two columns is enforced by anndata.

pablo-gar commented 12 months ago

@pablo-gar - do you know if anndata performs any validation of obsm embeddings?

Unfortunately I don't

brianraymor commented 12 months ago

RE MUST include at least two columns

Hmm - anndata doesn't seem to mind if I delete a column from a valid embedding and assign an ndarray of shape (obs_n, 1).

pablo-gar commented 12 months ago

Under anndata that's a valid embedding, so I think that's expected. Valid embeddings can be (obs_n, X) where X is any number greater than 0. If anything I would expect anndata to validate the number or rows.

brianraymor commented 12 months ago

It does validate rows per my previous comment.

jahilton commented 12 months ago

could the input AnnData's include embeddings that are not meant for visualization?

I have never heard of anyone request this option. I would say let's not leave that door open unless we have reason. Do we have some use case in mind?

bkmartinjr commented 12 months ago

A couple of thoughts (Brian asked me to read the thread).

From this issue, I can't tell if the goal is to:

With that confusion so declared....

Background:

Which is to say, you can get lots of random stuff in there - essentially any n-D observation of length n_obs (of any type). The obsm is not just for low-dimensionality visualization.

Explorer is slightly pickier, and the last I looked used the following rules:

So Explorer will cherry-pick what it can display and ignore the rest.

My take-aways:

brianraymor commented 12 months ago

From this issue, I can't tell if the goal is to:

@bkmartinjr - My comments from https://github.com/chanzuckerberg/single-cell-curation/issues/544

See #single-cell-data-wrangling and #single-cell-eng.

During CXG conversion, validation for embeddings silently failed allowing the ingestion pipeline to complete successfully. When the curator explored the dataset, an "Unexpected HTTP error" was shown because a 400 was returned by the internal cellxgene REST API due to missing embeddings in the CXG.

Subsequent code review also revealed that both the cellxgene-schema CLIand the CXG conversion process validate embeddings with some duplication, but their validation requirements are different.

  1. Embedding validation in CXG conversion
  2. Embedding validation in cellxgene-schema CLI

Please review and document the complete set of requirements in the schema.

bkmartinjr commented 12 months ago

Ah! That clears things up. Would I be correct to state the goal as: ensure that all datasets have at least one Explorer-viewable embedding?

If so, IMHO the most practical path is:

pablo-gar commented 12 months ago

could the input AnnData's include embeddings that are not meant for visualization?

I have never heard of anyone request this option. I would say let's not leave that door open unless we have reason. Do we have some use case in mind?

Interesting you haven't heard that from contributors yet. Integrated embeddings whose dimensions haven't been reduced will almost always be > 2 columns and are not meant for visualization, think NMF factors, latent spaces, or even the actual PCA matrix.

I think we should still allow for this specially if they are not using the "X_" prefix

brianraymor commented 12 months ago

Would I be correct to state the goal as: ensure that all datasets have at least one Explorer-viewable embedding?

That would be one requirement. But if there are more than one Explorer-viewable embeddings, then all must pass or validation fails.

I also want to ensure that non X_ embeddings do not cause RDS conversations to fail in sceasy because RDS conversion is the most brittle part of ingestion.

Once https://github.com/chanzuckerberg/single-cell/issues/420 is addressed, we'll have a fast way to see what embeddings are currently present in the corpus.

brianraymor commented 12 months ago

Anything that is named X_foo and of type ndarray float/int/uint will be displayed to the user as a foo embedding. See code here. Other stuff is silently ignored.

Reading:

is_valid = is_valid and isinstance(embedding_array, np.ndarray) and embedding_array.dtype.kind in "fiu"

Comprehension.

The "fiu" is the source of Alec's original draft - "The embedding array dtype must be a float, integer, or unsigned integer ..." which resulted in a bit of confusion.

code kind
i signed integer
u unsigned integer
f floating-point

so, the requirement could be rewritten to:

bkmartinjr commented 12 months ago

But if there are more than one Explorer-viewable embeddings, then all must pass or validation fails.

Depending on how I parse this, it might mean:

Given that Explorer already applies its own criteria at run-time, and ignores anything which doesn't pass, anything stronger than "at least one embedding must pass" runs the risk of failing a legitimate dataset.

brianraymor commented 12 months ago

Given that Explorer already applies its own criteria at run-time, and ignores anything which doesn't pass, anything stronger than "at least one embedding must pass" runs the risk of failing a legitimate dataset.

The purpose of this work is to move validation into the CLI and out of the CXG conversion where possible.

I don't understand how that would be a legitimate dataset if:

  1. X embeddings are defined for presentation in Explorer. There may be multiple X embeddings for example - spatial cases.
  2. If one X_ embedding is valid and another is not, then it's not a legitimate dataset. The submitter/curator will wonder why all their embeddings are not being presented in Explorer. We want to fast fail so problems can be corrected.
bkmartinjr commented 12 months ago

X_ embeddings are defined for presentation in Explorer

I think this may be part of the confusion. X_ embeddings are a ScanPy convention, which Explorer adopted because it was a better UX for user's who use both tools. This means that in theory, you could have a dataset with an X_ embedding that Explorer can't fully display.

In practice this hasn't been a major concern, as Explorer protects itself from this failure and AFAIK most (all?) ScanPy-generated embeddings are viewable. That is likely changing with spatial and new assays.

I am guessing that the drive for the CLI enhancement is to catch cases where we expect the embedding to display, but it does not? If so, we probably need some other means to explicitly notate either:

I don't think we can get away with mandating that all X_ embeddings must be viewable in Explorer....

brianraymor commented 12 months ago

Thanks for the background, Bruce.

Choices could be:

  1. Add a notation to indicate when a specific X_ is for Explorer presentation and ignore others
  2. Or Warn when there are multiple X_ embeddings and some fail Explorer requirements so the curator/submitter can correct as needed

CC: @jahilton @signechambers1

brianraymor commented 12 months ago

This means that in theory, you could have a dataset with an X_ embedding that Explorer can't fully display.

@bkmartinjr - BTW, was this behavior previously documented somewhere (other than the code)?

bkmartinjr commented 12 months ago

was this behavior previously documented somewhere

No idea. Previously there was not such a strong linkage between Explorer behavior and schema definition, so I would guess that the code was the truth.

jahilton commented 12 months ago

Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?

bkmartinjr commented 12 months ago

Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?

Mechanically that should be quite easy to add given that the current pass/fail is well-defined in the Explorer code base. I leave the "policy" decision to Signe/Brian.

brianraymor commented 12 months ago

Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?

It's possible. not X_ would never be presented in Explorer so I'm unclear about the ask.

And could have more generic requirements per Bruce's pointer to the union:

obsm ndarray | Mapping | None (default: None) Key-indexed multi-dimensional observations annotation of length #observations. If passing a ndarray, it needs to have a structured datatype.

signechambers1 commented 12 months ago

I like @jahilton's suggestion. From the product perspective, I'd like any embeddings the author's want to be available in Explorer to be available and also I do not want to not create extra work for curators or submitters.

brianraymor commented 12 months ago

I like @jahilton's suggestion. From the product perspective, I'd like any embeddings the author's want to be available in Explorer to be available and also I do not want to not create extra work for curators or submitters.

I'm not sure that I understand Jason's proposal or your agreement. Apologies.

Currently, an author indicates which embeddings they want in Explorer by using X_ ( a convention reused from scanpy). This may fail because Explorer has additional undocumented requirements per Bruce's context.

And there may be other embeddings that do not use X and are currently ignored in the schema because X was supposed to be a presentation hint.

Is the proposal:

  1. Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer.
  2. Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Present successful X_ and not _X embeddings in Explorer (which would mean that we would need to deal with collisions like X_pca and pca).
jahilton commented 12 months ago

I am seeing the X_ format as just another requirement for an embedding to be used in Explorer. Not sure what makes it a 'presentation hint' vs a 'requirement'.

If an embedding is present with a key that does not start with X_, I would find it useful for the validator to surface that so that I can be aware and inform the contributor that it won't be accessible in the Explorer unless something is changed.

signechambers1 commented 12 months ago

I don't quite understand @jahilton's comment but @brianraymor's proposal "1. Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer." is consistent with my thinking.

jahilton commented 12 months ago

Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer." is consistent with my thinking.

+1