Closed brianraymor closed 11 months ago
Originally, this was a pending epic in #single-cell because it appeared that both DP and Data-Viz would be responsible for changes in the duplicated validation code. Since this is now completely in the domain of Data-Viz, I'm moving this issue to #single-cell-curation with the rest of the CLI validation updates.
Moving back to In Progress
to reflect @joyceyan 's change on Friday.
Believe the unsigned integer
note is unnecessary as (our understanding of this is that) python isn't aware of signed vs unsigned, they'll just be read as int. If this is in correct, then help us understand so we know what to test for.
QA notes (our testing notebook that contains each example)...
obsm key X_
passes validation but {suffix} MUST be at least one character long
To recreate, take an existing .h5ad with embeddings (this is a small valid example), read inn, then adata.obsm['X_'] = adata.obsm['X_umap']
obsm key X_
passes validation but {suffix} MUST be at least one character long
I wouldn't think whitespace counted as a character, but maybe that's debatable.
adata.obsm['X_ '] = adata.obsm['X_umap']
obsm with any np.nan fails validation but it should pass so long as all are not np.nan
coord1 = [4] + [np.nan] * (adata.obsm['X_umap'].shape[0] - 1)
coord2 = [-54] + [np.nan] * (adata.obsm['X_umap'].shape[0] - 1)
adata.obsm['X_umap'] = np.column_stack((coord1, coord2))
obsm with any or all np.nan reports misleading error about infinity values.
ERROR: adata.obsm['X_umap'] contains positive infinity or negative infinity values.
embeddings with type string does not fail gracefully.
adata.obsm['X_umap'] = adata.obsm['X_umap'].astype('str')
Error...
Starting validation... Traceback (most recent call last): File "/opt/anaconda3/envs/cxg4testing/bin/cellxgene-schema", line 8, in
sys.exit(schema_cli()) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/click/core.py", line 760, in invoke return __callback(args, **kwargs) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/cli.py", line 47, in schema_validate isvalid, , _ = validate(h5ad_file, add_labels_file, ignore_labels=ignore_labels, verbose=verbose) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1220, in validate validator.validate_adata(h5ad_path) File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1174, in validate_adata self._deep_check() File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 1135, in _deep_check self._validate_embedding_dict() File "/opt/anaconda3/envs/cxg4testing/lib/python3.9/site-packages/cellxgene_schema/validate.py", line 833, in _validate_embedding_dict if np.isinf(value).any() or np.isnan(value).any(): TypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Confirmed that the behavior noted above now has been corrected.
Still the lingering question about whether unsigned integer
should be referenced in the schema and validator. @atarashansky
And these 2 new questions arose during this round of QA...
The h5ad passes validation if one obsm key does not start with X_
as long as one obsm key does
We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys of X_harmony
& umap
to fail validation.
The h5ad does not pass validation if one obsm key contains whitespaces (eg X_ U M A P
)
This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?
@signechambers1 and @atarashansky - can you respond to @jahilton 's questions? I'm happy to make any subsequent edits on your behalf.
Confirmed that the behavior noted above now has been corrected.
Still the lingering question about whether
unsigned integer
should be referenced in the schema and validator. @atarashanskyAnd these 2 new questions arose during this round of QA...
- The h5ad passes validation if one obsm key does not start with
X_
as long as one obsm key does We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys ofX_harmony
&umap
to fail validation.- The h5ad does not pass validation if one obsm key contains whitespaces (eg
X_ U M A P
) This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?
Unsigned integer does not exist in python.
Alternatively, we could have the validator replace whitespaces with underscores and note this behavior in the schema. I think the latter would be a nice gesture to automatically fix cases where submitters reasonably label an embedding as "X_embedding one".
I disagree. It's an edge case that should simply fail. If necessary, the curators can have a conversation with the submitter about renames. We try very hard not to make the test matrix more complicated especially for edge cases.
RE unsigned integer
, Bruce and I landed on positive integers
in another context:
All non-zero values MUST be positive integers stored as numpy.float32.
RE
unsigned integer
, Bruce and I landed onpositive integers
in another context:
All non-zero values MUST be positive integers stored as numpy.float32.
Why are we constraining OBSM embeddings to only contain positive integers?
RE white space - it's a slippery slope. I don't know what is allowed by the underlying data types, but often you need to worry about special characters too if you want to be extremely precise.
And I was avoiding BNF in the schema a'la - https://www.cl.cam.ac.uk/~jac22/books/www/book/node166.html
Why are we constraining OBSM embeddings to only contain positive integers?
Your original draft suggested unsigned integers (?), so I was responding to that point.
Not sure which original draft you're referring to. I think
MUST have a data type of float, integer, or unsigned integer of any precision (8, 16, 32, or 64 bits)
can be
MUST have a data type of float or integer of any precision (8, 16, 32, or 64 bits)
@jahilton - did you successfully test 64-bit values?
I ask because the R language does not natively support 64 bit integers. Embeddings are included in the seurat encoding.
The valid.h5ad that we use as a base for testing has 2 embeddings - one is int64, the other is float64. The first test in the QA notebook tests this file without alteration. It passes validation.
And is that an existing test file that has previously been successfully ingested (in an earlier incarnation as a 3.1.0 dataset) and completed RDS conversion? I'm a little skittish about RDS conversions these days.
RDS conversion is not part of our standard QA But I did just submit valid.h5ad to dev & Seurat conversion was seemingly successful
Scratches head
. OK then, thanks for humoring me. I downloaded the RDS and will take a look later based on my extremely limited knowledge of Seurat.
I was reviewing the CXG specification: [my emphasis]
All TileDB arrays MUST have a uint32 domain, zero-based. All X counts and embedding coordinates SHOULD be coerced to float32, which is ample precision for visualization purposes, and MUST be a numeric type.
and emb group and embedding arrays
A CXG MUST have a group named emb, which will contain all embeddings and MUST have at least one embedding array. Embeddings are encoded as TileDB arrays, of numeric type and shape (nobs, >=2). The arrays MUST be a numeric type, and SHOULD be coerced to float32. Inf and NaN values MAY be used, but the semantics for these values is undefined. The TileDB array name will be assumed to be the embedding name (conventionally, embedding names in CXG are not prefixed with an X as they are in AnnData).
I don't quite understand the subtlety of the "SHOULD be coerced to float32".
Should embedding values in the schema simply be numpy.float32
? CC: @bkmartinjr @atolopko-czi
Further, I believe that the requirements must be clear that the data types are numpy data types and not python data types, because the data types are not equivalent:
The behaviour of NumPy and Python integer types differs significantly for integer overflows and may confuse users expecting NumPy integers to behave similar to Python’s int. Unlike NumPy, the size of Python’s int is flexible. This means Python integers may expand to accommodate any integer and will not overflow.
Python’s floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to np.float64. In some unusual situations it may be useful to use floating-point numbers with more precision.
~Since CXG spec is private and for Explorer's use only, it should be fine to update spec to match what is currently in use (code). "MUST be coerced to Numpy float32" seems reasonable.~ [This is the case for X matrix, not the obs embedding]
An earlier statement in the spec provides some justification: "All X counts and embedding coordinates SHOULD be coerced to float32, which is ample precision for visualization purposes". Perhaps the intention was to downcast from float64 to 32 (but not less than 32), to be space-efficient without incurring unwanted loss of precision. It doesn't appear the CXG transformation logic does this, at any rate--it just passes through the same obsm dtype as the source AnnData obsm.
Confirmed that the behavior noted above now has been corrected.
Still the lingering question about whether
unsigned integer
should be referenced in the schema and validator. @atarashanskyAnd these 2 new questions arose during this round of QA...
- The h5ad passes validation if one obsm key does not start with
X_
as long as one obsm key does We decided that all embeddings must pass validation (discussion), and my interpretation is that the key is part of that validation. So I would expect a file with obsm keys ofX_harmony
&umap
to fail validation.- The h5ad does not pass validation if one obsm key contains whitespaces (eg
X_ U M A P
) This is not specified in the schema. @atarashansky Should the schema be updated to specify this? Or should the validator allow whitespaces?
Actually, I was thinking more about point (1) - could the input AnnData's include embeddings that are not meant for visualization? If so, then we may want to use "X" prefix as the convention to distinguish between embeddings that will be displayed in explorer from embeddings that are not meant for visualization. If that's the case, then validation should pass as long as there is at least one "X" embedding.
@brianraymor @jahilton What do you think?
Actually, I was thinking more about point (1) - could the input AnnData's include embeddings that are not meant for visualization?
Yes.
If so, then we may want to use "X_" prefix as the convention to distinguish between embeddings that will be displayed in explorer from embeddings that are not meant for visualization.
I think that the original point of the prefix. Perhaps the schema should be more clear on that point.
If that's the case, then validation should pass as long as there is at least one "X_" embedding.
What if there are multiple "X_" embeddings and one or more fails validation?
I'm also more generally concerned about all embeddings because all are included in the Seurat conversion.
RE the whitespace issue, I'd rewrite to:
X_{suffix} where {suffix} MUST be at least one character long and MUST NOT contain whitespace characters.
What if there are multiple "X_" embeddings and one or more fails validation?
Right - validation should pass for all "X_" embeddings and explorer-related criteria should not apply to embeddings without the prefix
Perhaps we could separate out requirements for generic embeddings vs CXG (X_) embeddings.
I would be curious if anndata already enforces 1
and 2
on .write or when set. It would be nice if we didn't need to revalidate. @pablo-gar - do you know if anndata performs any validation of obsm embeddings?
Then CXG validator would only need to enforce:
For example - https://github.com/scverse/anndata/blob/102f35a193915a1b3ca50448c397d103b21571cf/anndata/_core/anndata.py#L1894
MUST have the same number of rows as X is enforced by anndata. If I delete a row in a valid embedding and attempt to set:
adata.obsm['X_umap'] = bad_embedding
then an exception is thrown:
ValueError: Value passed for key 'X_umap' is of incorrect shape. Values of obsm must match dimensions (0,) of parent. Value had shape (1999,) while it should have had (2000,).
There's still a question of whether MUST include at least two columns is enforced by anndata.
@pablo-gar - do you know if anndata performs any validation of obsm embeddings?
Unfortunately I don't
RE MUST include at least two columns
Hmm - anndata doesn't seem to mind if I delete a column from a valid embedding and assign an ndarray of shape (obs_n, 1).
Under anndata that's a valid embedding, so I think that's expected. Valid embeddings can be (obs_n, X) where X is any number greater than 0. If anything I would expect anndata to validate the number or rows.
It does validate rows per my previous comment.
could the input AnnData's include embeddings that are not meant for visualization?
I have never heard of anyone request this option. I would say let's not leave that door open unless we have reason. Do we have some use case in mind?
A couple of thoughts (Brian asked me to read the thread).
From this issue, I can't tell if the goal is to:
With that confusion so declared....
Background:
obsm
(see the latest anndata API). They don't even have to be named with a string, etc. There are things that the validator could check here (e.g., not a ndarray struct, but rather a string-keyed mapping).X_
prefixed string (e.g., X_pca
), and store a numpy ndarray as a value. See docs. No guarantees in ScanPy that the value will be a float32.Which is to say, you can get lots of random stuff in there - essentially any n-D observation of length n_obs (of any type). The obsm
is not just for low-dimensionality visualization.
Explorer is slightly pickier, and the last I looked used the following rules:
X_foo
and of type ndarray
float/int/uint will be displayed to the user as a foo
embedding. See code here. Other stuff is silently ignored.So Explorer will cherry-pick what it can display and ignore the rest.
My take-aways:
obsm
, it may be a problem for some use cases, and given Explorer's behavior, no real benefit (AFAIK)X_
. No need for a schema change at all.From this issue, I can't tell if the goal is to:
@bkmartinjr - My comments from https://github.com/chanzuckerberg/single-cell-curation/issues/544
See #single-cell-data-wrangling and #single-cell-eng.
During CXG conversion, validation for embeddings silently failed allowing the ingestion pipeline to complete successfully. When the curator explored the dataset, an "Unexpected HTTP error" was shown because a 400 was returned by the internal cellxgene REST API due to missing embeddings in the CXG.
Subsequent code review also revealed that both the cellxgene-schema CLI
and the CXG conversion process validate embeddings with some duplication, but their validation requirements are different.
Please review and document the complete set of requirements in the schema.
Ah! That clears things up. Would I be correct to state the goal as: ensure that all datasets have at least one Explorer-viewable embedding?
If so, IMHO the most practical path is:
could the input AnnData's include embeddings that are not meant for visualization?
I have never heard of anyone request this option. I would say let's not leave that door open unless we have reason. Do we have some use case in mind?
Interesting you haven't heard that from contributors yet. Integrated embeddings whose dimensions haven't been reduced will almost always be > 2 columns and are not meant for visualization, think NMF factors, latent spaces, or even the actual PCA matrix.
I think we should still allow for this specially if they are not using the "X_" prefix
Would I be correct to state the goal as: ensure that all datasets have at least one Explorer-viewable embedding?
That would be one requirement. But if there are more than one Explorer-viewable embeddings, then all must pass or validation fails.
I also want to ensure that non X_ embeddings do not cause RDS conversations to fail in sceasy because RDS conversion is the most brittle part of ingestion.
Once https://github.com/chanzuckerberg/single-cell/issues/420 is addressed, we'll have a fast way to see what embeddings are currently present in the corpus.
Anything that is named X_foo and of type ndarray float/int/uint will be displayed to the user as a foo embedding. See code here. Other stuff is silently ignored.
Reading:
is_valid = is_valid and isinstance(embedding_array, np.ndarray) and embedding_array.dtype.kind in "fiu"
Comprehension.
The "fiu" is the source of Alec's original draft - "The embedding array dtype must be a float, integer, or unsigned integer ..." which resulted in a bit of confusion.
code | kind |
---|---|
i | signed integer |
u | unsigned integer |
f | floating-point |
so, the requirement could be rewritten to:
numpy.dtype.kind
of "f"
, "i"
, or "u"
.But if there are more than one Explorer-viewable embeddings, then all must pass or validation fails.
Depending on how I parse this, it might mean:
Given that Explorer already applies its own criteria at run-time, and ignores anything which doesn't pass, anything stronger than "at least one embedding must pass" runs the risk of failing a legitimate dataset.
Given that Explorer already applies its own criteria at run-time, and ignores anything which doesn't pass, anything stronger than "at least one embedding must pass" runs the risk of failing a legitimate dataset.
The purpose of this work is to move validation into the CLI and out of the CXG conversion where possible.
I don't understand how that would be a legitimate dataset if:
X_ embeddings are defined for presentation in Explorer
I think this may be part of the confusion. X_
embeddings are a ScanPy convention, which Explorer adopted because it was a better UX for user's who use both tools. This means that in theory, you could have a dataset with an X_ embedding that Explorer can't fully display.
In practice this hasn't been a major concern, as Explorer protects itself from this failure and AFAIK most (all?) ScanPy-generated embeddings are viewable. That is likely changing with spatial and new assays.
I am guessing that the drive for the CLI enhancement is to catch cases where we expect the embedding to display, but it does not? If so, we probably need some other means to explicitly notate either:
I don't think we can get away with mandating that all X_ embeddings must be viewable in Explorer....
Thanks for the background, Bruce.
Choices could be:
CC: @jahilton @signechambers1
This means that in theory, you could have a dataset with an X_ embedding that Explorer can't fully display.
@bkmartinjr - BTW, was this behavior previously documented somewhere (other than the code)?
was this behavior previously documented somewhere
No idea. Previously there was not such a strong linkage between Explorer behavior and schema definition, so I would guess that the code was the truth.
Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?
Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?
Mechanically that should be quite easy to add given that the current pass/fail is well-defined in the Explorer code base. I leave the "policy" decision to Signe/Brian.
Is it an option to warn when any embedding (X_ or not) fail requirements and thus will not be available in Explorer?
It's possible. not X_
would never be presented in Explorer so I'm unclear about the ask.
And could have more generic requirements per Bruce's pointer to the union:
obsm ndarray | Mapping | None (default: None) Key-indexed multi-dimensional observations annotation of length #observations. If passing a ndarray, it needs to have a structured datatype.
I like @jahilton's suggestion. From the product perspective, I'd like any embeddings the author's want to be available in Explorer to be available and also I do not want to not create extra work for curators or submitters.
I like @jahilton's suggestion. From the product perspective, I'd like any embeddings the author's want to be available in Explorer to be available and also I do not want to not create extra work for curators or submitters.
I'm not sure that I understand Jason's proposal or your agreement. Apologies.
Currently, an author indicates which embeddings they want in Explorer by using X_ ( a convention reused from scanpy). This may fail because Explorer has additional undocumented requirements per Bruce's context.
And there may be other embeddings that do not use X and are currently ignored in the schema because X was supposed to be a presentation hint.
Is the proposal:
ndarray
using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer.ndarray
using CXG requirements. Warn for failures. Present successful X_ and not _X embeddings in Explorer (which would mean that we would need to deal with collisions like X_pca and pca). I am seeing the X_ format as just another requirement for an embedding to be used in Explorer. Not sure what makes it a 'presentation hint' vs a 'requirement'.
If an embedding is present with a key that does not start with X_, I would find it useful for the validator to surface that so that I can be aware and inform the contributor that it won't be accessible in the Explorer unless something is changed.
I don't quite understand @jahilton's comment but @brianraymor's proposal "1. Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer." is consistent with my thinking.
Validate ALL embeddings of type ndarray using CXG requirements. Warn for failures. Only present successful X_ embeddings in Explorer." is consistent with my thinking.
+1
Design
Note: Both the
cellxgene-schema CLI
and the CXG conversion process validate embeddings with some duplication, but their validation requirements are different. All validation must occur in the CLI to fast fail prior to the submission and upload of the dataset.cellxgene-schema CLI
See schema 4
Note: @jahilton has requested that obsm embeddings that do not start with X also be validated against the CXG requirements in X{suffix} and generate warnings that include the embedding name. This is not a schema requirement because additional embeddings may be present for other scenarios, but an implementation requirement to provide information to the curators to assess the state of embeddings in the dataset. @joyceyan has noted this in #654.
obsm
(Embeddings)To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as
numpy.ndarrays
inobsm
.X_{suffix}
See also
default_embedding
inuns
.numpy.ndarray
with the following requirementsX
and MUST include at least two columnsnumpy.dtype.kind
of"f"
,"i"
, or "u"
numpy.inf
) or negative infinity (numpy.NINF
) valuesnumpy.nan
) values