Investigate Seurat Failure in Schema Dev Migration

nayib-jose-gloria commented 1 year ago

See link. Determine root cause, and if it is triggered because of the migration.

Collection Affected: https://cellxgene.cziscience.com/collections/b1a879f6-5638-48d3-8f64-f6592c1b1561 Dataset: HSC/immune cells (all hematopoietic-derived cells)

Error is:

"Seurat conversion failed: b'' b\"Loading required package: reticulate\\nLoading required package: devtools\\nLoading required package: usethis\\nLoading required namespace: Seurat\\nThe legacy packages maptools, rgdal, and rgeos, underpinning the sp package,\\nwhich was just loaded, will retire in October 2023.\\nPlease refer to R-spatial evolution reports for details, especially\\nhttps://r-spatial.org/r/2023/05/15/evolution4.html.\\nIt may be desirable to make the sf package available;\\npackage maintainers should consider adding sf to Suggests:.\\nThe sp package is now running under evolution status 2\\n (status 2 uses the sf package in place of rgdal)\\nrgeos version: 0.6-4, (SVN revision 699)\\n GEOS runtime version: 3.8.0-CAPI-1.13.1 \\n Please note that rgeos will be retired during October 2023,\\nplan transition to sf or terra functions using GEOS at your earliest convenience.\\nSee https://r-spatial.org/r/2023/05/15/evolution4.html for details.\\n Linking to sp version: 2.0-0 \\n Polygon checking: TRUE \\n\\nError in h(simpleError(msg, call)) : \\n error in evaluating the argument 'x' in selecting a method for function 't': invalid class \\xe2\\x80\\x9cdgCMatrix\\xe2\\x80\\x9d object: slot i is not *strictly* increasing inside a column\\nCalls: ... initialize -> callNextMethod -> .nextMethod -> validObject\\nExecution halted\\n\""

nayib-jose-gloria commented 1 year ago

Lead: same dataset, even using its pre-migrated form, fails on Seurat conversion using cellxgene-schema 3.0.2 but succeeds with an rdev upload using cellxgene-schema 3.0.1 to write-labels. This tracks with fact that it succeeded to upload in March 2023, and 3.0.2 was released in April

Bento007 commented 1 year ago

From @pablo-gar

For the latest Seurat failure that Trent asked me to look at I found the problem.

First, the fix

import anndata as ad

adata = ad.read(adata_file)

if not adata.X.has_canonical:
   adata.X.sum_duplicates()

if not adata.raw.X.has_canonical:
   adata.raw.X.sum_duplicates()

adata.write(adata_fixed_file, compresssion = "gzip")

adata_fixed_filecan now be converted to Seurat.

Explanation

This was very hard to find (took me close to 2 hrs). Sometime during the migration when you guys are writing the new files, the scipy sparse matrices seem to change format in how they are stored. This is one oddity of Python that doesn’t follow the general csc/csr standard (maybe for some efficiency increases :man-shrugging::skin-tone-4:), but R does follow the general standard and it rightly errors out when encountering this oddity. The oddity is that Python csr/csc matrices allow for repeated indices and if doing so it flags a matrix with has_canonical_format == False. So at some point in the migration, matrices went from canonical format to non-canonical format, see this link for an explanation of the differences: https://docs.scipy.org/doc/scipy/tutorial/sparse.html#canonical-formats

Suggested solutions

Option A: During the label addition phase of the validator, it should convert sparse matrices to canonical if they are not (as shown in the snippet above)
Option B: Fail validation if sparse matrices are not in canonical format and ask curators to fix it. I’d recommend A over B.

If you want to be good citizens of the world you should open an issue in reticulate describing the issue and asking to support the odd non-canonical Python format (which would be easy to do for them) in R. If this was supported by reticulate you guys would have never encountered this problem, and I’m sure other may/will encounter this problem.

Bento007 commented 1 year ago

Updating an rdev environment with a cellxgene-schema-cli branch updated with option A as described above. If the seurat conversion succeeds, we can declare victory.

pablo-gar commented 1 year ago

Small errata on the code snippet I provided, it should be adata.X.has_canonical_format and adata.raw.X.has_canonical_format

Also this fix defeats a bit of the recent work to avoid loading the full object to memory -- although in this case the computation is not too expensive when it is loaded in memory. There may be ways to make this matrices "canonical" without loading full object in memory.

Bento007 commented 1 year ago

The root cause is still because the labeled dataset is not stored in canonical form. This was cause by changes introduced in 3.0.2. Those changes are reverts to 3.0.1 here ->70cb7f5.. It's not clear why this fix the problem. At no point in adding labels is the dataset check for canonical form or explicitly converted to canonical form. The unlabeled dataset that highlighted the problem was not in canonical form, but became canonical after adding labels and writing the file to a new h5ad.

My best guess at this point is that X and raw.X were made canonical during the write. In 3.0.2 we switch to a view when adding labels to the anndata, this was to reduce the memory requirements. This meant that properties that were not accessed did not need to read into memory, for example the X, and X.raw values. I suspect that when those fields are read into memory and written back out to a new file, anndata does some optimizing of the data such as converting to canonical.

Bento007 commented 1 year ago

As an experiment the failing h5ad was validated and labeled using cellxgene_schema_cli 3.0.1 and 3.0.2. The size of the output file from 3.0.1 was significantly smaller than 3.0.2. This mean some additional optimization of the data occurred in 3.0.1, that did not happen with 3.0.2.

One signifcant difference between the two version is in 3.0.1 the h5ad is read into memory before applying labels. In 3.0.2 the h5ad is read in backed mode only some properties are read into memory.

nayib-jose-gloria commented 1 year ago

Fixed in https://github.com/chanzuckerberg/single-cell-curation/pull/598

Bento007 commented 1 year ago

After running some tests, it is clear that reading the annData into memory and writing it back out in compressed form to a file does make X and raw.X canonical. This is verified by reading the file back out into memory and checking for canonical.

Bento007 commented 1 year ago

reopening since the problem happened again with the same dataset

chanzuckerberg / single-cell