ebi-gene-expression-group / scanpy-scripts

Scripts for using scanpy
Apache License 2.0

Store matrix variants #96

Closed by pinin4fjords 3 years ago

pinin4fjords commented 3 years ago

I would like our AnnData workflow to represent the steps of our analysis more fully. This would make it possible to produce our 'bundles' (the file set that's digested to make experiment displays in SCXA) directly from AnnData files (our own and those of external collaborators).

Currently we use adata.raw to retain normalised expression values, which I think is a relic of when the variable genes function would actually slice up .X rather than annotating variable genes as it does now.

The changes in this PR:

So at the end of our workflow we would then have:

Thoughts?

Edit 29/4/21:

I decided that a general solution was in order, so I propose that we give every matrix-changing function the same option to save the input matrix before changing it. This is instead of saving outputs to specified slots: since .X is not optional in many cases, that would probably have involved some duplication of data (in the standard .X and in, e.g., specified layers).

I've therefore added an add_matrix_function() routine, inspired by @nh3's add_plot_function() routine, that allows this additional functionality to be wrapped around every matrix-changing function.
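To illustrate the "save the input matrix before changing it" idea, here is a minimal sketch. It is not the add_matrix_function() code from this PR: `ToyAnnData`, `saves_input_matrix`, `save_raw` and `save_layer` are hypothetical names, and the toy class only stands in for the `.X`, `.raw` and `.layers` slots of a real AnnData object.

```python
import copy
import functools
import math
from dataclasses import dataclass, field


@dataclass
class ToyAnnData:
    """Hypothetical stand-in for AnnData: models only .X, .raw and .layers."""
    X: list
    raw: list = None
    layers: dict = field(default_factory=dict)


def saves_input_matrix(func):
    """Wrap a matrix-changing function with optional save_raw / save_layer arguments.

    Both options are off by default, so existing calls behave as before.
    """
    @functools.wraps(func)
    def wrapper(adata, *args, save_raw=False, save_layer=None, **kwargs):
        # Copy the *input* matrix before the wrapped function overwrites .X.
        if save_raw:
            adata.raw = copy.deepcopy(adata.X)
        if save_layer is not None:
            adata.layers[save_layer] = copy.deepcopy(adata.X)
        return func(adata, *args, **kwargs)
    return wrapper


@saves_input_matrix
def log1p(adata):
    """Example matrix-changing step: replace .X with log(1 + x)."""
    adata.X = [[math.log1p(v) for v in row] for row in adata.X]
    return adata


adata = ToyAnnData(X=[[0.0, 1.0], [3.0, 7.0]])
log1p(adata, save_layer="normalised")  # keep the pre-log matrix in a layer
```

After the call, `adata.layers["normalised"]` holds the matrix as it was before the transformation, `adata.X` holds the transformed values, and `adata.raw` is untouched because `save_raw` was not requested.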

This should be mostly backwards-compatible, since it's new functionality that is not enabled by default. The only exception is the normalise step, where I replaced the save functionality; consequently, normalised data will no longer be placed in .raw by default.

Plot functions should be able to use any content placed in .layers via existing parameterisation.

(oh, and I also set the testing to Scanpy's baked-in test data rather than downloading some)

nh3 commented 3 years ago

It's a good idea to retain the "raw-est" data possible in "raw". Perhaps cpm-normalised-and-log1p-transformed data needs to go to a layer as you'd normally have scaled data in X at the end of the workflow? Also, as a consequence of this change, the plotting functions probably should use that layer by default.
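The layout suggested here can be sketched in plain Python. This is only an illustration under stated assumptions: the `slots` dictionary mimics the `.raw`, `.layers` and `.X` slots, the layer name `lognorm` is made up, and the z-score scaling is a simple stand-in rather than Scanpy's own implementation.

```python
import math
import statistics

# Toy count matrix: 3 cells (rows) x 2 genes (columns).
counts = [[1, 3], [2, 2], [6, 0]]


def cpm_log1p(row):
    """Counts-per-million normalise one cell, then apply log(1 + x)."""
    total = sum(row)
    return [math.log1p(v / total * 1e6) for v in row]


def scale_columns(matrix):
    """Z-score each gene (column) using the sample standard deviation."""
    cols = list(zip(*matrix))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


lognorm = [cpm_log1p(row) for row in counts]

# "Raw-est" data in raw, log-normalised data in a layer, scaled data in X.
slots = {
    "raw": counts,
    "layers": {"lognorm": lognorm},
    "X": scale_columns(lognorm),
}
```

With this arrangement a plotting step would read from the `lognorm` layer rather than from the scaled values in `X`.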

pcm32 commented 3 years ago

I would also add that we should avoid carrying that much data across the workflow unless it is needed, as at every step we are reading and then writing that data, which can be large for large experiments. Maybe we should have a way of leaving certain bits in files after a certain step, proceeding with a leaner AnnData that takes less effort to read, load into memory and write, and then, at the end, adding all the needed unnormalised/unscaled/super-raw matrices to the final AnnData generated as output.

pcm32 commented 3 years ago

For instance, when we distribute clustering, UMAP and tSNE, we are n-plicating (as in duplicating, n times) every AnnData file, once for each distribution point we have.
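A rough sketch of that "park it on disk, reattach at the end" idea follows. It uses a plain dictionary and JSON files instead of AnnData and its file format, and `park_layers` and `restore_layers` are hypothetical helper names, not functions from scanpy-scripts.

```python
import json
import tempfile
from pathlib import Path


def park_layers(adata, directory):
    """Write each layer to its own file and remove it from the in-memory object."""
    parked = {}
    for name in list(adata["layers"]):
        path = Path(directory) / f"{name}.json"
        path.write_text(json.dumps(adata["layers"].pop(name)))
        parked[name] = path
    return parked


def restore_layers(adata, parked):
    """Read the parked matrices back into the final object."""
    for name, path in parked.items():
        adata["layers"][name] = json.loads(path.read_text())


adata = {
    "X": [[0.1, -0.1]],
    "layers": {"counts": [[5, 0]], "lognorm": [[1.8, 0.0]]},
}

with tempfile.TemporaryDirectory() as tmp:
    parked = park_layers(adata, tmp)
    # Intermediate (possibly distributed) steps would read and write only this lean object.
    lean_layers = dict(adata["layers"])
    restore_layers(adata, parked)
```

Only the final object carries the extra matrices; every intermediate copy stays small.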

pinin4fjords commented 3 years ago

@pcm32 fair point, I think we can add some matrix merging to your tool at the end. I'd like to merge this anyway if that's okay though, the options are off by default.

This would actually make things leaner, due to that deactivation of storing the normalised data in .raw.

pcm32 commented 3 years ago

> @pcm32 fair point, I think we can add some matrix merging to your tool at the end. I'd like to merge this anyway if that's okay though, the options are off by default.
>
> This would actually make things leaner, due to that deactivation of storing the normalised data in .raw.

By all means. My workflow comments are a bit misplaced here; they just came to mind when I saw the PR.

pinin4fjords commented 3 years ago

Thanks for feedback and discussion @pcm32