JPReceveur opened 2 months ago
Just to have it on paper, the reason we are asking users to specify whether their data is raw or normalized before uploading is to ensure we can do things like this. Many of the early datasets do not have such information, so we have to treat them all as "raw" (except the hardcoded datasets).
Seems to me this "normalizing" makes more sense in the expression/projection panels rather than in the curator. Instead of having the user save a curation under a certain normalization configuration, we provide a normalize checkbox along with raw/log2/log10 option, and when rendering displays, we normalize based on the dataset's math setting from uploading. @JPReceveur @jorvis thoughts?
Alternatively, @JPReceveur are you suggesting that we have some options such as normalizing counts per cell in a single dataset before plotting? Seems that this ticket may be more of a hybrid solution
Yeah, I was thinking more of the second one (a lot of the issues around the first point go away once we explicitly ask people about any transformations or normalizations done prior to upload). With transformations you can do some math to get back to the initial values, but that's not always true for normalizations: given an upload of normalized data, we wouldn't always be able to get back to the raw data even knowing the method they used.
From a use case perspective, mostly I was thinking about a way for someone to collect datasets with initially different transformations (or no transformations in the case of raw) in one collection and have a way to see the same normalization and transformation across the dataset collection.
Thinking the easiest way to do this would be to implement some of the scanpy "preprocessing" functions, such as normalizing counts per cell. @JPReceveur besides that, and maybe scaling to center around zero mean, are there any others you would be interested in? The "Recipes" section may be good as an all-in-one solution
https://scanpy.readthedocs.io/en/stable/api.html#module-scanpy.pp
For https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html, if both the sum-to-1 and the target_sum=1e6 (i.e. CPM) options are there, that would cover a lot of people's use cases. Maybe see if Carlo/Brian have thoughts on other useful ones? Including the filtering and other options from their preprocessing section might be a bit of overkill.
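For reference, the core of what `sc.pp.normalize_total(adata, target_sum=...)` does per cell is simple to sketch. This is a hedged illustration in plain Python (using a list-of-lists counts matrix rather than an AnnData object) of the two options discussed above: `target_sum=1e6` gives CPM, and `target_sum=1` makes each cell's values sum to one.

```python
def normalize_total(counts, target_sum=1e6):
    """Scale each cell (row) so its counts sum to target_sum.

    Illustrative stand-in for scanpy's sc.pp.normalize_total;
    counts is a cells x genes list of lists.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Leave all-zero cells untouched rather than dividing by zero.
            normalized.append(list(cell))
        else:
            normalized.append([c * target_sum / total for c in cell])
    return normalized


counts = [[10, 30, 60], [5, 5, 0]]
cpm = normalize_total(counts, target_sum=1e6)      # counts per million
fractions = normalize_total(counts, target_sum=1)  # each cell sums to 1
```

In the real implementation we'd just call scanpy on the dataset's AnnData object, but this shows why exposing only the `target_sum` choice (1 vs. 1e6) keeps the UI simple.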
At the gEAR meeting, @jorvis suggested normalization/transformation options might fit better in the dataset explorer. If the user is applying a normalization to one curation, there is a good chance they will apply it to all curations.
Adding @jorvis as an assignee as this may be implemented in the dataset explorer
One curator feature that would be useful to add is the ability to do common normalizations of gene expression in the dataset curators. Currently, we're depending on users doing a normalization/transformation prior to upload, which leads to a number of issues (e.g. comparing against datasets that used a different normalization, feeding already-normalized data into the single cell workbench, or a user forgetting to normalize at all).
While we can't keep up with every normalization procedure, it would be nice to support some simple ones on a per-display basis (e.g. CPM, normalizing expression within each cell to sum to one) to better show datasets side by side.