IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

Ability to do some common expression normalizations in the dataset curators #664

Open JPReceveur opened 2 months ago

JPReceveur commented 2 months ago

One curator feature that would be useful to add is the ability to perform common gene expression normalizations in the dataset curators. Currently we depend on users doing a normalization/transformation prior to upload, which leads to a number of issues (e.g. comparing datasets that use different normalizations, using already-normalized data in the single-cell workbench, or a user simply forgetting to normalize).

While we can't keep up with every normalization procedure, it would be nice to be able to apply some simple ones on a per-display basis (e.g. CPM, normalizing expression within a cell to sum to one) to better show datasets side by side.
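The two simple normalizations mentioned above (CPM and normalizing a cell's expression to sum to one) are really the same operation with different targets. A minimal sketch in plain Python (function name and signature are illustrative, not gEAR code):

```python
def normalize_counts(counts, target_sum=1e6):
    """Scale one cell's raw counts so they sum to target_sum.

    target_sum=1e6 gives counts per million (CPM);
    target_sum=1 normalizes expression within the cell to one.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero rather than dividing by zero
        return [0.0] * len(counts)
    return [c * target_sum / total for c in counts]
```

For example, `normalize_counts([2, 3, 5], target_sum=1)` yields `[0.2, 0.3, 0.5]`.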

adkinsrs commented 2 months ago

Just to have it on paper: the reason we ask users to specify whether their data is raw or normalized before uploading is to ensure we can do things like this. Many of the early datasets do not have this information, so we have to treat them all as "raw" (except the hardcoded datasets).

Seems to me this "normalizing" makes more sense in the expression/projection panels rather than in the curator. Instead of having the user save a curation under a certain normalization configuration, we could provide a normalize checkbox along with a raw/log2/log10 option, and when rendering displays, normalize based on the dataset's math setting from upload. @JPReceveur @jorvis, thoughts?
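The render-time idea above could be sketched as a simple dispatch on the dataset's stored math setting; the names and the log(x+1) convention here are assumptions for illustration, not gEAR internals:

```python
import math

# Hypothetical dispatch table keyed by the dataset's stored "math" setting.
# Using log(x + 1) so zero counts remain defined.
TRANSFORMS = {
    "raw": lambda x: x,
    "log2": lambda x: math.log2(x + 1),
    "log10": lambda x: math.log10(x + 1),
}

def transform_values(values, math_setting):
    """Apply the dataset's math setting at render time."""
    fn = TRANSFORMS[math_setting]
    return [fn(v) for v in values]
```

The advantage of dispatching at render time is that the stored data stays untouched and a single dataset can be shown under any of the supported settings.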

Alternatively, @JPReceveur, are you suggesting that we offer options such as normalizing counts per cell in a single dataset before plotting? It seems this ticket may call for a hybrid solution.

JPReceveur commented 2 months ago

Yeah, I was thinking more of the second one (a lot of the issues around the first point go away once we explicitly ask people about any transformations or normalizations done prior to upload). With transformations you can do some math to get back to the initial values, but that's not always true for normalizations: given an upload of normalized data, we wouldn't always be able to recover the raw data, even knowing the method used.

From a use-case perspective, I was mostly thinking about a way for someone to collect datasets with initially different transformations (or none, in the case of raw data) into one collection and then view the same normalization and transformation across the whole dataset collection.

adkinsrs commented 2 months ago

I'm thinking the easiest way to do this would be to implement some of the scanpy "preprocessing" functions, such as normalizing counts per cell. @JPReceveur, besides that, and maybe scaling to center around a zero mean, are there any others you would be interested in? The "Recipes" section may be good as an all-in-one solution:

https://scanpy.readthedocs.io/en/stable/api.html#module-scanpy.pp

JPReceveur commented 2 months ago

For https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html, if both the sum-to-1 and target_sum=1e6 (i.e. CPM) options are available, that would cover a lot of people's use cases. Maybe see if Carlo/Brian have thoughts on other useful ones? Including the filtering and other options from their preprocessing section might be overkill.
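In scanpy the call for both options is `sc.pp.normalize_total(adata, target_sum=...)` on an AnnData object; the numpy sketch below is only a hedged illustration of what that does to a cells-by-genes count matrix, not the actual scanpy implementation:

```python
import numpy as np

def normalize_total(counts, target_sum):
    """Scale each row (cell) of a cells-by-genes matrix to sum to target_sum.

    Mirrors the effect of sc.pp.normalize_total(adata, target_sum=target_sum)
    for the two options discussed: target_sum=1 and target_sum=1e6 (CPM).
    """
    counts = np.asarray(counts, dtype=float)
    per_cell = counts.sum(axis=1, keepdims=True)
    per_cell[per_cell == 0] = 1.0  # leave empty cells as all zeros
    return counts / per_cell * target_sum

raw = [[2, 3, 5],
       [10, 0, 0]]
frac = normalize_total(raw, target_sum=1)    # each cell sums to 1
cpm = normalize_total(raw, target_sum=1e6)   # counts per million
```

Exposing just the `target_sum` choice in the UI would keep the feature small while covering both use cases named above.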

adkinsrs commented 2 months ago

At the gEAR meeting, @jorvis suggested normalization/transformation options might be a better fit for the dataset explorer. If a user is applying a normalization to one curation, there is a good chance they will apply it to all curations.

adkinsrs commented 2 months ago

Adding @jorvis as an assignee, as this may be implemented in the dataset explorer.