Question on input to ENVI

soerenab commented 8 months ago

Hi,

I have a question regarding the input to ENVI. In one of your comments on another issue you wrote: "Also, make sure the data is not logged (in the .X), since ENVI expected unlogged counts."

However, following your tutorial and inspecting the data:

import scanpy as sc

# Download the tutorial datasets (notebook shell commands, left commented out)
# !wget https://dp-lab-data-public.s3.amazonaws.com/ENVI/sc_data.h5ad
# !wget https://dp-lab-data-public.s3.amazonaws.com/ENVI/st_data.h5ad
st_data = sc.read_h5ad('st_data.h5ad')
sc_data = sc.read_h5ad('sc_data.h5ad')

I noticed that st_data.X.max() = 247.12617 and sc_data.X.max() = 4360.0, i.e., the sc data seems "raw" while the spatial data seems to have been processed in some way.
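For anyone wanting to repeat this kind of inspection, here is a minimal sketch of a slightly richer check than `.max()` alone. It is only a heuristic: `summarize_counts` is a hypothetical helper, not part of ENVI or scanpy, and the two toy arrays merely mimic the maxima reported above. If `.X` is a SciPy sparse matrix, it would need `.toarray()` first.

```python
import numpy as np

def summarize_counts(X):
    """Return simple diagnostics for a cells x genes expression matrix."""
    X = np.asarray(X, dtype=float)
    return {
        "max": float(X.max()),
        "min": float(X.min()),
        # Raw UMI/transcript counts are whole numbers; scaled data usually is not.
        "all_integer": bool(np.all(np.mod(X, 1) == 0)),
    }

# Toy stand-ins (assumed values, chosen to resemble the maxima in this thread).
raw_like = np.array([[0, 3, 4360], [1, 0, 12]])
scaled_like = np.array([[0.0, 1.7, 247.12617], [0.4, 0.0, 3.2]])

print(summarize_counts(raw_like))
print(summarize_counts(scaled_like))
```

Non-integer values suggest some normalization was applied. A large maximum such as 247 would also be unusual for log-transformed counts, but neither observation is proof of how the data was processed.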

Now I am wondering: how should the sc and sp data be processed before handing them to ENVI?

Thanks a lot!

DoronHav commented 8 months ago

Hello,

The datasets (which come from a public source, https://www.nature.com/articles/s41586-021-03705-x) were processed, but the counts are not in the log domain. Specifically, the counts in the spatial data were normalized by cell size and some batch correction was performed.

We recommend going through all the standard steps of single-cell analysis (library size normalization, doublet detection, etc.) for each dataset, and then passing the processed (but un-logged) data on to ENVI.
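As an illustration of "normalized but un-logged", here is a minimal NumPy sketch of library size normalization with no log transform afterwards. It is one reading of the recommendation above, not ENVI's own preprocessing. The function name, the default of scaling to the median library size, and the toy matrix are all assumptions. In scanpy, `sc.pp.normalize_total` does a comparable per-cell scaling, and the un-logged variant would simply skip `sc.pp.log1p`.

```python
import numpy as np

def normalize_library_size(counts, target_sum=None):
    """Scale each cell (row) so its total equals target_sum; no log is applied."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)       # per-cell library size
    if target_sum is None:
        target_sum = np.median(lib[lib > 0])      # assumed default: median library size
    lib_safe = np.where(lib == 0, 1.0, lib)       # avoid dividing by zero for empty cells
    return counts / lib_safe * target_sum

# Toy matrix: two cells with the same composition but different depth, plus an empty cell.
toy = np.array([[1, 3], [10, 30], [0, 0]])
normalized = normalize_library_size(toy)
print(normalized)
print(normalized.sum(axis=1))
```

After scaling, the two non-empty cells have identical profiles, which is the point of the step: differences in sequencing depth are removed while the values stay in the linear (un-logged) domain.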

soerenab commented 8 months ago

Thanks for the reply. Just to double check: in your comment above you recommend doing library size normalization for each dataset. Yet the dissociated dataset in the tutorial seems to contain raw counts, i.e., the data has not been normalized. So should I normalize only the spatial dataset and not the dissociated one, or does it not matter whether the dissociated dataset has been normalized?

shahrozeabbas commented 7 months ago

Both spatial and single-cell data should be processed (filtered for low-quality data, etc.), but raw counts should be used as input to the VAE. @DoronHav please correct me if I'm wrong, but I assume this is the case.
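To make "filtered but still raw" concrete, here is a minimal sketch that drops low-quality cells and unexpressed genes while leaving the remaining counts untouched. The helper name and both thresholds are hypothetical and would need tuning per dataset. In scanpy, `sc.pp.filter_cells` and `sc.pp.filter_genes` cover similar ground.

```python
import numpy as np

def filter_low_quality(counts, min_counts_per_cell=10, min_cells_per_gene=1):
    """Remove sparse cells and unexpressed genes without rescaling any counts."""
    counts = np.asarray(counts)
    # Keep cells whose total count reaches the (assumed) threshold.
    keep_cells = counts.sum(axis=1) >= min_counts_per_cell
    # Among the kept cells, keep genes detected in enough cells.
    keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= min_cells_per_gene
    return counts[keep_cells][:, keep_genes], keep_cells, keep_genes

# Toy matrix: the second cell has almost no counts, the second gene is never detected.
toy_counts = np.array([[5, 0, 20], [0, 0, 1], [8, 0, 7]])
filtered, kept_cells, kept_genes = filter_low_quality(toy_counts)
print(filtered)
print(kept_cells, kept_genes)
```

The surviving entries are the original integers, so the matrix can still be treated as raw counts downstream.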

dpeerlab / ENVI

Question on input to ENVI #7