Can't replicate data preprocessing

morphy380 commented 2 years ago

I am confused how the data was transformed between the variable genes matrix and the full matrix provided. It seems the variable genes matrix is normalized + log-transformed but I don't get perfect correlation after this transformation. Could you provide the code for that preprocessing ?

PATH = '../../git/wot/notebooks/data/'
CELL_DAYS_PATH = 'data/cell_days.txt'
FULL_DS_PATH = 'data/ExprMatrix.h5ad'
VAR_DS_PATH = 'data/ExprMatrix.var.genes.h5ad'
FLE_COORDS_PATH ='data/fle_coords.txt'

coord_df = pd.read_csv(PATH+FLE_COORDS_PATH, index_col='id', sep='\t')
days_df = pd.read_csv(PATH+CELL_DAYS_PATH, index_col='id', sep='\t')
mask = [ind in days_df.index for ind in coord_df.index]
adataf = sc.read_h5ad(PATH+FULL_DS_PATH)[mask]
adata = sc.read_h5ad(PATH+VAR_DS_PATH)

df = adata[:1000,:].to_df()
dff = adataf[:1000,:].to_df()
dff_hvg = dff[df.columns]
assert np.all(df.index==dff.index)

dff_norm_hvg = np.log1p(10000*dff_hvg.div(dff_hvg.sum(axis=1),axis=0))
dff_norm = np.log1p(10000*dff.div(dff.sum(axis=1),axis=0))
#correlation is not one
plt.scatter(x=dff_norm['Fam150a'],y=df['Fam150a'],s=1)
#correlation is not one also with hvg
plt.scatter(x=dff_norm_hvg['Fam150a'],y=df['Fam150a'],s=1)

joshua-gould commented 2 years ago

After selecting variable genes, the variable gene expression matrix was renormalized. See https://www.cell.com/cell/fulltext/S0092-8674(19)30039-X#secsectitle0115, Preparation of expression matrices.

morphy380 commented 2 years ago

Thanks for the answer. I did read that section, but still can't replicate it when renormalizing. Here is my code (with adata and adataf as loaded above)

I tried the following:

unnormalized full matrix -> subset variable genes -> normalize -> log-transform
normalize the full matrix -> subset variable genes -> normalize -> log-transform

df = adata[:1000,:].to_df()
dff = adataf[:1000,:].to_df()
assert np.all(df.index==dff.index)

dff_hvg = dff[df.columns]
dff_hvg_norm = 10000*dff_hvg.div(dff_hvg.sum(axis=1),axis=0)
dff_hvg_norm_log = np.log1p(dff_hvg_norm)

dff_norm = 10000*dff.div(dff.sum(axis=1),axis=0)
dff_norm_hvg = dff_norm[df.columns]
dff_norm_hvg_norm = 10000*dff_norm_hvg.div(dff_norm_hvg.sum(axis=1),axis=0)
dff_norm_hvg_norm_log = np.log1p(dff_norm_hvg_norm)

plt.scatter(x=dff_hvg_norm_log['Fam150a'],y=df['Fam150a'],s=1)
plt.scatter(x=dff_norm_hvg_norm_log['Fam150a'],y=df['Fam150a'],s=1)

broadinstitute / wot

Can't replicate data preprocessing #98