andygxzeng / AMLHierarchies

Analysis notebooks and data for AML hierarchies paper

StandardScaler and sc.pp.scale #5

Closed zhongzheng1999 closed 1 year ago

zhongzheng1999 commented 1 year ago

Hello, I am writing to you as I have been reading your article on the SAM algorithm and have come across a procedural question that I was hoping you could help clarify.

In your article, you first performed data normalization using do_scran_normalization() and then selected four samples to run the SAM algorithm.

do_scran_normalization("AML916", dir, "GSM3587988_AML916-D0.dem.txt", "GSM3587989_AML916-D0.anno.txt", 5)
AMLcat = sc.read_h5ad(AMLcat_dir+'AMLCat_HSCProg_Full_Hierarchy.h5ad')
sam916 = sc.external.tl.sam(AML916[AML916.obs.CellType.isin(['HSC-like', 'Prog-like'])], standardization="StandardScaler", inplace=True)

However, what has left me puzzled is the process of integrating all the samples. It appears that after initially scaling the data with sc.pp.scale(), you subsequently applied the project_feature_weights() function, which, as far as I understand, scales the data once again. This redundancy in scaling has raised some questions for me.

AML1012 = load_sample_scran_normalized(AML_dem_anno_dir + "AML1012_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML1012_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML210A = load_sample_scran_normalized(AML_dem_anno_dir + "AML210A_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML210A_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML329 = load_sample_scran_normalized(AML_dem_anno_dir + "AML329_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML329_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML419A = load_sample_scran_normalized(AML_dem_anno_dir + "AML419A_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML419A_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML420B = load_sample_scran_normalized(AML_dem_anno_dir + "AML420B_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML420B_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML475 = load_sample_scran_normalized(AML_dem_anno_dir + "AML475_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML475_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML870 = load_sample_scran_normalized(AML_dem_anno_dir + "AML870_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML870_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML556 = load_sample_scran_normalized(AML_dem_anno_dir + "AML556_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML556_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
sc.pp.scale(AML328)
sc.pp.scale(AML916)
sc.pp.scale(AML921A)
sc.pp.scale(AML1012)
sc.pp.scale(AML210A)
sc.pp.scale(AML329)
sc.pp.scale(AML419A)
sc.pp.scale(AML420B)
sc.pp.scale(AML475)
sc.pp.scale(AML870)
sc.pp.scale(AML707B)
sc.pp.scale(AML556)
AMLcat_scaled = ad.AnnData.concatenate(AML916, AML328, AML921A, AML1012, AML329, AML419A, AML420B, AML210A,
                                        AML475, AML870, AML707B, AML556,
                                        index_unique=None,
                                        batch_categories=['AML916', 'AML328', 'AML921A', 'AML1012', 'AML329', 'AML419A', 'AML420B', 'AML210A',
                                                          'AML475', 'AML870', 'AML707B', 'AML556'])
project_feature_weights(LSC_3cluster, combined_weights, preprocessing = 'StandardScaler', bbknn = False)
sc.tl.leiden(LSC_3cluster, resolution=0.15, random_state = 42)

The project_feature_weights function includes:

if preprocessing == "StandardScaler":
        X = StandardScaler(with_mean=True).fit_transform(dat)
    elif preprocessing == "Normalizer":
        X = Normalizer().fit_transform(dat)
    else:
        X = dat

After X = StandardScaler(with_mean=True).fit_transform(dat) is executed, X contains the standardized (centered and unit-variance) version of the data in dat. This seems to do the same thing as sc.pp.scale. Doesn't that mean you are standardizing the same normalized data twice?
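For example, on a synthetic matrix (not your data, just to illustrate what I mean), the two calls produce essentially the same result:

import numpy as np
import scanpy as sc
import anndata as ad
from sklearn.preprocessing import StandardScaler

# synthetic expression matrix standing in for a normalized sample
X = np.random.default_rng(0).lognormal(size=(300, 40)).astype(np.float32)

adata = ad.AnnData(X.copy())
sc.pp.scale(adata)                                    # z-scores each gene in place

Z = StandardScaler(with_mean=True).fit_transform(X)   # what project_feature_weights does

# any small differences come from implementation details (e.g. the variance estimator used)
print(np.abs(adata.X - Z).max())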

I would greatly appreciate it if you could provide some insights or clarification regarding this process. Understanding the rationale behind this approach would be immensely beneficial for me as I continue to explore and learn from your work.

Thank you in advance for your time and assistance. I eagerly await your response and look forward to furthering my understanding of your research.

Sincerely,

andygxzeng commented 1 year ago

Hi Zhongzheng

Thanks for your note, and thanks for catching this. The key intention was to standardize gene expression counts within LSPCs and across donors. I can't remember whether there was a specific reason for the two rounds of scaling or whether it was unintentional; however, the consequence of standardizing twice should be somewhat minimal. In any case, I would not recommend following this project_feature_weights approach if you are looking to apply SAM to your own data.
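To illustrate why the impact should be small: if each donor's block already has per-gene mean 0 and variance 1 after sc.pp.scale, the pooled matrix does too, so running StandardScaler over the concatenated cells changes very little. A rough sketch on synthetic data (not from the notebook):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# three "donors" with different baseline expression, each scaled on its own
blocks = [StandardScaler().fit_transform(rng.normal(loc=shift, size=(200, 50)))
          for shift in (0.0, 2.0, -1.0)]

pooled = np.vstack(blocks)                              # concatenated, as in AMLcat_scaled
rescaled = StandardScaler(with_mean=True).fit_transform(pooled)

print(np.abs(rescaled - pooled).max())                  # ~0: the second pass is close to a no-op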

Back in 2019, SAM was unable to handle multiple samples, and this was the workaround I had developed to overcome donor-driven variation, which would otherwise dominate the final result. SAM now has the option to incorporate harmony correction on the nearest-neighbor matrix within each round of the algorithm, prior to calculating the spatial dispersions of genes. This harmony-based correction is more elegant than the approach I used in the Nat Med paper and performs better, and it is what we use for new applications of SAM. Just download the new SAM algorithm and specify your donor as the batch_key.

https://github.com/atarashansky/self-assembling-manifold
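Roughly, the newer workflow looks something like this (a sketch, assuming the samalg package from that repository; 'donor' is a placeholder for whichever obs column holds your donor labels):

from samalg import SAM

# adata: your combined AnnData across donors, with donor labels in adata.obs['donor']
sam = SAM(counts=adata)
sam.preprocess_data()          # SAM's own normalization and gene filtering
sam.run(batch_key='donor')     # donor-aware correction within each SAM iteration
sam.adata                      # feature weights, neighbors and embedding end up here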

I hope this helps with your analysis. Let me know if you have any other questions.

Best wishes, Andy