andygxzeng / AMLHierarchies

Analysis notebooks and data for AML hierarchies paper

StandardScaler and sc.pp.scale #5

Closed zhongzheng1999 closed 1 year ago

zhongzheng1999 commented 1 year ago

Hello, I am writing to you as I have been reading your article on the SAM algorithm and have come across a procedural question that I was hoping you could help clarify.

In your article, you first performed data normalization using do_scran_normalization() and then selected four samples to run the SAM algorithm.

do_scran_normalization("AML916", dir, "GSM3587988_AML916-D0.dem.txt", "GSM3587989_AML916-D0.anno.txt", 5)
AMLcat = sc.read_h5ad(AMLcat_dir+'AMLCat_HSCProg_Full_Hierarchy.h5ad')
sam916 = sc.external.tl.sam(AML916[AML916.obs.CellType.isin(['HSC-like', 'Prog-like'])], standardization="StandardScaler", inplace=True)

However, what has left me puzzled is the process of integrating all the samples. It appears that after initially scaling the data with sc.pp.scale(), you subsequently applied the project_feature_weights() function, which, as far as I understand, scales the data once again. This redundancy in scaling has raised some questions for me.

AML1012 = load_sample_scran_normalized(AML_dem_anno_dir + "AML1012_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML1012_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML210A = load_sample_scran_normalized(AML_dem_anno_dir + "AML210A_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML210A_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML329 = load_sample_scran_normalized(AML_dem_anno_dir + "AML329_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML329_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML419A = load_sample_scran_normalized(AML_dem_anno_dir + "AML419A_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML419A_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML420B = load_sample_scran_normalized(AML_dem_anno_dir + "AML420B_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML420B_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML475 = load_sample_scran_normalized(AML_dem_anno_dir + "AML475_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML475_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML870 = load_sample_scran_normalized(AML_dem_anno_dir + "AML870_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML870_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
AML556 = load_sample_scran_normalized(AML_dem_anno_dir + "AML556_AML_hierarchy_normalized_gem.csv", AML_dem_anno_dir + "AML556_AML_hierarchy_normalized_anno.csv", scale = False, min_cells = 0)
sc.pp.scale(AML328)
sc.pp.scale(AML916)
sc.pp.scale(AML921A)
sc.pp.scale(AML1012)
sc.pp.scale(AML210A)
sc.pp.scale(AML329)
sc.pp.scale(AML419A)
sc.pp.scale(AML420B)
sc.pp.scale(AML475)
sc.pp.scale(AML870)
sc.pp.scale(AML707B)
sc.pp.scale(AML556)
AMLcat_scaled = ad.AnnData.concatenate(AML916, AML328, AML921A, AML1012, AML329, AML419A, AML420B, AML210A,
                                        AML475, AML870, AML707B, AML556,
                                        index_unique=None,
                                        batch_categories=['AML916', 'AML328', 'AML921A', 'AML1012', 'AML329', 'AML419A', 'AML420B', 'AML210A',
                                                          'AML475', 'AML870', 'AML707B', 'AML556'])
project_feature_weights(LSC_3cluster, combined_weights, preprocessing = 'StandardScaler', bbknn = False)
sc.tl.leiden(LSC_3cluster, resolution=0.15, random_state = 42)

The project_feature_weights function includes:

if preprocessing == "StandardScaler":
        X = StandardScaler(with_mean=True).fit_transform(dat)
    elif preprocessing == "Normalizer":
        X = Normalizer().fit_transform(dat)
    else:
        X = dat

After X = StandardScaler(with_mean=True).fit_transform(dat) is executed, X contains the standardized (centered and unit-variance) version of the data in dat. This seems to do the same thing as sc.pp.scale. Doesn't that mean you are standardizing the same normalized data twice?
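For example, on a synthetic matrix (not your data, just to illustrate what I mean), the two calls produce essentially the same result:

import numpy as np
import scanpy as sc
import anndata as ad
from sklearn.preprocessing import StandardScaler

# synthetic expression matrix standing in for a normalized sample
X = np.random.default_rng(0).lognormal(size=(300, 40)).astype(np.float32)

adata = ad.AnnData(X.copy())
sc.pp.scale(adata)                                    # z-scores each gene in place

Z = StandardScaler(with_mean=True).fit_transform(X)   # what project_feature_weights does

# any small differences come from implementation details (e.g. the variance estimator used)
print(np.abs(adata.X - Z).max())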

I would greatly appreciate it if you could provide some insights or clarification regarding this process. Understanding the rationale behind this approach would be immensely beneficial for me as I continue to explore and learn from your work.

Thank you in advance for your time and assistance. I eagerly await your response and look forward to furthering my understanding of your research.

Sincerely,

andygxzeng commented 1 year ago

Hi Zhongzheng

Thanks for your note, and thanks for catching this. The key intention was to standardize gene expression counts within LSPCs and across donors. I can't remember whether there was a specific reason for the two rounds of scaling or whether it was unintentional; however, the consequence of standardizing twice should be somewhat minimal. In any case, I would not recommend following this project_feature_weights approach if you are looking to apply SAM to your own data.
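To illustrate why the impact should be small: if each donor's block already has per-gene mean 0 and variance 1 after sc.pp.scale, the pooled matrix does too, so running StandardScaler over the concatenated cells changes very little. A rough sketch on synthetic data (not from the notebook):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# three "donors" with different baseline expression, each scaled on its own
blocks = [StandardScaler().fit_transform(rng.normal(loc=shift, size=(200, 50)))
          for shift in (0.0, 2.0, -1.0)]

pooled = np.vstack(blocks)                              # concatenated, as in AMLcat_scaled
rescaled = StandardScaler(with_mean=True).fit_transform(pooled)

print(np.abs(rescaled - pooled).max())                  # ~0: the second pass is close to a no-op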

Back in 2019, SAM was unable to handle multiple samples, and this was the workaround I had developed to overcome donor-driven variation, which would otherwise dominate the final result. SAM now has the option to incorporate harmony correction on the nearest-neighbor matrix within each round of the algorithm, prior to calculating the spatial dispersions of genes. This harmony-based correction is more elegant than the approach I used in the Nat Med paper and performs better, and it is what we use for new applications of SAM. Just download the new SAM algorithm and specify your donor as the batch_key.

https://github.com/atarashansky/self-assembling-manifold
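Roughly, the newer workflow looks something like this (a sketch, assuming the samalg package from that repository; 'donor' is a placeholder for whichever obs column holds your donor labels):

from samalg import SAM

# adata: your combined AnnData across donors, with donor labels in adata.obs['donor']
sam = SAM(counts=adata)
sam.preprocess_data()          # SAM's own normalization and gene filtering
sam.run(batch_key='donor')     # donor-aware correction within each SAM iteration
sam.adata                      # feature weights, neighbors and embedding end up here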

I hope this helps with your analysis. Let me know if you have any other questions.

Best wishes, Andy