evanpeikon / GSE_130646

Analysis of single-cell transcriptional profiles in human skeletal muscle
3 stars 0 forks source link

Get different UMAP with similar QC, but not transposing AnnData object #1

Open indapa opened 1 month ago

indapa commented 1 month ago

Hi @evanpeikon

Thank you for sharing this analysis. Interestingly, I get similar numbers of cells and genes, but I don't transpose the AnnData object. I end up with nearly identical numbers of genes and cells, but a vastly different UMAP projection.

Is there something I'm doing incorrectly when not transposing the data?

Concatenate Anndata objects

# convert ea/ samples associated CSV file into AnnData object
file_1 = pd.read_csv('GSM3746212_Muscle_1_Counts.csv', index_col=0)
muscle_1 = sc.AnnData(file_1)

file_2 = pd.read_csv('GSM3746213_Muscle_2_Counts.csv', index_col=0)
muscle_2 = sc.AnnData(file_2)

file_3 = pd.read_csv('GSM3746214_Muscle_3_Counts.csv', index_col=0)
muscle_3 = sc.AnnData(file_3)

file_4 = pd.read_csv('GSM3746215_Muscle_4_Counts.csv', index_col=0)
muscle_4 = sc.AnnData(file_4)

# create list of AnnData objects for ea/ sample
adatas = [muscle_1, muscle_2, muscle_3, muscle_4]

# combine all AnnData objects into a single object
adata_combined = sc.concat(adatas, axis=1, label='sample', keys=['muscle_1', 'muscle_2', 'muscle_3', 'muscle_4'])
adata_combined.var_names_make_unique()

Add some QC metrics for mito, ribo, hemo genes


# mitochondrial genes
adata_combined.var["mt"] = adata_combined.var_names.str.startswith("MT-")
# ribosomal genes
adata_combined.var["ribo"] = adata_combined.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes.
adata_combined.var["hb"] = adata_combined.var_names.str.contains(("^HB[^(P)]"))

sc.pp.calculate_qc_metrics(
    adata_combined, qc_vars=["mt", "ribo", "hb"], inplace=True, percent_top=[20], log1p=True
)
adata_combined

Plot n_cells and n_genes by counts

This is the same results in your notebook, but not transposing data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(adata_combined.var['n_cells_by_counts'], bins=50, kde=True, ax=ax1)
ax1.set_title('Number of cells by counts')

#It counts the number of genes that have at least one read (or UMI) in each cell.
sns.histplot(adata_combined.obs['n_genes_by_counts'], bins=50, kde=True, ax=ax2)
ax2.set_title('Number of genes by counts')

plt.show()

image

Filter genes and cells

This seems to be pretty much the same as your result


sc.pp.filter_cells(adata_combined, min_genes=20) 
sc.pp.filter_genes(adata_combined, min_cells=200) 

adata_combined

AnnData object with n_obs × n_vars = 10484 × 2871

Normalization,feature selection, dimensionality reduction

# normalization
sc.pp.normalize_total(adata_combined, target_sum=1e4)
sc.pp.log1p(adata_combined)

# feature selection
sc.pp.highly_variable_genes(adata_combined, n_top_genes=2000, subset=True)

# pca
sc.pp.pca(adata_combined, svd_solver="arpack", use_highly_variable=True)

# neighbor graph, umap
sc.pp.neighbors(adata_combined, n_neighbors=10, n_pcs=10)
sc.tl.umap(adata_combined)

# clustering 
sc.tl.leiden(adata_combined)

# visualize clusters on umap
sc.pl.umap(
    adata_combined,
    color=["leiden"],
    legend_loc="on data",
)

image

evanpeikon commented 1 month ago

Hi @indapa, There are a few potential reasons why you're UMAP projection looks different than mine. I've listed three potential reasons below:

1) Not Transposing AnnData Object: Since you're not transposing your AnnData object your observations and variables are flipped. The default format after turning the raw data into an AnnData object is with the rows (observations) being genes and the columns (variables) being cell IDs. After transposing the AnnData object the rows (observations) are cell IDs and the columns (variables) are genes. Since you did not transpose your AnnData object you may be mislaying genes are cells and cells as genes in your chart and filtering criteria.

2) Different Filtering Criteria: It appears that you're using different filtering criteria in your code. For example, you have....

sc.pp.filter_cells(adata_combined, min_genes=20) 
sc.pp.filter_genes(adata_combined, min_cells=200) 

Which filters out cells with fewer than 20 detected genes and filters out genes than appear in fewer than 200 cells. In my code I use the opposite filtering criteria, demonstrated below...

sc.pp.filter_cells(adata_transposed, min_genes=200)
sc.pp.filter_genes(adata_transposed, min_cells=20) 

The code above filters out cells with fewer than 200 detected genes and genes that appear in fewer than 20 cells. Since we used different filtering criteria we end up with different shaped AnnData objects, which will impact our UMAP projections ,

3) Different UMAP Settings: Another potential reason that we have different UMAP projections is that our Leiden algorithm settings are different. For example, in your code you wrote...

sc.tl.leiden(adata_combined)

Wheres in my code (below) I specified my resolution and number of iterations, among other settings...

sc.tl.leiden(adata_transposed, key_added='clusters', resolution=0.5, n_iterations=3, flavor='igraph', directed=False)

These specifications can have significant impacts on the shape, number, and distribution of clusters.

In order to determine which of the factors above are resulting in the differences between our UMAP plots you can run an experiment where you change one variable at a time and see how it impacts your results. For example, you can try transposing your AnnData object without changing any downstream code. Then you can try changing your filtering criteria without altering anything else from the code above, or using the same leiden algorithm settings I used without changing anything else. This will allow you to see how individual changes alter your results, and then you can begin to troubleshoot from there.

It's also worth noting that my UMAP projection isn't the ideal. Various factors can influence how data is clustering, and ultimately we care most about the biological significance of our data. It's possible that you fork my code and change something, and that your clusters better represent distinct cell types than mine. This is why it's important to try and understand the meaning of our results, and even add additional downstream analysis like GO and pathway analysis to see if our results make biological sense.

indapa commented 1 month ago

Hi @evanpeikon - thank you for the timely response. I think the issue is with not transposing vs transposing.

You are right that the filtering criteria is different. When I set those same values of min_genes and min_cells, I get 2187 cells and 2876 genes.

sc.pp.filter_cells(adata_combined, min_genes=200)
sc.pp.filter_genes(adata_combined, min_cells=20)

adata_combined

AnnData object with n_obs × n_vars = 2187 × 2876

Is there a reason to transpose? The default AnnData has the obs being genes and var being cells. Do the downstream API calls like sc.pp.normalize_total(adata_transposed, target_sum=1e4) adjust for difference.

evanpeikon commented 1 month ago

You're welcome. After filtering I get 10485 genes and 2876 cells, then after normalizing my data and limiting my dataset to the 2000 most highly variable genes I end up with 2187 cells and 2000 genes.

The reason I transposed my AnnData object is partly to match convention (it's common practice to have your rows/observations be cell IDs and columns/variables to be genes), and partly because certain analysis tools assume that AnnData objects are formatted this way. Technically, you don't need to transpose the AnnData object, but in some cases you may run into issues, or have to modify existing pipelines to fit the structure of your data - as a result, I tend to try and format my data to fit best practices by default, even if it's not necessary for the specific analysis I'm performing in a certain scenario.