UMAP looks like a line when neighborhood size was determined by using cell type labels

Evenlyeven commented 1 year ago

Thanks for the useful tool!

I noticed that in my results, some areas of the UMAP look like solid lines (for example, the cluster at the top of the screenshot below). I wonder whether this is because the run was set to determine neighborhood size from the cell type labels I provided. Does this look normal to you?

When I check the UMAPs of each dataset before SAMap stitches them together, both look "normal" to me.

sam1: (UMAP screenshot)

sam2: (UMAP screenshot)

Also, in my test run, where I didn't use cell type labels to determine neighborhood size, hopping along each cell's outgoing edges was used instead, and the UMAP looked more "normal" to me.

Any comments or suggestions will be highly appreciated!

The script I used is attached below (paths were replaced by ...):

from samap.mapping import SAMAP
from samap.analysis import (get_mapping_scores, GenePairFinder,
                            sankey_plot, chord_plot, CellTypeTriangles,
                            ParalogSubstitutions, FunctionalEnrichment,
                            convert_eggnog_to_homologs, GeneTriangles)
from samalg import SAM
import pandas as pd
import anndata
from joblib import dump, load

zf_data = anndata.read_h5ad('....')
pf_data = anndata.read_h5ad('....')

sam1 = SAM(counts = zf_data)
sam1.preprocess_data(filter_genes = False)
sam1.run(batch_key = 'orig.ident', npcs = 30)

sam2 = SAM(counts = pf_data)
sam2.preprocess_data(filter_genes = False)
sam2.run(npcs = 20)

sams = {'zf': sam1, 'pf': sam2}

sm = SAMAP(sams,
           keys = {'zf': 'cell_type', 'pf': 'cell_type'},
           f_maps = '...',
           save_processed = True)

Thanks very much in advance!

Di

atarashansky commented 1 year ago

Can you give me a sense of how large the cell type labels are? It would be great if you could show me the number of cells assigned to each label.
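One quick way to produce such a table is to count how often each label occurs. The sketch below uses only the standard library and a made-up label list. The helper name `label_counts` and the example labels are hypothetical, and reading the labels from something like `sam1.adata.obs['cell_type']` is an assumption about how the data is stored.

```python
from collections import Counter


def label_counts(labels):
    """Return (label, count) pairs sorted from most to least frequent."""
    return Counter(labels).most_common()


# Hypothetical labels for illustration only. With a SAM object, the labels
# would presumably come from a column such as sam1.adata.obs['cell_type']
# (assumption, not taken from this thread).
example_labels = ["neuron", "muscle", "neuron", "epidermis", "neuron", "muscle"]

for label, n_cells in label_counts(example_labels):
    print(f"{label}\t{n_cells}")
```

Running this once per species gives one table per dataset, which makes very small or very large labels easy to spot.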

Evenlyeven commented 1 year ago

Here are tables showing the number of cells assigned to each label.

Species zf:

Species pf:

Another question: is it best if the numbers of input cells from the different species are comparable? I am working with 200 cells from one species and 8,000 cells from another, and was thinking about downsampling the larger dataset.
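For reference, one possible way to downsample the larger dataset while keeping every cell type represented is to sample within each label. This is only a sketch under assumptions: the function name `downsample_by_label`, the toy label list, and the proportional-per-label strategy are all illustrative and are not part of SAMap.

```python
import random
from collections import defaultdict


def downsample_by_label(labels, n_target, seed=0):
    """Pick roughly n_target cell indices, sampling each label in proportion
    to its size and keeping at least one cell per label."""
    if not labels:
        return []
    rng = random.Random(seed)

    # Group cell indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Fraction of cells to keep overall (never more than all of them).
    frac = min(1.0, n_target / len(labels))

    keep = []
    for idxs in by_label.values():
        k = max(1, round(len(idxs) * frac))
        keep.extend(rng.sample(idxs, k))
    return sorted(keep)


# Toy example: 8,000 cells with made-up labels, reduced to about 200.
labels = ["A"] * 6000 + ["B"] * 1900 + ["C"] * 100
kept = downsample_by_label(labels, n_target=200)
print(len(kept), "cells kept")
```

The returned indices could then be used to subset the larger dataset before building the SAM object. Because of per-label rounding, the final count can differ from `n_target` by a few cells.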

Thank you!!

atarashansky commented 1 year ago

I think SAMap can be robust to dataset size disparities, but I would encourage you to try downsampling and check if the results change. I would also encourage changing the (poorly documented) NHS parameter in SAMAP.run like so:

NHS = {'small_dataset_id': 2, 'big_dataset_id': 3}

NHS controls neighborhood size. A value of 3 means that a cell's neighborhood includes cells up to 3 edges away. A value of 2 shrinks the neighborhood, which is probably better for smaller datasets.
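To make the "edges away" idea concrete, here is a toy illustration of how the number of hops changes a neighborhood. It is a plain breadth-first search over a hand-written graph, not SAMap's actual implementation. The graph, the cell names, and the function `k_hop_neighborhood` are all hypothetical.

```python
from collections import deque


def k_hop_neighborhood(graph, start, max_hops):
    """Return the set of nodes reachable from `start` by following at most
    `max_hops` outgoing edges (the start node itself is excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    seen.discard(start)
    return seen


# Hypothetical directed nearest-neighbor graph: cell -> outgoing edges.
toy_graph = {
    "c0": ["c1", "c2"],
    "c1": ["c3"],
    "c2": ["c3", "c4"],
    "c3": ["c5"],
    "c4": ["c5"],
    "c5": ["c6"],
    "c6": [],
}

for hops in (1, 2, 3):
    print(hops, sorted(k_hop_neighborhood(toy_graph, "c0", hops)))
```

In this toy graph the neighborhood of `c0` grows from 2 cells at 1 hop to 4 cells at 2 hops and 5 cells at 3 hops. In a dataset of only a few hundred cells, a 3-hop neighborhood can cover a large share of the data, which is consistent with the suggestion to use a smaller value there.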

atarashansky commented 1 year ago

Instead of using keys in SAMAP(...), can you try using neigh_from_keys in SAMAP.run(...)? You can pass it exactly the same value that you're passing to keys.

If you use neigh_from_keys, then NHS is not needed.

Evenlyeven commented 1 year ago

Thanks a lot for your suggestions, I will try it.

atarashansky / SAMap

UMAP looks like a line when neighborhood size was determined by using cell type labels #117