LungCellAtlas / HLCA


Error with Mapping to HLCA #2

Closed FiaSB1 closed 1 year ago

FiaSB1 commented 2 years ago

Hi there,

Pardon me if this post is a bit all over the place; it's my first time posting on GitHub, but hopefully I've articulated my issues well enough.

To start off, I am interested in mapping a COPD single-cell dataset onto the HLCA. I have been facing repeated errors despite converting the object (originally an R Seurat object) to .h5ad format and subsetting the data to reduce the file size. Could I get a hand with understanding what the issue might be and how I can fix it? Here are the analysis results.


In this notebook, we will guide you through how to map your data to the Human Lung Cell Atlas (core reference), perform label transfer, and more. For that purpose we use scArches, a method to map new single cell/single nucleus data to an existing reference (see also Lotfollahi et al., Nature Biotechnology 2021 https://doi.org/10.1038/s41587-021-01001-7).

Start by setting the directory that has the needed files (model, cell type names, reference embedding etc.) as downloaded from Zenodo @LISA ADD LINK!:

# folder and files
HLCA_folder = "HLCA_scArches"
output_dir = "results"
output_file = "output.h5ad"
# check attached datasets
import fgread
import os
# make sure exactly two datasets are attached
assert len(os.listdir("/fastgenomics/data/")) == 2,"ERROR: There is more than one query dataset attached to this analysis."
# find HLCA and query data folder
HLCA_emb_and_metadata = "/fastgenomics/data/dataset_0001/HLCA_emb_and_metadata.h5ad"
query_path = "/fastgenomics/data/dataset_0002"

if not os.path.exists(HLCA_emb_and_metadata):
    HLCA_emb_and_metadata = "/fastgenomics/data/dataset_0002/HLCA_emb_and_metadata.h5ad"
    query_path = "/fastgenomics/data/dataset_0001"
# make sure there is only one expression data file in query path
df = fgread.ds_info()
expr_data = df.expressionDataFileInfos[df.path==query_path]
assert len(expr_data.values[0]) == 1,"ERROR: There is more than one expression data file in the query data."
test_dataset = query_path + "/" + expr_data.values[0][0].get("name")
expr_data = fgread.ds_info().expressionDataFileInfos[df.path==query_path]
expr_data.values[0]
[{'name': 'onesample.h5ad', 'size': 413581346, 'state': 'Ready'}]
!rm -r /fastgenomics/analysis/$output_dir
!mkdir /fastgenomics/analysis/$output_dir
rm: cannot remove '/fastgenomics/analysis/results': No such file or directory
The required modules are provided with the dockerized software environment. Note that we use scArches version 0.3.5 with scvi-tools 0.8.1. For efficiency of knn-graph and umap calculation, we use scanpy==1.8.2, umap-learn==0.5.2, and pynndescent==0.5.5.
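If you are running this outside the provided Docker image, pinning the same versions would look roughly like this (a sketch; the exact pin set and a compatible torch build may need adjusting for your platform):

```python
# install the versions listed above (only needed when not using the provided Docker image)
!pip install scarches==0.3.5 scvi-tools==0.8.1 scanpy==1.8.2 umap-learn==0.5.2 pynndescent==0.5.5
```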

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import os
import scarches as sca
import scArches-based label transfer script:

import sys
sys.path.append(HLCA_folder)
import scarches_label_transfer
Set figure parameters (change for bigger/smaller plots):

sc.set_figure_params(figsize=(5,5))
We'll print versions of the imported modules here, for reproducibility purposes:

sc.logging.print_versions()
WARNING: If you miss a compact list, please try `print_header`!
-----
anndata     0.7.5
scanpy      1.8.2
sinfo       0.3.1
-----
PIL                         8.4.0
anndata                     0.7.5
attr                        21.2.0
backcall                    0.2.0
beta_ufunc                  NA
binom_ufunc                 NA
brotli                      NA
cached_property             1.5.2
cairo                       1.20.1
certifi                     2021.10.08
cffi                        1.15.0
charset_normalizer          2.0.9
colorama                    0.4.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.5.1
decorator                   5.1.0
defusedxml                  0.7.1
dunamai                     1.6.0
entrypoints                 0.3
fgread                      0.7.7
get_version                 3.5.3
h5py                        3.6.0
idna                        3.1
igraph                      0.8.3
ipykernel                   6.6.0
ipython_genutils            0.2.0
ipywidgets                  7.6.5
jedi                        0.18.1
joblib                      1.1.0
kiwisolver                  1.3.2
leidenalg                   0.8.3
llvmlite                    0.36.0
louvain                     0.6.1
matplotlib                  3.5.1
matplotlib_inline           NA
mpl_toolkits                NA
natsort                     8.0.1
nbinom_ufunc                NA
numba                       0.53.1
numexpr                     2.8.0
numpy                       1.21.4
packaging                   21.3
pandas                      1.2.5
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.24
ptyprocess                  0.7.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.10.0
pyparsing                   3.0.6
pytz                        2021.3
requests                    2.26.0
rich                        NA
scanpy                      1.8.2
scarches                    NA
scarches_label_transfer     NA
scipy                       1.7.3
scvi                        0.8.1
seaborn                     0.11.2
setuptools                  59.4.0
sinfo                       0.3.1
six                         1.16.0
sklearn                     1.0.1
socks                       1.7.1
sphinxcontrib               NA
statsmodels                 0.13.1
storemagic                  NA
tables                      3.6.1
texttable                   1.6.4
threadpoolctl               3.0.0
torch                       1.6.0+cu92
tornado                     6.1
tqdm                        4.62.3
traitlets                   5.1.1
typing_extensions           NA
unicodedata2                NA
urllib3                     1.26.7
wcwidth                     0.2.5
zipp                        NA
zmq                         22.3.0
-----
IPython             7.30.1
jupyter_client      7.1.0
jupyter_core        4.9.1
notebook            6.4.6
-----
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
Linux-5.4.0-1085-azure-x86_64-with-debian-bullseye-sid
4 logical CPU cores, x86_64
-----
Session information updated at 2022-06-30 11:10

import information about reference:
These files can be downloaded from Zenodo (see beginning of notebook)

# gene order for scArches model
reference_gene_order = pd.read_csv(f"{HLCA_folder}/HLCA_scarches_gene_order.csv")
# reference embedding, including cell/sample/subject metadata:
reference_embedding = sc.read_h5ad(HLCA_emb_and_metadata)
# path to scArches reference model
reference_model_path = HLCA_folder
import your data:
IMPORTANT:
Make sure that you have raw (i.e. un-normalized, integer) counts in your data (.X layer).

Here we use an example dataset, from the Delorey et al. preprint:  https://www.biorxiv.org/content/10.1101/2021.02.25.430130v1. We only use the fresh, single-cell sample from this dataset.

query_data_full = sc.read_h5ad(test_dataset)
def subset_and_pad_adata_object(adata, ref_genes_df, min_n_genes_included=1500):
    """Function to subset a dataset to the 2000 genes used for scArches mapping.
    Missing genes are added and set to 0 counts for all cells."""
    # test if adata.var.index has gene names or ensembl names:
    n_ids_detected = sum(adata.var.index.isin(ref_genes_df.gene_id))
    n_symbols_detected = sum(adata.var.index.isin(ref_genes_df.gene_symbol))
    if max(n_ids_detected, n_symbols_detected) < min_n_genes_included:
        # change column names to lower case
        adata.var.columns = adata.var.columns.str.lower()
        # check if gene names are in another column:
        if "gene_symbols" in adata.var.columns:
            adata.var.index = adata.var.gene_symbols
            n_symbols_detected = sum(adata.var.index.isin(ref_genes_df.gene_symbol))
        elif "gene_ids" in adata.var.columns:
            adata.var.index = adata.var.gene_ids
            n_ids_detected = sum(adata.var.index.isin(ref_genes_df.gene_id))
        # check if anything changed:
        if max(n_ids_detected, n_symbols_detected) < min_n_genes_included:    
            raise ValueError(f"We could detect only {max(n_ids_detected, n_symbols_detected)} genes of the 2000 that we need for the mapping! The minimum overlap is {min_n_genes_included}. Contact the HLCA team for questions. Exiting")
    else:
        if n_ids_detected >= n_symbols_detected:
            gene_type = "gene_id"
            print("Gene names detected: ensembl gene ids.")
            n_genes = n_ids_detected  
        else:
            gene_type = "gene_symbol"
            n_genes = n_symbols_detected
            print("Gene names detected: ensembl gene symbols.")
    genes = adata.var.index[adata.var.index.isin(ref_genes_df[gene_type])].tolist()
    # if not all genes are included, pad:
    if n_genes > 2000:
        raise ValueError("Your gene names appear not to be unique, something must be wrong. Exiting.")
    print(f"{n_genes} genes detected out of 2000 used for mapping.")
    # Subset adata object
    adata_sub = adata[:,genes].copy()
    # Pad object with 0 genes if needed
    if n_genes < 2000:
        diff = 2000 - n_genes
        print(f'Not all genes were recovered, filling in zeros for {diff} missing genes...')
        # Genes to pad with
        genes_to_add = set(ref_genes_df[gene_type].values).difference(set(adata_sub.var_names))
        df_padding = pd.DataFrame(data=np.zeros((adata_sub.shape[0],len(genes_to_add))), index=adata_sub.obs_names, columns=genes_to_add)
        adata_padding = sc.AnnData(df_padding)
        # Concatenate object
        adata_sub = anndata.concat([adata_sub, adata_padding], axis=1, join='outer', index_unique=None, merge='unique')
        # and order:
        adata_sub = adata_sub[:,ref_genes_df[gene_type]].copy()
    return adata_sub
query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)
Gene names detected: ensembl gene symbols.
1861 genes detected out of 2000 used for mapping.
Not all genes were recovered, filling in zeros for 139 missing genes...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9/2669176631.py in <module>
----> 1 query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)

/tmp/ipykernel_9/2768654640.py in subset_and_pad_adata_object(adata, ref_genes_df, min_n_genes_included)
     43         adata_padding = sc.AnnData(df_padding)
     44         # Concatenate object
---> 45         adata_sub = anndata.concat([adata_sub, adata_padding], axis=1, join='outer', index_unique=None, merge='unique')
     46         # and order:
     47         adata_sub = adata_sub[:,ref_genes_df[gene_type]].copy()

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    905             f"{alt_dim}p": alt_pairwise,
    906             "uns": uns,
--> 907             "raw": raw,
    908         }
    909     )

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    319                 varp=varp,
    320                 filename=filename,
--> 321                 filemode=filemode,
    322             )
    323 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    508         # TODO: Think about consequences of making obsm a group in hdf
    509         self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
--> 510         self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
    511 
    512         self._obsp = PairwiseArrays(self, 0, vals=convert_to_dict(obsp))

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
    230         self._data = dict()
    231         if vals is not None:
--> 232             self.update(vals)
    233 
    234 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/_collections_abc.py in update(*args, **kwds)
    839             if isinstance(other, Mapping):
    840                 for key in other:
--> 841                     self[key] = other[key]
    842             elif hasattr(other, "keys"):
    843                 for key in other.keys():

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    149 
    150     def __setitem__(self, key: str, value: V):
--> 151         value = self._validate_value(value, key)
    152         self._data[key] = value
    153 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    213                 f"value.index does not match parent’s axis {self.axes[0]} names"
    214             )
--> 215         return super()._validate_value(val, key)
    216 
    217 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     51                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     52                 raise ValueError(
---> 53                     f"Value passed for key {key!r} is of incorrect shape. "
     54                     f"Values of {self.attrname} must match dimensions "
     55                     f"{self.axes} of parent. Value had shape {val.shape} while "

ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions (1,) of parent. Value had shape (10305, 50) while it should have had (2000,).
check gene order:

if (query_data.var.index == reference_gene_order.gene_symbol).all() or (
    query_data.var.index == reference_gene_order.gene_id
).all():
    print("Gene order is correct.")
else:
    print(
        "WARNING: your gene order does not match the order of the HLCA reference. Fix this before continuing!"
    )
Prepare query data for mapping:
Make sure that you have raw data in query_data.X. We will now copy that information to query_data.raw for scArches.

query_data.raw = query_data
raw = query_data.raw.to_adata()
raw.X = query_data.X
query_data.raw = raw
quick check if X and raw.X have integers (do a more systematic check if you have any doubts!):

query_data.raw.X[:10, :10].toarray()
query_data.X[:10, :10].toarray()
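For a more systematic check than inspecting a 10x10 corner (a sketch, not part of the original notebook), one could verify that every value in .X and .raw.X is a non-negative whole number:

```python
import numpy as np
from scipy.sparse import issparse

def looks_like_raw_counts(adata) -> bool:
    """Heuristic: True if all values in adata.X are non-negative integers."""
    X = adata.X.data if issparse(adata.X) else np.asarray(adata.X)
    return bool((X >= 0).all() and np.all(np.mod(X, 1) == 0))

print("X holds integer counts:    ", looks_like_raw_counts(query_data))
print("raw.X holds integer counts:", looks_like_raw_counts(query_data.raw.to_adata()))
```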
Set query_data.obs["scanvi_label"] to "unlabeled". Keep this code as is; it has to do with the way the reference model was changed.

query_data.obs["scanvi_label"] = "unlabeled"
Perform surgery on reference model and train on query dataset without cell type labels
Set scArches parameters:
Store the set of batches from your query data. This might just be a single batch.
In the extended HLCA, we have treated datasets as batches. If you expect large batch effects between subsets of your data, it is better to split your dataset into multiple datasets. For example, if you have data that is a mixture of 5' and 3' data, it would be better to consider those separate batches.

batch_variable = "dataset"  # the column name under which you stored your batch variable
query_batches = sorted(query_data.obs[batch_variable].unique())
print(query_batches)
These are the parameter settings we used for the extended HLCA, as recommended by the scArches developers:

surgery_epochs = 500
early_stopping_kwargs_surgery = {
    "early_stopping_metric": "elbo",
    "save_best_state_metric": "elbo",
    "on": "full_dataset",
    "patience": 10,
    "threshold": 0.001,
    "reduce_lr_on_plateau": True,
    "lr_patience": 8,
    "lr_factor": 0.1,
}
set the output directory. The output model will be stored here, under the name of the batch. It will over-write any existing model with the same name!

# output_dir = "testout"
Now perform scArches "surgery". The result will be stored in the output dir specified above. Each dataset/batch has its own "adapter" that will be trained separately, and will be stored in the output_dir under a folder named after the batch.
Note: if you use gene names rather than ensembl IDs in your query_data.var.index, you will get a warning that your genes (var_names) do not match the training genes. You can ignore this warning, as long as you have done the gene check in the beginning of the notebook.

for batch in query_batches: # this loop is only necessary if you have multiple batches, but will also work for a single batch.
    print("Batch:", batch)
    query_subadata = query_data[query_data.obs[batch_variable] == batch,:].copy()
    print("Shape:", query_subadata.shape)
    # load model and set relevant variables:
    model = sca.models.SCANVI.load_query_data(
        query_subadata,
        reference_model_path,
        freeze_dropout = True,
    )
    model._unlabeled_indices = np.arange(query_subadata.n_obs)
    model._labeled_indices = []
    # now train surgery model using reference model and target adata
    model.train(
        n_epochs_semisupervised=surgery_epochs,
        train_base_model=False,
        semisupervised_trainer_kwargs=dict(
            metrics_to_monitor=["accuracy", "elbo"], 
            weight_decay=0,
            early_stopping_kwargs=early_stopping_kwargs_surgery
        ),
        frequency=1
    )
    surgery_path = f"{output_dir}/{batch}"
    if not os.path.exists(surgery_path):
        os.makedirs(surgery_path)
    model.save(surgery_path, overwrite=True)
to reload specific models, use the following code (not necessary if you run this script in one go)

# surgery_path = f"{output_dir}/{batch}"
# model = sca.models.SCANVI.load_query_data(query_data, 'surgery_model', freeze_dropout=True) # this is before training
# surgery_path = f"{output_dir}/{batch}"
# model = sca.models.SCANVI.load(surgery_path,query_data) # if already trained
Get the latent representation of your query dataset
Here we will calculate the "latent representation", or "low-dimensional embedding" of your dataset. This embedding is in the same space as the HLCA embedding that you loaded in the beginning of the script. Hence, we can combine the two embeddings afterwards (HLCA + your new data), and do joint clustering, UMAP embedding, label transfer etc.!

Create an empty dataframe to store the embeddings of all batches:

emb_df = pd.DataFrame(index=query_data.obs.index,columns=range(0,reference_embedding.shape[1]))
Calculate embeddings, using the trained scArches model:

for batch in query_batches: # from small to large datasets
    print(f"Working on {batch}...")
    query_subadata = query_data[query_data.obs[batch_variable] == batch,:].copy()
    surgery_path = f"{output_dir}/{batch}"
    model = sca.models.SCANVI.load(surgery_path, query_subadata)
    query_subadata_latent = sc.AnnData(model.get_latent_representation(adata=query_subadata))
    # copy over .obs
    query_subadata_latent.obs = query_data.obs.loc[query_subadata.obs.index,:]
    query_subadata_latent.write(f"{output_dir}/{batch}/emb.h5ad")
    emb_df.loc[query_subadata.obs.index,:] = query_subadata_latent.X
Convert final embedding (of all your query batches) to an anndata object:

query_embedding = sc.AnnData(X=emb_df.values, obs=query_data.obs)
Combine the embedding of the HLCA and the query data
Add query vs. reference information:

query_embedding.obs['HLCA_or_query'] = "query"
reference_embedding.obs['HLCA_or_query'] = "HLCA"
We will now combine the two embeddings to enable joint clustering etc. If you expect non-unique barcodes (.obs index), set index_unique to e.g. "_" and batch_key to the obs column that you want to use as barcode suffix (e.g. "dataset").

# concatenate source and target data
combined_emb = reference_embedding.concatenate(query_embedding, index_unique=None) # index_unique="_", batch_key="dataset") # alternative
Note that if wanted, this embedding can be added to the full, joint HLCA + query data object (including gene counts). The full HLCA, including normalized counts is publicly available. For now, we will just work with the embedding, since that is all we need to perform joint plotting and label transfer.

Downstream analysis: plotting and label transfer
umap plotting:
We can now do joint downstream analysis, such as clustering, UMAP embedding, label transfer etc.

Note that the knngraph + umap calculation can take quite a while. See notes at the top of this notebook for speeding up the process.

sc.pp.neighbors(combined_emb, n_neighbors=30)
sc.tl.umap(combined_emb)
Now we can start plotting umaps; here is just one example:

sc.pl.umap(combined_emb, color="HLCA_or_query", save="_HLCA_and_query_cells_from_query.png")
Store result, if wanted:

# combined_emb.write(f"{output_dir}/{output_file}")
Any other graph-based or latent-space based downstream method can also be used, based on the embedding and the above-calculated graph. We will proceed to label transfer here.
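As one example of such a graph-based step (a sketch, not part of the original notebook; the key and file names here are arbitrary), a joint Leiden clustering of reference and query cells on the shared neighbor graph could look like this:

```python
# cluster reference and query cells together on the kNN graph computed above
sc.tl.leiden(combined_emb, resolution=1.0, key_added="joint_leiden")
sc.pl.umap(combined_emb, color="joint_leiden", save="_HLCA_and_query_joint_leiden.png")
```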

Label transfer:
Next, we use a knn classifier to transfer the labels from the reference to the query. We do this for every level of the annotation (i.e. level 1-5).
Note that some cell types don't have annotations for higher levels, e.g. mast cells do not have level 4 or 5 annotations. For those cell types, we "propagate" to the higher levels, i.e. you will see "3_Mast cells" in level 4 and 5 annotations. (Most cell types don't have a level 5 annotation!) Therefore, all highest level annotations can be found under level 5.

Import the set of finest cell type labels, and their matching lower-level annotations (cell types are also ordered in a biologically sensible order in this table, you can use this order for downstream plotting etc. if wanted):

cts_ordered = pd.read_csv(f"{HLCA_folder}/HLCA_celltypes_ordered.csv",index_col=0)    
Now run the label transfer commands. Note that this might take quite a while if you have a large query dataset! For our small test dataset, it should not take long.

# run k-neighbors transformer
k_neighbors_transformer = scarches_label_transfer.weighted_knn_trainer(
    train_adata=reference_embedding,
    train_adata_emb="X", # location of our joint embedding
    n_neighbors=50,
    )    
# perform label transfer
labels, uncert = scarches_label_transfer.weighted_knn_transfer(
    query_adata=query_embedding,
    query_adata_emb="X", # location of our joint embedding
    label_keys="Level",
    knn_model=k_neighbors_transformer,
    ref_adata_obs = reference_embedding.obs.join(cts_ordered, on='ann_finest_level')
    )
With the commands above, we labeled every cell from the query. However, some cells might have high label transfer uncertainty. It is useful to set those to "unknown" instead of giving them a cell type label. This will help highlight cell types/states that are new (i.e. not present in the reference) and possibly interesting; they're worth taking a careful look at!

This is the uncertainty threshold we use in the paper, limiting the false positive rate to <0.5:

uncertainty_threshold = 0.2
We will now copy the unfiltered cell type labels ("Level_[level]_transfered_label_unfiltered"), the uncertainty scores ("Level_[level]_transfer_uncert") and the filtered cell type labels (i.e. including "Unknown", "Level_[level]_transfered_label") to our combined_emb object.

labels.rename(columns={f"Level_{lev}":f"Level_{lev}_transfered_label_unfiltered" for lev in range(1,6)},inplace=True)
uncert.rename(columns={f"Level_{lev}":f"Level_{lev}_transfer_uncert" for lev in range(1,6)},inplace=True)
combined_emb.obs = combined_emb.obs.join(labels)
combined_emb.obs = combined_emb.obs.join(uncert)
# convert to arrays instead of categoricals, and set "nan" to NaN 
for col in combined_emb.obs.columns:
    if col.endswith("_transfer_uncert"):
        combined_emb.obs[col] = list(np.array(combined_emb.obs[col]))
    elif col.endswith("_transfered_label_unfiltered"):
        filtered_colname = col.replace("_unfiltered","")
        matching_uncert_col = col.replace("transfered_label_unfiltered","transfer_uncert")

        # also create a filtered version, setting cells with too high 
        # uncertainty levels to "Unknown"
        combined_emb.obs[filtered_colname] = combined_emb.obs[col]
        combined_emb.obs[filtered_colname].loc[combined_emb.obs[matching_uncert_col]>uncertainty_threshold] = "Unknown"
        # convert to categorical:
        combined_emb.obs[col] = pd.Categorical(combined_emb.obs[col])
        combined_emb.obs[filtered_colname] = pd.Categorical(combined_emb.obs[filtered_colname])
        # then replace "nan" with NaN (that makes colors better in umap)
        combined_emb.obs[col].replace("nan",np.nan,inplace=True)
        combined_emb.obs[filtered_colname].replace("nan",np.nan,inplace=True)
Let's take a look at the percentage of cells set to "unknown" after our filtering:

print(f"Percentage of unknown per level, with uncertainty_threshold={uncertainty_threshold}:")
for level in range(1,6):
    print(f"Level {level}: {np.round(sum(combined_emb.obs[f'Level_{level}_transfered_label'] =='Unknown')/query_data.n_obs*100,2)}%")
Here we show label transfer uncertainties per level. Regions with high uncertainty can highlight interesting cell types/states not present in the reference. Note that uncertainties get higher the more detailed the annotation level:

sc.pl.umap(combined_emb,color=[f"Level_{lev}_transfer_uncert" for lev in range(1,6)],ncols=5, save="_HLCA_and_query_lab_transf_uncert.png")
Now let's take a look at the transfered labels, at every level. Note that the color for "Unknown" switches per plot.

For cellxgene visualization to work, we need to set our np.nans to "nan" strings, so we'll do that here:

for level in range(1,6):
    combined_emb.obs[f"Level_{level}_transfered_label"] = pd.Categorical(
        pd.Series(combined_emb.obs[f"Level_{level}_transfered_label"].tolist()).fillna("nan")
    )
    if f"Level_{level}_transfered_label_colors" in combined_emb.uns.keys():
        print("removing colors for level", level)
        del combined_emb.uns[f"Level_{level}_transfered_label_colors"]
Now plot:

sc.pl.umap(combined_emb,color=[f"Level_{lev}_transfered_label" for lev in range(1,6)],na_color="grey",ncols=1,size=0.5, save="_HLCA_and_query_transf_labels.png")
For your reference, these are the annotations of the reference atlas:

sc.pl.umap(combined_emb, color="ann_finest_level", save="_HLCA_and_query_ct_annotations.png")
combined_emb.write(f"{output_dir}/{output_file}")
Store uncertainty summary stats for sharing with HLCA (i.e. 25th, 50th and 75th uncertainty percentile for every transfered label at every level):

min_n_cells = 20
### loop through batches
for batch in query_batches:
    emb_batch = combined_emb[combined_emb.obs[batch_variable] == batch,:].copy()
    # initiate empty dataframe:
    ds_uncert_stats = pd.DataFrame(index=range(len(emb_batch.obs.Level_5_transfered_label_unfiltered.unique())))
    # loop through levels, and store n cells, as well as 25th, 50th and 75th percentile uncertainty for every transfered label:
    for lev in range(1,6):
        lev_cts = emb_batch.obs[f"Level_{lev}_transfered_label_unfiltered"].unique()
        # group by transfered label
        g = emb_batch.obs.groupby(f"Level_{lev}_transfered_label_unfiltered")
        # count number of cells per label
        lev_cts_n = g.agg({f"Level_{lev}_transfered_label_unfiltered":"count"}).loc[lev_cts,:]
        # remove all labels with fewer than min_n_cells
        lev_cts_filt = lev_cts_n.index[lev_cts_n[f"Level_{lev}_transfered_label_unfiltered"]>=min_n_cells].tolist()
        # now store relevant stats
        ds_uncert_stats.loc[range(0,len(lev_cts_filt)),f"Level_{lev}_label"] = lev_cts_filt
        for perc in [25,50,75]:
            ds_uncert_stats.loc[range(0,len(lev_cts_filt)),f"Level_{lev}_uncert_{perc}"] = g[f'Level_{lev}_transfer_uncert'].apply(lambda x: np.percentile(x, perc)).loc[lev_cts_filt].values
    # remove empty rows
    ds_uncert_stats.dropna(axis=0,how="all",inplace=True)
    # round to three decimals:
    ds_uncert_stats = round(ds_uncert_stats,3)
    ds_uncert_stats.to_csv(f"{output_dir}/summary_stats_{batch}.csv")
Move figures folder to output directory.

!mv /fastgenomics/analysis/figures/* $output_dir/.
!rm -r /fastgenomics/analysis/figures
!touch /fastgenomics/analysis/$output_dir/.success
 "

Afterwards, I realised I could just prepare an AnnData object from scratch with the raw counts instead of preparing a Seurat object first and then converting it to AnnData. However, I still received errors in the analysis. Below are the analysis results.


In this notebook, we will guide you through how to map your data to the Human Lung Cell Atlas (core reference), perform label transfer, and more. For that purpose we use scArches, a method to map new single cell/single nucleus data to an existing reference (see also Lotfollahi et al., Nature Biotechnology 2021 https://doi.org/10.1038/s41587-021-01001-7).

Start by setting the directory that has the needed files (model, cell type names, reference embedding etc.) as downloaded from Zenodo @LISA ADD LINK!:

# folder and files
HLCA_folder = "HLCA_scArches"
output_dir = "results"
output_file = "output.h5ad"
# check attached datasets
import fgread
import os
# make sure exactly two datasets are attached
assert len(os.listdir("/fastgenomics/data/")) == 2,"ERROR: There is more than one query dataset attached to this analysis."
# find HLCA and query data folder
HLCA_emb_and_metadata = "/fastgenomics/data/dataset_0001/HLCA_emb_and_metadata.h5ad"
query_path = "/fastgenomics/data/dataset_0002"

if not os.path.exists(HLCA_emb_and_metadata):
    HLCA_emb_and_metadata = "/fastgenomics/data/dataset_0002/HLCA_emb_and_metadata.h5ad"
    query_path = "/fastgenomics/data/dataset_0001"
# make sure there is only one expression data file in query path
df = fgread.ds_info()
expr_data = df.expressionDataFileInfos[df.path==query_path]
assert len(expr_data.values[0]) == 1,"ERROR: There is more than one expression data file in the query data."
test_dataset = query_path + "/" + expr_data.values[0][0].get("name")
expr_data = fgread.ds_info().expressionDataFileInfos[df.path==query_path]
expr_data.values[0]
[{'name': 'counts.h5ad', 'size': 5573662528, 'state': 'Ready'}]
!rm -r /fastgenomics/analysis/$output_dir
!mkdir /fastgenomics/analysis/$output_dir
rm: cannot remove '/fastgenomics/analysis/results': No such file or directory
The required modules are provided with the dockerized software environment. Note that we use scArches version 0.3.5 with scvi-tools 0.8.1. For efficiency of knn-graph and umap calculation, we use scanpy==1.8.2, umap-learn==0.5.2, and pynndescent==0.5.5.

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import os
import scarches as sca
import scArches-based label transfer script:

import sys
sys.path.append(HLCA_folder)
import scarches_label_transfer
Set figure parameters (change for bigger/smaller plots):

sc.set_figure_params(figsize=(5,5))
We'll print versions of the imported modules here, for reproducibility purposes:

sc.logging.print_versions()
[module versions identical to the first run; session info output omitted]

import information about reference:
These files can be downloaded from Zenodo (see beginning of notebook)

# gene order for scArches model
reference_gene_order = pd.read_csv(f"{HLCA_folder}/HLCA_scarches_gene_order.csv")
# reference embedding, including cell/sample/subject metadata:
reference_embedding = sc.read_h5ad(HLCA_emb_and_metadata)
# path to scArches reference model
reference_model_path = HLCA_folder
import your data:
IMPORTANT:
Make sure that you have raw (i.e. un-normalized, integer) counts in your data (.X layer).

Here we use an example dataset, from the Delorey et al. preprint:  https://www.biorxiv.org/content/10.1101/2021.02.25.430130v1. We only use the fresh, single-cell sample from this dataset.

query_data_full = sc.read_h5ad(test_dataset)
def subset_and_pad_adata_object(adata, ref_genes_df, min_n_genes_included=1500):
    """Function to subset a dataset to the 2000 genes used for scArches mapping.
    Missing genes are added and set to 0 counts for all cells."""
    # test if adata.var.index has gene names or ensembl names:
    n_ids_detected = sum(adata.var.index.isin(ref_genes_df.gene_id))
    n_symbols_detected = sum(adata.var.index.isin(ref_genes_df.gene_symbol))
    if max(n_ids_detected, n_symbols_detected) < min_n_genes_included:
        # change column names to lower case
        adata.var.columns = adata.var.columns.str.lower()
        # check if gene names are in another column:
        if "gene_symbols" in adata.var.columns:
            adata.var.index = adata.var.gene_symbols
            n_symbols_detected = sum(adata.var.index.isin(ref_genes_df.gene_symbol))
        elif "gene_ids" in adata.var.columns:
            adata.var.index = adata.var.gene_ids
            n_ids_detected = sum(adata.var.index.isin(ref_genes_df.gene_id))
        # check if anything changed:
        if max(n_ids_detected, n_symbols_detected) < min_n_genes_included:    
            raise ValueError(f"We could detect only {max(n_ids_detected, n_symbols_detected)} genes of the 2000 that we need for the mapping! The minimum overlap is {min_n_genes_included}. Contact the HLCA team for questions. Exiting")
    else:
        if n_ids_detected >= n_symbols_detected:
            gene_type = "gene_id"
            print("Gene names detected: ensembl gene ids.")
            n_genes = n_ids_detected  
        else:
            gene_type = "gene_symbol"
            n_genes = n_symbols_detected
            print("Gene names detected: ensembl gene symbols.")
    genes = adata.var.index[adata.var.index.isin(ref_genes_df[gene_type])].tolist()
    # if not all genes are included, pad:
    if n_genes > 2000:
        raise ValueError("Your gene names appear not to be unique, something must be wrong. Exiting.")
    print(f"{n_genes} genes detected out of 2000 used for mapping.")
    # Subset adata object
    adata_sub = adata[:,genes].copy()
    # Pad object with 0 genes if needed
    if n_genes < 2000:
        diff = 2000 - n_genes
        print(f'Not all genes were recovered, filling in zeros for {diff} missing genes...')
        # Genes to pad with
        genes_to_add = set(ref_genes_df[gene_type].values).difference(set(adata_sub.var_names))
        df_padding = pd.DataFrame(data=np.zeros((adata_sub.shape[0],len(genes_to_add))), index=adata_sub.obs_names, columns=genes_to_add)
        adata_padding = sc.AnnData(df_padding)
        # Concatenate object
        adata_sub = anndata.concat([adata_sub, adata_padding], axis=1, join='outer', index_unique=None, merge='unique')
        # and order:
        adata_sub = adata_sub[:,ref_genes_df[gene_type]].copy()
    return adata_sub
query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9/2669176631.py in <module>
----> 1 query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)

/tmp/ipykernel_9/2768654640.py in subset_and_pad_adata_object(adata, ref_genes_df, min_n_genes_included)
     17         # check if anything changed:
     18         if max(n_ids_detected, n_symbols_detected) < min_n_genes_included:
---> 19             raise ValueError(f"We could detect only {max(n_ids_detected, n_symbols_detected)} genes of the 2000 that we need for the mapping! The minimum overlap is {min_n_genes_included}. Contact the HLCA team for questions. Exiting")
     20     else:
     21         if n_ids_detected >= n_symbols_detected:

ValueError: We could detect only 0 genes of the 2000 that we need for the mapping! The minimum overlap is 1500. Contact the HLCA team for questions. Exiting
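A quick way to diagnose a zero-overlap error like this (a sketch, not part of the notebook run) is to look at what the query actually uses as gene identifiers and compare it against the reference gene table; an AnnData built from scratch can easily end up with a plain integer index, or with gene names stored in a .var column rather than in the index:

```python
# what does the query use as var index, and which var columns exist?
print(query_data_full.var.index[:5].tolist())
print(query_data_full.var.columns.tolist())

# overlap with the HLCA reference gene order, by Ensembl ID and by gene symbol
print("IDs in common:    ", query_data_full.var.index.isin(reference_gene_order.gene_id).sum())
print("symbols in common:", query_data_full.var.index.isin(reference_gene_order.gene_symbol).sum())
```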
check gene order:

if (query_data.var.index == reference_gene_order.gene_symbol).all() or (
    query_data.var.index == reference_gene_order.gene_id
).all():
    print("Gene order is correct.")
else:
    print(
        "WARNING: your gene order does not match the order of the HLCA reference. Fix this before continuing!"
    )
Prepare query data for mapping:
Make sure that you have raw data in query_data.X. We will now copy that information to query_data.raw for scArches.

query_data.raw = query_data
raw = query_data.raw.to_adata()
raw.X = query_data.X
query_data.raw = raw
quick check if X and raw.X have integers (do a more systematic check if you have any doubts!):

query_data.raw.X[:10, :10].toarray()
query_data.X[:10, :10].toarray()
Set query_data.obs["scanvi_label"] to "unlabeled". Keep this code as is; it has to do with the way the reference model was changed.

query_data.obs["scanvi_label"] = "unlabeled"
Perform surgery on reference model and train on query dataset without cell type labels
Set scArches parameters:
Store the set of batches from your query data. This might just be a single batch.
In the extended HLCA, we have treated datasets as batches. If you expect large batch effects between subsets of your data, it is better to split your dataset into multiple datasets. For example, if you have data that is a mixture of 5' and 3' data, it would be better to consider those separate batches.

batch_variable = "dataset"  # the column name under which you stored your batch variable
query_batches = sorted(query_data.obs[batch_variable].unique())
print(query_batches)
These are the parameter settings we used for the extended HLCA, as recommended by the scArches developers:

surgery_epochs = 500
early_stopping_kwargs_surgery = {
    "early_stopping_metric": "elbo",
    "save_best_state_metric": "elbo",
    "on": "full_dataset",
    "patience": 10,
    "threshold": 0.001,
    "reduce_lr_on_plateau": True,
    "lr_patience": 8,
    "lr_factor": 0.1,
}
set the output directory. The output model will be stored here, under the name of the batch. It will over-write any existing model with the same name!

# output_dir = "testout"
Now perform scArches "surgery". The result will be stored in the output dir specified above. Each dataset/batch has its own "adapter" that will be trained separately, and will be stored in the output_dir under a folder named after the batch.
Note: if you use gene names rather than ensembl IDs in your query_data.var.index, you will get a warning that your genes (var_names) do not match the training genes. You can ignore this warning, as long as you have done the gene check in the beginning of the notebook.

for batch in query_batches: # this loop is only necessary if you have multiple batches, but will also work for a single batch.
    print("Batch:", batch)
    query_subadata = query_data[query_data.obs[batch_variable] == batch,:].copy()
    print("Shape:", query_subadata.shape)
    # load model and set relevant variables:
    model = sca.models.SCANVI.load_query_data(
        query_subadata,
        reference_model_path,
        freeze_dropout = True,
    )
    model._unlabeled_indices = np.arange(query_subadata.n_obs)
    model._labeled_indices = []
    # now train surgery model using reference model and target adata
    model.train(
        n_epochs_semisupervised=surgery_epochs,
        train_base_model=False,
        semisupervised_trainer_kwargs=dict(
            metrics_to_monitor=["accuracy", "elbo"], 
            weight_decay=0,
            early_stopping_kwargs=early_stopping_kwargs_surgery
        ),
        frequency=1
    )
    surgery_path = f"{output_dir}/{batch}"
    if not os.path.exists(surgery_path):
        os.makedirs(surgery_path)
    model.save(surgery_path, overwrite=True)
to reload specific models, use the following code (not necessary if you run this script in one go)

# surgery_path = f"{output_dir}/{batch}"
# model = sca.models.SCANVI.load_query_data(query_data, 'surgery_model', freeze_dropout=True) # this is before training
# surgery_path = f"{output_dir}/{batch}"
# model = sca.models.SCANVI.load(surgery_path,query_data) # if already trained
Get the latent representation of your query dataset
Here we will calculate the "latent representation", or "low-dimensional embedding" of your dataset. This embedding is in the same space as the HLCA embedding that you loaded in the beginning of the script. Hence, we can combine the two embeddings afterwards (HLCA + your new data), and do joint clustering, UMAP embedding, label transfer etc.!

Create an empty dataframe to store the embeddings of all batches:

emb_df = pd.DataFrame(index=query_data.obs.index,columns=range(0,reference_embedding.shape[1]))
Calculate embeddings, using the trained scArches model:

for batch in query_batches: # from small to large datasets
    print(f"Working on {batch}...")
    query_subadata = query_data[query_data.obs[batch_variable] == batch,:].copy()
    surgery_path = f"{output_dir}/{batch}"
    model = sca.models.SCANVI.load(surgery_path, query_subadata)
    query_subadata_latent = sc.AnnData(model.get_latent_representation(adata=query_subadata))
    # copy over .obs
    query_subadata_latent.obs = query_data.obs.loc[query_subadata.obs.index,:]
    query_subadata_latent.write(f"{output_dir}/{batch}/emb.h5ad")
    emb_df.loc[query_subadata.obs.index,:] = query_subadata_latent.X
Convert final embedding (of all your query batches) to an anndata object:

query_embedding = sc.AnnData(X=emb_df.values, obs=query_data.obs)
Combine the embedding of the HLCA and the query data
Add query vs. reference information:

query_embedding.obs['HLCA_or_query'] = "query"
reference_embedding.obs['HLCA_or_query'] = "HLCA"
We will now combine the two embeddings to enable joint clustering etc. If you expect non-unique barcodes (.obs index), set index_unique to e.g. "_" and batch_key to the obs column that you want to use as barcode suffix (e.g. "dataset").

# concatenate source and target data
combined_emb = reference_embedding.concatenate(query_embedding, index_unique=None) # index_unique="_", batch_key="dataset") # alternative
Note that if wanted, this embedding can be added to the full, joint HLCA + query data object (including gene counts). The full HLCA, including normalized counts is publicly available. For now, we will just work with the embedding, since that is all we need to perform joint plotting and label transfer.

Downstream analysis: plotting and label transfer
umap plotting:
We can now do joint downstream analysis, such as clustering, UMAP embedding, label transfer etc.

Note that the knngraph + umap calculation can take quite a while. See notes at the top of this notebook for speeding up the process.

sc.pp.neighbors(combined_emb, n_neighbors=30)
sc.tl.umap(combined_emb)
Now we can start plotting umaps; here is just one example:

sc.pl.umap(combined_emb, color="HLCA_or_query", save="_HLCA_and_query_cells_from_query.png")
Store result, if wanted:

# combined_emb.write(f"{output_dir}/{output_file}")
Any other graph-based or latent-space based downstream method can also be used, based on the embedding and the above-calculated graph. We will proceed to label transfer here.

Label transfer:
Next, we use a knn classifier to transfer the labels from the reference to the query. We do this for every level of the annotation (i.e. level 1-5).
Note that some cell types don't have annotations for higher levels, e.g. mast cells do not have level 4 or 5 annotations. For those cell types, we "propagate" to the higher levels, i.e. you will see "3_Mast cells" in level 4 and 5 annotations. (Most cell types don't have a level 5 annotation!) Therefore, all highest level annotations can be found under level 5.

Import the set of finest cell type labels, and their matching lower-level annotations (cell types are also ordered in a biologically sensible order in this table, you can use this order for downstream plotting etc. if wanted):

cts_ordered = pd.read_csv(f"{HLCA_folder}/HLCA_celltypes_ordered.csv",index_col=0)    
Now run the label transfer commands. Note that this might take quite a while if you have a large query dataset! For our small test dataset, it should not take long.

# run k-neighbors transformer
k_neighbors_transformer = scarches_label_transfer.weighted_knn_trainer(
    train_adata=reference_embedding,
    train_adata_emb="X", # location of our joint embedding
    n_neighbors=50,
    )    
# perform label transfer
labels, uncert = scarches_label_transfer.weighted_knn_transfer(
    query_adata=query_embedding,
    query_adata_emb="X", # location of our joint embedding
    label_keys="Level",
    knn_model=k_neighbors_transformer,
    ref_adata_obs = reference_embedding.obs.join(cts_ordered, on='ann_finest_level')
    )
With the commands above, we labeled every cell from the query. However, some cells might have high label transfer uncertainty. It is useful to set those to "unknown" instead of giving them a cell type label. This will help highlight cell types/states that are new (i.e. not present in the reference) and possibly interesting; they're worth taking a careful look at!

This is the uncertainty threshold we use in the paper, limiting the false positive rate to <0.5:

uncertainty_threshold = 0.2
We will now copy the unfiltered cell type labels ("Level_[level]_transfered_label_unfiltered"), the uncertainty scores ("Level_[level]_transfer_uncert") and the filtered cell type labels (i.e. including "Unknown", "Level_[level]_transfered_label") to our combined_emb object.

labels.rename(columns={f"Level_{lev}":f"Level_{lev}_transfered_label_unfiltered" for lev in range(1,6)},inplace=True)
uncert.rename(columns={f"Level_{lev}":f"Level_{lev}_transfer_uncert" for lev in range(1,6)},inplace=True)
combined_emb.obs = combined_emb.obs.join(labels)
combined_emb.obs = combined_emb.obs.join(uncert)
# convert to arrays instead of categoricals, and set "nan" to NaN 
for col in combined_emb.obs.columns:
    if col.endswith("_transfer_uncert"):
        combined_emb.obs[col] = list(np.array(combined_emb.obs[col]))
    elif col.endswith("_transfered_label_unfiltered"):
        filtered_colname = col.replace("_unfiltered","")
        matching_uncert_col = col.replace("transfered_label_unfiltered","transfer_uncert")

        # also create a filtered version, setting cells with too high 
        # uncertainty levels to "Unknown"
        combined_emb.obs[filtered_colname] = combined_emb.obs[col]
        combined_emb.obs[filtered_colname].loc[combined_emb.obs[matching_uncert_col]>uncertainty_threshold] = "Unknown"
        # convert to categorical:
        combined_emb.obs[col] = pd.Categorical(combined_emb.obs[col])
        combined_emb.obs[filtered_colname] = pd.Categorical(combined_emb.obs[filtered_colname])
        # then replace "nan" with NaN (that makes colors better in umap)
        combined_emb.obs[col].replace("nan",np.nan,inplace=True)
        combined_emb.obs[filtered_colname].replace("nan",np.nan,inplace=True)
Let's take a look at the percentage of cells set to "unknown" after our filtering:

print(f"Percentage of unknown per level, with uncertainty_threshold={uncertainty_threshold}:")
for level in range(1,6):
    print(f"Level {level}: {np.round(sum(combined_emb.obs[f'Level_{level}_transfered_label'] =='Unknown')/query_data.n_obs*100,2)}%")
Here we show label transfer uncertainties per level. Regions with high uncertainty can highlight interesting cell types/states not present in the reference. Note that uncertainties get higher the more detailed the annotation level:

sc.pl.umap(combined_emb,color=[f"Level_{lev}_transfer_uncert" for lev in range(1,6)],ncols=5, save="_HLCA_and_query_lab_transf_uncert.png")
Now let's take a look at the transfered labels, at every level. Note that the color for "Unknown" switches per plot.

For cellxgene visualization to work, we need to set our np.nans to "nan" strings, so we'll do that here:

for level in range(1,6):
    combined_emb.obs[f"Level_{level}_transfered_label"] = pd.Categorical(
        pd.Series(combined_emb.obs[f"Level_{level}_transfered_label"].tolist()).fillna("nan")
    )
    if f"Level_{level}_transfered_label_colors" in combined_emb.uns.keys():
        print("removing colors for level", level)
        del combined_emb.uns[f"Level_{level}_transfered_label_colors"]
Now plot:

sc.pl.umap(combined_emb,color=[f"Level_{lev}_transfered_label" for lev in range(1,6)],na_color="grey",ncols=1,size=0.5, save="_HLCA_and_query_transf_labels.png")
For your reference, these are the annotations of the reference atlas:

sc.pl.umap(combined_emb, color="ann_finest_level", save="_HLCA_and_query_ct_annotations.png")
combined_emb.write(f"{output_dir}/{output_file}")
Store uncertainty summary stats for sharing with HLCA (i.e. 25th, 50th and 75th uncertainty percentile for every transfered label at every level):

min_n_cells = 20
### loop through batches
for batch in query_batches:
    emb_batch = combined_emb[combined_emb.obs[batch_variable] == batch,:].copy()
    # initiate empty dataframe:
    ds_uncert_stats = pd.DataFrame(index=range(len(emb_batch.obs.Level_5_transfered_label_unfiltered.unique())))
    # loop through levels, and store n cells, as well as 25th, 50th and 75th percentile uncertainty for every transfered label:
    for lev in range(1,6):
        lev_cts = emb_batch.obs[f"Level_{lev}_transfered_label_unfiltered"].unique()
        # group by transfered label
        g = emb_batch.obs.groupby(f"Level_{lev}_transfered_label_unfiltered")
        # count number of cells per label
        lev_cts_n = g.agg({f"Level_{lev}_transfered_label_unfiltered":"count"}).loc[lev_cts,:]
        # remove all labels with fewer than min_n_cells
        lev_cts_filt = lev_cts_n.index[lev_cts_n[f"Level_{lev}_transfered_label_unfiltered"]>=min_n_cells].tolist()
        # now store relevant stats
        ds_uncert_stats.loc[range(0,len(lev_cts_filt)),f"Level_{lev}_label"] = lev_cts_filt
        for perc in [25,50,75]:
            ds_uncert_stats.loc[range(0,len(lev_cts_filt)),f"Level_{lev}_uncert_{perc}"] = g[f'Level_{lev}_transfer_uncert'].apply(lambda x: np.percentile(x, perc)).loc[lev_cts_filt].values
    # remove empty rows
    ds_uncert_stats.dropna(axis=0,how="all",inplace=True)
    # round to three decimals:
    ds_uncert_stats = round(ds_uncert_stats,3)
    ds_uncert_stats.to_csv(f"{output_dir}/summary_stats_{batch}.csv")
Move figures folder to output directory.

!mv /fastgenomics/analysis/figures/* $output_dir/.
!rm -r /fastgenomics/analysis/figures
!touch /fastgenomics/analysis/$output_dir/.success

Please let me know if there's anything else I can add. Thank you so much.

lisa commented 2 years ago

@lisa User Notification

Hello! You're receiving this notice because you've tagged me, @lisa, on GitHub, likely in error. Lisa is a common first name, and happens to be mine. I use the @lisa username here on GitHub. While being tagged in various repositories is an interesting way for me to discover new projects, it's likely not your intention. Did you mean to tag someone else? Did you mean to use @lisa instead, in a word as word usage?

Next Steps

I'm unsubscribing from this issue, pull request or project, so if you meant to tag someone else to notify them, I would recommend that. If you meant to use my user ID as a plain string, without having GitHub notify me, you can do so by wrapping it with backticks (the ` character), like so: `@lisa`.

LisaSikkema commented 2 years ago

Hi @FiaSB1 ,

Thanks for your message! The error ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions (1,) of parent. Value had shape (10305, 50) while it should have had (2000,). occurs in the concatenation step (---> 45 adata_sub = anndata.concat([adata_sub, adata_padding], axis=1, join='outer', index_unique=None, merge='unique')) and seems to be related to the shape of the PC matrix in your adata. Could you try removing your PC matrix (and all obsm matrices) by running del query_data_full.obsm after loading the data in the line query_data_full = sc.read_h5ad(test_dataset)?
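In code, that suggestion would look roughly like this (a sketch using the notebook's variable names; since the traceback reports the 'PCs' entry under .varm, clearing .varm as well is probably worth trying):

```python
import scanpy as sc

# reload the query data as in the notebook
query_data_full = sc.read_h5ad(test_dataset)

# drop matrices carried over from the Seurat conversion; the traceback points
# at a (10305, 50) 'PCs' entry, which lives in .varm
del query_data_full.obsm
del query_data_full.varm

# then rerun the subsetting/padding step
query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)
```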

Just to see if that solves the problem. If so, I'll update some code to prevent this error in the future.

GitHub tip: if you print python code, start it with a line of triple quotation marks + 'python': ```python and end it with another line of triple quotation marks: ``` It will then be rendered in a cleaner way.

let me know if the removal of your obsm solves the error!

FiaSB1 commented 2 years ago

Hi @LisaSikkema,

Thank you so much for your response!

Unfortunately, I still encountered an error. Below is the output:

Gene names detected: ensembl gene symbols.
1861 genes detected out of 2000 used for mapping.
Not all genes were recovered, filling in zeros for 139 missing genes...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9/2669176631.py in <module>
----> 1 query_data = subset_and_pad_adata_object(query_data_full, reference_gene_order)

/tmp/ipykernel_9/2768654640.py in subset_and_pad_adata_object(adata, ref_genes_df, min_n_genes_included)
     43         adata_padding = sc.AnnData(df_padding)
     44         # Concatenate object
---> 45         adata_sub = anndata.concat([adata_sub, adata_padding], axis=1, join='outer', index_unique=None, merge='unique')
     46         # and order:
     47         adata_sub = adata_sub[:,ref_genes_df[gene_type]].copy()

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    905             f"{alt_dim}p": alt_pairwise,
    906             "uns": uns,
--> 907             "raw": raw,
    908         }
    909     )

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    319                 varp=varp,
    320                 filename=filename,
--> 321                 filemode=filemode,
    322             )
    323 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    508         # TODO: Think about consequences of making obsm a group in hdf
    509         self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
--> 510         self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
    511 
    512         self._obsp = PairwiseArrays(self, 0, vals=convert_to_dict(obsp))

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
    230         self._data = dict()
    231         if vals is not None:
--> 232             self.update(vals)
    233 
    234 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/_collections_abc.py in update(*args, **kwds)
    839             if isinstance(other, Mapping):
    840                 for key in other:
--> 841                     self[key] = other[key]
    842             elif hasattr(other, "keys"):
    843                 for key in other.keys():

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    149 
    150     def __setitem__(self, key: str, value: V):
--> 151         value = self._validate_value(value, key)
    152         self._data[key] = value
    153 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    213                 f"value.index does not match parent’s axis {self.axes[0]} names"
    214             )
--> 215         return super()._validate_value(val, key)
    216 
    217 

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     51                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     52                 raise ValueError(
---> 53                     f"Value passed for key {key!r} is of incorrect shape. "
     54                     f"Values of {self.attrname} must match dimensions "
     55                     f"{self.axes} of parent. Value had shape {val.shape} while "

ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions (1,) of parent. Value had shape (10305, 50) while it should have had (2000,).

Below is the output of the AnnData object, to give you a better idea of the data:

AnnData object with n_obs × n_vars = 8444 × 45947
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'CellBarcode_Identity', 'nUMI', 'nGene', 'CellType_Category', 'Manuscript_Identity', 'Subclass_Cell_Identity', 'Disease_Identity', 'Subject_Identity', 'Library_Identity', 'percent.mt', 'RNA_snn_res.0.3', 'seurat_clusters'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    varm: 'PCs'

Could it be that I missed something?

Thank you!

LuckyMD commented 2 years ago

Hi @FiaSB1,

You still have information from your PCA run in adata.varm (that's where the loadings of features onto the PCs are stored). So if you delete the varm information with del adata.varm, things should work.
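A minimal sketch of that fix, again using the notebook's variable name (query_data_full):

# the PCA loadings are stored under .varm['PCs']; drop them, together with
# .obsm, so the padded object can be concatenated without shape conflicts
del query_data_full.varm
del query_data_full.obsm  # if not removed already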

LisaSikkema commented 2 years ago

Yes, true, I forgot about the varm, thanks Malte. @FiaSB1, let us know if removing that solves the problem!

FiaSB1 commented 2 years ago

Hi @LuckyMD and @LisaSikkema,

Thank you so much for your responses! Unfortunately, I still received an error after removing the varm. Below is the error that I received. Could it be the format of the data that's causing issues?

query_data.raw = query_data
raw = query_data.raw.to_adata()
raw.X = query_data.X
query_data.raw = raw
quick check if X and raw.X have integers (do a more systematic check if you have any doubts!):

query_data.raw.X[:10, :10].toarray()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_10/2602429211.py in <module>
----> 1 query_data.raw.X[:10, :10].toarray()

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
query_data.X[:10, :10].toarray()
Set query_data.obs["scanvi_label"] to "unlabeled". Keep this code as is; it has to do with the way the reference model was changed.

query_data.obs["scanvi_label"] = "unlabeled"

Thank you so much!

LisaSikkema commented 2 years ago

Okay, the first error is resolved at least! This one is related to your count matrix not being sparse; you can make it sparse by running from scipy import sparse followed by query_data.X = sparse.csr_matrix(query_data.X).
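Spelled out as a small sketch (using the notebook's query_data object):

from scipy import sparse

# convert the dense count matrix to CSR sparse format, which the rest of
# the notebook (including the .toarray() check above) expects
if not sparse.issparse(query_data.X):
    query_data.X = sparse.csr_matrix(query_data.X)

# quick sanity check: this should now work without an AttributeError
query_data.X[:10, :10].toarray()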

LuckyMD commented 2 years ago

Just to keep note of assumptions that we can add to the description in the notebook in future:

  1. Sparse expression matrix
  2. No reduced dimensions stored in .obsm or .varm (note: these hinder AnnData concatenation)

FiaSB1 commented 2 years ago

Hi @LisaSikkema and @LuckyMD,

Thank you so much for the suggestion! That perfectly solves that error! Unfortunately, the analysis still returned another error. My apologies, I have no idea what it means.

KeyError                                  Traceback (most recent call last)
/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3080             try:
-> 3081                 return self._engine.get_loc(casted_key)
   3082             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'dataset'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_10/772632715.py in <module>
      1 batch_variable = "dataset"  # the column name under which you stored your batch variable
----> 2 query_batches = sorted(query_data.obs[batch_variable].unique())
      3 print(query_batches)

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

/opt/conda/envs/scarches_0_3_5/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3081                 return self._engine.get_loc(casted_key)
   3082             except KeyError as err:
-> 3083                 raise KeyError(key) from err
   3084 
   3085         if tolerance is not None:

KeyError: 'dataset'

Can I get a hand with this too, please? Is it a naming issue, as in the column is supposed to be named 'dataset'?

Thank you so much for all your help. I really appreciate it.

LuckyMD commented 2 years ago

Hi @FiaSB1,

If you only have one dataset, just add adata.obs['dataset'] = "My dataset" to your AnnData object. That will address this error. If your query object has multiple datasets/batches, then please add a column here that specifies which cells belong to which dataset or batch.
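
A short sketch of both cases; the Subject_Identity column used in the multi-batch example is taken from the AnnData printout above and may or may not be the right batch variable for this data:

# single dataset/batch: a constant label is enough
query_data.obs["dataset"] = "my_dataset"

# multiple batches: derive the column from an existing per-cell annotation,
# e.g. the subject ID (adjust to whatever defines a batch in your data)
# query_data.obs["dataset"] = query_data.obs["Subject_Identity"].astype(str)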

LisaSikkema commented 2 years ago

@LuckyMD I will just adapt the scripts so that these things are done automatically. @FiaSB1 this is part of the prerequisites for your data to run this code; I had improved/clarified the description on another page but forgot to do it on GitHub. It's now fixed, you can find it here in the readme file. You'll need to specify the batches in the data you're working with under adata.obs["dataset"].

Thanks for going through all this, I'll make sure that the next person won't encounter these problems!

LuckyMD commented 2 years ago

@LisaSikkema I should have an updated function to make this work too. I just need to make sure that the current HLCA model can be loaded as well.

LisaSikkema commented 2 years ago

A function that removes obsm and varm and makes adata.X sparse? That should be only three lines of code, right?

LuckyMD commented 2 years ago

I guess I have a bit more than that, including specifying where the counts data is stored, setting cell type key, batch key, and unlabeled category... but I am using updated scvi-tools (which has prep functions that take care of some of this stuff like padding 0s etc)... so not sure if this is what we want actually.

LisaSikkema commented 2 years ago

Yeah then probably not ideal for this specific setting

FiaSB1 commented 2 years ago

Hi @LuckyMD @LisaSikkema,

Thank you so much for your response! It seems like the website is currently down, but I will get back to you with my results as soon as possible. Crossing my fingers this time the mapping runs smoothly!

Thank you!

FiaSB1 commented 2 years ago

Hi @LuckyMD @LisaSikkema,

Hope you are well! Sorry I took awhile to get back to you.

The previous error was thankfully resolved! Although I then realised that my data was not raw counts, so I had to make several changes.

After using just the raw counts of one sample, the analysis succeeded!! Thank you so much for all your help!

I then tried to map the entire dataset, which is 3 GB in size. Unfortunately, I received this error:

INFO:jupyterfg.execute:Stripping output from notebook analysis/LCA_scArches_mapping_new_data_to_hlca.ipynb.
INFO:jupyterfg.execute:Executing notebook analysis/LCA_scArches_mapping_new_data_to_hlca.ipynb.
ERROR:traitlets:Kernel died while waiting for execute reply.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/nbclient/client.py", line 627, in _async_poll_for_reply
    msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))
  File "/opt/conda/lib/python3.9/site-packages/nbclient/util.py", line 89, in ensure_async
    result = await obj
  File "/opt/conda/lib/python3.9/site-packages/jupyter_client/channels.py", line 224, in get_msg
    ready = await self.socket.poll(timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/nbclient/client.py", line 846, in async_execute_cell
    exec_reply = await self.task_poll_for_reply
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.9/site-packages/jupyterfg/__main__.py", line 43, in <module>
    main(args.notebook, args.cell_timeout)
  File "/opt/conda/lib/python3.9/site-packages/jupyterfg/__main__.py", line 38, in main
    execute_and_save(nb_file, cell_timeout=cell_timeout)
  File "/opt/conda/lib/python3.9/site-packages/jupyterfg/execute.py", line 37, in execute_and_save
    ep.preprocess(nb, res)
  File "/opt/conda/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 84, in preprocess
    self.preprocess_cell(cell, resources, index)
  File "/opt/conda/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 105, in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
  File "/opt/conda/lib/python3.9/site-packages/nbclient/util.py", line 78, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/nbclient/util.py", line 57, in just_run
    return loop.run_until_complete(coro)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/nbclient/client.py", line 850, in async_execute_cell
    raise DeadKernelError("Kernel died")
nbclient.exceptions.DeadKernelError: Kernel died

Do you reckon this is because the file is still too big? The adata.X is a sparse matrix with float64 dtype. Would converting it to float32 be a potential solution?

Thank you so much once again!

LuckyMD commented 2 years ago

Kernel errors are really difficult to address. It could be a memory limitation of your system. I'm actually not sure whether you can reduce to float32; I haven't tried that yet.

Are you planning to correct every sample separately, or do you only expect a batch effect at the dataset level? In the former case, you could map every sample separately as well... it's the same as combining all the samples and then using the sample as the batch (encoded in adata.obs['dataset']).
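
If memory is indeed the bottleneck, one thing to try (untested in this workflow, as noted above) is down-casting the count matrices to float32, which roughly halves their memory footprint:

import numpy as np

# cast the sparse count matrix (and .raw, if set) from float64 to float32;
# raw counts are small integers, so no precision is lost
query_data.X = query_data.X.astype(np.float32)
if query_data.raw is not None:
    raw = query_data.raw.to_adata()
    raw.X = raw.X.astype(np.float32)
    query_data.raw = raw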

LisaSikkema commented 2 years ago

Yeah, I'm also not sure where this comes from; in which part of the notebook did it happen?

I also noticed you're using Python 3.9, which is not the recommended Python version for this version of scArches. You can try installing the conda environment that I prepared in the GitHub repo, although I am not sure whether that's in any way related to the error you're observing here; probably not...

Also, are you running this with a GPU? That's also recommended: on CPU it will be much slower, and it might also run into memory issues sooner (although, again, I'm not sure).
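
A quick way to check whether a GPU is visible to the PyTorch backend that scvi-tools/scArches use for training:

import torch

# True means training can run on the GPU; False means it will fall back to CPU
print("CUDA available:", torch.cuda.is_available())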

FiaSB1 commented 2 years ago

Hi @LuckyMD,

I was planning to try both. But seeing as mapping one sample was successful, and it would be the same if all the samples were combined, I will probably just use the results from the one sample.

@LisaSikkema thank you for the suggestions! I will try to use a different version of Python to see if that works. I have been using CPU; I'll try to get access to a GPU and run it with that.

Thank you so much for all the help! In any case, I'm grateful that we got to resolve the errors and successfully analyse the data!

LisaSikkema commented 1 year ago

Better late than never... the above-mentioned issues related to:

  1. setting the "dataset" variable
  2. obsm and varm dimensions causing problems with anndata concatenation
  3. .X not being sparse

should now be fixed. Closing this issue.
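
For future readers, a hypothetical helper (not part of the repository; the function name and batch_col argument are made up for illustration) that collects the three fixes from this thread in one place:

from scipy import sparse


def prepare_query_for_hlca_mapping(adata, batch_col=None):
    # 1. batch variable expected by the mapping notebook
    if batch_col is None:
        adata.obs["dataset"] = "single_batch"
    else:
        adata.obs["dataset"] = adata.obs[batch_col].astype(str)
    # 2. stored embeddings/loadings break the gene padding/concatenation step
    del adata.obsm
    del adata.varm
    # 3. the notebook assumes a sparse count matrix
    if not sparse.issparse(adata.X):
        adata.X = sparse.csr_matrix(adata.X)
    return adata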