Hi @partrita,
This seems right, actually. The discrepancy arises because you requested quantification against the unfiltered permit list (--unfiltered-pl), whereas your Cell Ranger result is likely the output after cell filtering. You should continue with your raw count matrix (102935 × 32285) and apply cell filtering to remove likely spurious or low-quality cells. Perhaps @DongzeHE can point you to the recommended filtering strategy within the scanpy ecosystem.
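To illustrate why the raw and filtered barcode counts differ so much, here is a minimal sketch in plain Python. The barcodes, counts, and the `min_umis` cutoff are all made up for illustration, and a fixed threshold is only a crude stand-in for real cell calling; it is not what Cell Ranger or emptyDrops actually does.

```python
# Hypothetical barcode -> total UMI count table; all values are made up.
umi_totals = {
    "AAAC": 5200, "AAAG": 4100, "AACT": 12, "AAGG": 3,
    "ACCA": 3900, "ACGT": 7, "AGTC": 1, "ATTG": 25,
}


def filter_barcodes(totals, min_umis):
    """Keep barcodes whose total UMI count reaches min_umis."""
    return sorted(bc for bc, n in totals.items() if n >= min_umis)


min_umis = 500  # assumed cutoff for this toy data, not a recommended value
kept = filter_barcodes(umi_totals, min_umis)
print(len(umi_totals), "barcodes before filtering,", len(kept), "after")
```

Most barcodes in an unfiltered permit list carry only a handful of UMIs, so the number of observations drops sharply once any reasonable filter is applied.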
Best, Rob
Hi @partrita,
If you have experience with rpy2, I would suggest calling the emptyDrops function from Python directly. Alternatively, you might find the dropkick Python package useful.
Below is example code for calling the emptyDrops function from an IPython session or a Jupyter notebook.
First, we load the matrix in Python:
# can be installed via conda from the Bioconda channel
import pyroe
import os
pyroe.__version__ # should be >= 0.8.1
# obtain env variable
WORK_DIR = os.getenv('FASTQ_DIR')
IDX_DIR = os.getenv('IDX_DIR')
# define count matrix path
quant_dir = os.path.join(WORK_DIR, 'simpleaf_output', 'af_quant')
id2name_file = os.path.join(IDX_DIR, 'ref', 'gene_id_to_name.tsv')
# define the output format for allocating the unspliced, spliced, and ambiguous counts.
custom_format = {'X' : ['S','A'],
                 'unspliced' : ['U'],
                 'all' : ['U','S','A'],}
# load the unfiltered (raw) count matrix
adata_raw = pyroe.load_fry(quant_dir, output_format=custom_format)
# read gene name to id mapping generated by `simpleaf index`
with open(id2name_file, "r") as f:
    id2name = dict(map(str.split, f))
# convert Ensembl ID to gene name and build adata_raw.var
adata_raw.var['gene_ids'] = adata_raw.var_names
adata_raw.var['feature_types'] = 'Gene Expression'
adata_raw.var['genome'] = 'GRCh38'
adata_raw.var_names = [id2name[gene_id] for gene_id in adata_raw.var_names]
total_count = adata_raw.layers["all"].T.tocoo()
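In case the last steps are unclear, here is a small sketch in plain Python of what they amount to. Each entry of the output format sums the listed U/S/A layers, and the final transpose turns the cells × genes matrix into genes × cells, which is the orientation emptyDrops expects (genes as rows, barcodes as columns). The toy counts and the helper names `combine` and `transposed_triplets` are made up for illustration; the triplets use 1-based indices because that is what R's sparseMatrix takes.

```python
# Toy cells x genes count layers (2 cells, 3 genes); all values are made up.
layers = {
    "U": [[0, 1, 0], [2, 0, 0]],
    "S": [[3, 0, 0], [0, 0, 1]],
    "A": [[1, 0, 0], [0, 0, 0]],
}


def combine(layers, keys):
    """Element-wise sum of the requested layers (hypothetical helper)."""
    n_cells = len(layers[keys[0]])
    n_genes = len(layers[keys[0]][0])
    return [[sum(layers[k][c][g] for k in keys) for g in range(n_genes)]
            for c in range(n_cells)]


def transposed_triplets(matrix):
    """Return 1-based (gene, cell, count) triplets for the non-zero entries."""
    return [(g + 1, c + 1, v)
            for c, row in enumerate(matrix)
            for g, v in enumerate(row) if v != 0]


x_counts = combine(layers, ["S", "A"])          # plays the role of 'X'
all_counts = combine(layers, ["U", "S", "A"])   # plays the role of 'all'
triplets = transposed_triplets(all_counts)      # genes x cells, 1-based
print(triplets)
```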
Second, we call the emptyDrops function and filter the count matrix:
import numpy as np
# conda install -c conda-forge rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
# build an R sparse matrix for the raw count matrix
# conda install r-matrix
r_Matrix = importr("Matrix")
# conda install bioconductor-dropletutils
r_DropletUtils = importr("DropletUtils")
m = r_Matrix.sparseMatrix(
    i=ro.IntVector(total_count.row + 1),
    j=ro.IntVector(total_count.col + 1),
    x=ro.FloatVector(total_count.data),
    dims=ro.IntVector(total_count.shape))
# run emptyDrops and filter cells based on FDR
fdr_thresh = 0.01
out = r_DropletUtils.emptyDrops(m)
FDR = list(out.slots['listData'].rx2("FDR"))
is_cell = [False if np.isnan(v) else v <= fdr_thresh for v in FDR]
# Filter raw count matrix
adata = adata_raw[is_cell,:]
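The masking step above can be tried on its own without R. The sketch below uses made-up FDR values and barcode names and a hypothetical helper `call_cells`; it assumes, as the code above does, that barcodes emptyDrops did not test (for example those with very low total counts) arrive in Python as NaN and should be treated as non-cells.

```python
import math


def call_cells(fdr_values, fdr_thresh=0.01):
    """Return a boolean mask: True where FDR is defined and <= fdr_thresh."""
    return [(not math.isnan(v)) and v <= fdr_thresh for v in fdr_values]


# Made-up FDR values; NaN stands for barcodes that were not tested.
fdr = [0.0001, float("nan"), 0.5, 0.01, float("nan"), 0.02]
barcodes = ["bc1", "bc2", "bc3", "bc4", "bc5", "bc6"]

is_cell = call_cells(fdr)
kept = [bc for bc, keep in zip(barcodes, is_cell) if keep]
print(sum(is_cell), "of", len(barcodes), "barcodes called as cells:", kept)
```

With real data, `sum(is_cell)` is the number to compare against the Cell Ranger cell count. The two are not expected to match exactly, since the cell-calling procedures differ.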
Please let me know if you need help running the above code. Thanks.
Best, Dongze
Many thanks to @rob-p and @DongzeHE. I've got it now.
There are too many n_obs, and I don't know why. It should be 1311, as in the Cell Ranger output.
Please help me out.