Too many obs - Githubissues

partrita commented 9 months ago

❯ tree . -L 1
.d
├── neuron_1k_v3_fastqs
├── pbmc_1k_v3_fass.tar
├── refdata-gex-GRCh38-nfo.refdata-gex-mm10-2020-As_all.h5ad

export ALEVIN_FRY_HOME="$PWD"
export REF_DIR="$ALEVIN_FRY_HOME/refdata-gex-mm10-2020-A"
export IDX_DIR="$ALEVIN_FRY_HOME/mouse-2020-A_splici"

simpleaf set-paths
ulimit -n 2048

gunzip -c neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R2_001.fastq.gz | head | sed -n '2p' | tr -d '\n' | wc -m

output

$ simpleaf index --output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf
--threads 30 \
--rlen 91 --use-piscem

output
```bash
2023-09-26T04:49:42.403568Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2023-09-26T04:49:46.965573Z  INFO grangers::reader::gtf: Finished parsing the input file. Found 5 comments and 1780455 records.
2023-09-26T04:49:47.661445Z  INFO roers: Built the Grangers object for 1780455 records
2023-09-26T04:49:47.931838Z  INFO roers: Proceed 805200 exon records from 118153 transcripts
2023-09-26T04:49:54.522440Z  INFO roers: Processing 687047 intronic records
2023-09-26T04:50:09.572929Z  INFO roers: Done!
2023-09-26T04:50:09.582414Z  INFO simpleaf::simpleaf_commands::indexing: piscem build cmd : /home/fkt/mambaforge/envs/af/bin/piscem build -k 31 -m 19 -o /home/fkt/Downloads/cellranger-7.2.0/data/mouse-2020-A_splici/index/piscem_idx -s /home/fkt/Downloads/cellranger-7.2.0/data/mouse-2020-A_splici/ref/roers_ref.fa --threads 16
2023-09-26T04:55:56.391304Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 
# Simpleaf quant
0)

export FASTQ_DIR="$ALEVIN_FRY_HOME/neuron_1k_v3_fastq"

# Obtain and sort filenames
reads1="$(find -L ${FASTQ_DIR} -name "*_R1_*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*_R2_*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"

simpleaf quant --reads1 $reads1 --reads2 $reads2 \
--threads 30 --index $IDX_DIR/index --chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl --t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output $FASTQ_DIR/simpleaf_output

2023-09-26T04:59:24.169800Z  INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /home/fkt/mambaforge/envs/af/bin/piscem map-sc --index /home/fkt/Downloads/cellranger-7.2.0/data/mouse-2020-A_splici/index/piscem_idx --threads 30 -o /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_map -1 /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R1_001.fastq.gz,/home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R1_001.fastq.gz -2 /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R2_001.fastq.gz,/home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R2_001.fastq.gz --geometry chromium_v3
2023-09-26T05:00:09.415913Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-09-26T05:00:09.415938Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /home/fkt/mambaforge/envs/af/bin/alevin-fry generate-permit-list -i /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_map -d fw --unfiltered-pl /home/fkt/Downloads/cellranger-7.2.0/data/plist/10x_v3_permit.txt --min-reads 10 -o /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_quant
2023-09-26T05:00:13.918771Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-09-26T05:00:13.918791Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry collate cmd : /home/fkt/mambaforge/envs/af/bin/alevin-fry collate -i /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_quant -r /home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_map -t 30
2023-09-26T05:00:15.526893Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-09-26T05:00:15.526916Z  INFO simpleaf::simpleaf_commands::quant: cmd : "/home/fkt/mambaforge/envs/af/bin/alevin-fry" "quant" "-i" "/home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_quant" "-o" "/home/fkt/Downloads/cellranger-7.2.0/data/neuron_1k_v3_fastqs/simpleaf_output/af_quant" "-t" "30" "-m" "/home/fkt/Downloads/cellranger-7.2.0/data/mouse-2020-A_splici/index/t2g_3col.tsv" "-r" "cr-like"
2023-09-26T05:00:18.556811Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)

# Run in python
from pyroe import load_fry
adata = load_fry("neuron_1k_v3_fastqs/simpleaf_output/af_quant", output_format = {'X' : ['S', 'A'],'unspliced' : ['U']})
adata

output

AnnData object with n_obs × n_vars = 102935 × 32285
    obs: 'barcodes'
    layers: 'spliced', 'unspliced'

There is too many n_obs, and I don't know why. It should be 1311 like below(cell ranger output).

AnnData object with n_obs × n_vars = 1311 × 32285

please, help me out.

rob-p commented 9 months ago

Hi @partrita,

This seems right, actually. The discrepancy is that you are requesting to quantify according to the unfiltered permit list --unfiltered-pl, whereas your Cell Ranger result is likely the result after cell filtering. You should continue with your raw count matrix (102935 × 32285) and apply a cell filtering to filter out likely spurious or low-quality cells. Perhaps @DongzeHE can point you at the recommended filtering strategy within the scanpy ecosystem.

Best, Rob

DongzeHE commented 9 months ago

Hi @partrita,

If you have experience in rpy2, I would suggest you call the emptyDrops function from python directly. Alternatively, you might find the driokick Python package useful.

Following, I will provide the example code of calling the emptyDrops function in an ipython instance or a jupyter notebook:

First, we load the matrix in python

# can be installed via conda from the Bioconda channel
import pyroe
import os
pyroe.__version__ # should >= 0.8.1

# obtain env variable
WORK_DIR = os.getenv('FASTQ_DIR')
IDX_DIR = os.getenv('IDX_DIR')

# define count matrix path
quant_dir = os.path.join(WORK_DIR, 'simpleaf_output', 'af_quant')
id2name_file = os.path.join(IDX_DIR, 'ref', 'gene_id_to_name.tsv')

# define the output format for allocating the unspliced, spliced, and ambiguous counts. 
custom_format = {'X' : ['S','A'],
                  'unspliced' : ['U'],
                  'all' : ['U','S','A'],}

# load the unfiltered (raw) count matrix
adata_raw = pyroe.load_fry(quant_dir, output_format=custom_format)

# read gene name to id mapping generated by `simpleaf index`
id2name = dict(map(str.split, open(id2name_file,"r")))

# convert Ensembl ID to gene name and build adata_raw.var 
adata_raw.var['gene_ids'] = adata_raw.var_names
adata_raw.var['feature_types'] = 'Gene Expression'
adata_raw.var['genome'] = 'GRCh38'
adata_raw.var_names = [id2name[id] for id in adata_raw.var_names]

total_count = adata_raw.layers["all"].T.tocoo()

Second, we call the emptyDrops function and filter the count matrix

import numpy as np
# conda install -c conda-forge rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

#  build an R sparse matrix for the raw count matrix
# conda install r-matrix
r_Matrix = importr("Matrix")
# conda install bioconductor-dropletutils
r_DropletUtils = importr("DropletUtils")
m = r_Matrix.sparseMatrix(
    i=ro.IntVector(total_count.row + 1),
    j=ro.IntVector(total_count.col + 1),
    x=ro.FloatVector(total_count.data),
    dims=ro.IntVector(total_count.shape))

# run emptyDrops and filter cells based on FDR
fdr_thresh = 0.01
out = r_DropletUtils.emptyDrops(m)
FDR = list(out.slots['listData'].rx2("FDR"))
is_cell = [False if np.isnan(v) else v <= fdr_thresh for v in FDR]

# Filter raw count matrix
adata = adata_raw[is_cell,:]

Please let me know if you need help running the above code. Thanks.

Best, Dongze

partrita commented 9 months ago

Many thanks to @rob-p @DongzeHE . Now I got it.

COMBINE-lab / simpleaf

Too many obs #106