MGI-tech-bioinformatics / DNBelab_C_Series_HT_scRNA-analysis-software

An open source and flexible pipeline to analyze high-throughput DNBelab C Series single-cell RNA datasets
MIT License
72 stars 24 forks

What is `output/attachment/RNAvelocity_matrix/spanning.mtx.gz` ? #129

Closed h4rvey-g closed 1 day ago

h4rvey-g commented 5 days ago

Hi! I read the function in https://github.com/MGI-tech-bioinformatics/DNBelab_C_Series_HT_scRNA-analysis-software/issues/67#issuecomment-2060288401 but didn't see that you are using spanning.mtx.gz. What is this file? Additionally, if I want to load the files generated by the dnbc workflow into Python and use scVelo for downstream analysis, which files should I choose to load to construct the anndata object? Thanks for your help.

h4rvey-g commented 5 days ago

Also, I'm getting a low spliced/unspliced ratio for every sample:

import pandas as pd
import scanpy as sc
import scvelo as scv  # needed for scv.pl.proportions below

adata = sc.read_mtx("data/103.self_workflow/T1/output/filter_matrix/matrix.mtx.gz")
adata_bc = pd.read_csv(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/barcodes.tsv.gz",
    header=None,
    sep="\t",
)
adata_features = pd.read_csv(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/features.tsv.gz",
    header=None,
    sep="\t",
)
# Transpose so that rows are cells and columns are genes
adata = adata.T
# Assign by position; a raw DataFrame/Series would be aligned on index instead
adata.obs["cell_id"] = adata_bc[0].to_numpy()
adata.var["gene_name"] = adata_features[0].tolist()
adata.var.index = adata.var["gene_name"]
adata_spliced = sc.read_mtx(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/spliced.mtx.gz"
)
adata_spliced = adata_spliced.T
adata_unspliced = sc.read_mtx(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/unspliced.mtx.gz"
)
adata_unspliced = adata_unspliced.T
# Combine the spliced and unspliced data (layers must have the same shape as adata.X)
adata.layers["spliced"] = adata_spliced.X
adata.layers["unspliced"] = adata_unspliced.X
scv.pl.proportions(adata)

[image]

Here's the workflow stats:

[image]

Any insights on this? Thank you.
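
As a rough numeric cross-check of the plot, the overall spliced and unspliced fractions can be computed directly from the two count matrices. This is a minimal sketch under stated assumptions: `layer_fractions` is a hypothetical helper (not part of scVelo or the dnbc workflow), the toy matrices are made-up values, and the result is a global fraction over all counts, which may differ from how `scv.pl.proportions` aggregates per cell.

```python
import numpy as np


def layer_fractions(spliced, unspliced):
    """Return (spliced_fraction, unspliced_fraction) over all counts.

    Hypothetical helper: accepts any objects with a .sum() method,
    such as NumPy arrays or SciPy sparse matrices.
    """
    s = float(spliced.sum())
    u = float(unspliced.sum())
    total = s + u
    if total == 0:
        raise ValueError("both layers are empty")
    return s / total, u / total


# Toy cells x genes matrices (made-up values, not real data)
spliced = np.array([[1, 0, 2], [0, 1, 0]])
unspliced = np.array([[3, 1, 4], [2, 0, 2]])

s_frac, u_frac = layer_fractions(spliced, unspliced)
print(f"spliced: {s_frac:.2f}, unspliced: {u_frac:.2f}")
```

With a real object, the same helper could be called as `layer_fractions(adata.layers["spliced"], adata.layers["unspliced"])`.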

lishuangshuang0616 commented 3 days ago

"Spanning" refers to spanning intron-exon junctions. Due to the current annotation logic in the software, if 50% of the read is mapped to an exon, it is considered exon, and the rest is considered intron if mapped to the gene. Therefore, this region currently has no data in the software version. Reads mapped to exonic regions are considered spliced, while those mapped to intronic or spanning regions are considered unspliced. Since your sample is nuclear data, a higher proportion of unspliced reads is quite normal. I do not have experience with using anndata to analyze velocyto, but I will look into this issue .

h4rvey-g commented 1 day ago

Got it. Thank you.