What is `output/attachment/RNAvelocity_matrix/spanning.mtx.gz`?

h4rvey-g commented 5 days ago

Hi! I read the function in https://github.com/MGI-tech-bioinformatics/DNBelab_C_Series_HT_scRNA-analysis-software/issues/67#issuecomment-2060288401 but didn't see `spanning.mtx.gz` being used there. What is this file? Additionally, if I want to load the files generated by the dnbc workflow into Python and use scVelo for downstream analysis, which files should I load to construct the anndata object? Thanks for your help.

h4rvey-g commented 5 days ago

Also, I'm getting a low spliced/unspliced ratio for every sample:

import pandas as pd
import scanpy as sc
import scvelo as scv  # needed for scv.pl.proportions below

# Filtered count matrix; stored as genes x cells, transposed further down.
adata = sc.read_mtx("data/103.self_workflow/T1/output/filter_matrix/matrix.mtx.gz")
adata_bc = pd.read_csv(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/barcodes.tsv.gz",
    header=None,
)
# Tab-separated file; column 0 is used as the gene label.
adata_features = pd.read_csv(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/features.tsv.gz",
    header=None,
    sep="\t",
)
adata = adata.T  # cells x genes
# Assign by position; a DataFrame here would be aligned on index and yield NaN.
adata.obs["cell_id"] = adata_bc[0].to_numpy()
adata.var["gene_name"] = adata_features[0].tolist()
adata.var.index = adata.var["gene_name"]
adata_spliced = sc.read_mtx(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/spliced.mtx.gz"
)
adata_spliced = adata_spliced.T
adata_unspliced = sc.read_mtx(
    "data/103.self_workflow/T1/output/attachment/RNAvelocity_matrix/unspliced.mtx.gz"
)
adata_unspliced = adata_unspliced.T
# Combine the spliced and unspliced data.
# Assumes both layers have the same shape and cell/gene order as adata.X.
adata.layers["spliced"] = adata_spliced.X
adata.layers["unspliced"] = adata_unspliced.X
scv.pl.proportions(adata)

Here are the workflow stats. Any insights on this? Thank you.
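
For a numeric check alongside the plot, the overall spliced and unspliced fractions can be computed directly from the two count matrices. This is only a minimal sketch: `layer_fractions` is a hypothetical helper, and the small arrays are toy data standing in for `adata.layers["spliced"]` and `adata.layers["unspliced"]`, not real output from the workflow.

```python
import numpy as np


def layer_fractions(spliced, unspliced):
    """Return (spliced_fraction, unspliced_fraction) of the total counts.

    Works on anything exposing .sum(), e.g. NumPy arrays or SciPy sparse matrices.
    """
    s = float(spliced.sum())
    u = float(unspliced.sum())
    total = s + u
    if total == 0:
        return 0.0, 0.0  # no counts at all; avoid dividing by zero
    return s / total, u / total


# Toy data (2 cells x 3 genes), invented for illustration only.
toy_spliced = np.array([[1, 0, 2], [0, 1, 0]])
toy_unspliced = np.array([[3, 1, 0], [2, 0, 6]])

frac_spliced, frac_unspliced = layer_fractions(toy_spliced, toy_unspliced)
print(f"spliced: {frac_spliced:.2f}, unspliced: {frac_unspliced:.2f}")
```

With real data, the same helper could be called on the two layers of the anndata object to compare samples.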

lishuangshuang0616 commented 3 days ago

"Spanning" refers to reads that span intron-exon junctions. Under the software's current annotation logic, a read is counted as exonic if 50% of it maps to an exon; any other read that maps to the gene is counted as intronic. As a result, no reads end up in the spanning category, so `spanning.mtx.gz` contains no data in the current software version. Reads mapped to exonic regions are considered spliced, while those mapped to intronic or spanning regions are considered unspliced. Since your sample is nuclear data, a higher proportion of unspliced reads is quite normal. I do not have experience analyzing velocyto output with anndata, but I will look into this issue.
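
The rule described above can be sketched roughly as follows. This is an illustration of the description in this comment, not the software's actual code: `classify_read` and `EXON_FRACTION_THRESHOLD` are hypothetical names, and treating 50% as an inclusive cutoff is an assumption.

```python
from typing import Optional

# Assumption: "50% of the read" is treated as an inclusive cutoff.
EXON_FRACTION_THRESHOLD = 0.5


def classify_read(exon_fraction: float, in_gene: bool) -> Optional[str]:
    """Label a read as 'spliced' or 'unspliced' following the described rule.

    exon_fraction: fraction of the read's bases that overlap an exon (0.0 to 1.0).
    in_gene: whether the read maps to an annotated gene at all.
    """
    if not in_gene:
        return None  # reads outside genes are not counted in either matrix
    if exon_fraction >= EXON_FRACTION_THRESHOLD:
        return "spliced"  # treated as exonic
    return "unspliced"  # treated as intronic; nothing is labeled "spanning"


labels = [classify_read(f, True) for f in (1.0, 0.6, 0.5, 0.2, 0.0)]
print(labels)
```

Because every read that maps to a gene falls into one of the two branches, no read is left over for a separate spanning category, which is consistent with the spanning matrix being empty.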

h4rvey-g commented 1 day ago

Got it. Thank you.

MGI-tech-bioinformatics / DNBelab_C_Series_HT_scRNA-analysis-software

What is `output/attachment/RNAvelocity_matrix/spanning.mtx.gz`? #129