Closed ddiez closed 4 months ago
Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):
This is the expected behaviour currently. The code that writes out the MTX outputs doesn't have access to the data for that column of the features file. Rather than deviate from the 10X-style output we chose to stub out the column with the "unknown" text.
the value for mito_pct is 0 for all barcodes.
The genes will always be listed in the features file regardless of their abundances: the file constitutes an index for the sparse matrix in the MTX file.
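To illustrate the point about the features file acting as an index, here is a minimal sketch (not the workflow's own code). It assumes a 10x-style layout in which the second tab-separated column of the features file holds the gene name and the MTX entries are 1-based (row, column, count) triplets. The function name `expressed_genes` and the toy values are hypothetical.

```python
def expressed_genes(feature_lines, entries):
    """Return (all listed gene names, names of genes with at least one count).

    feature_lines: lines of a 10x-style features file (id, name, type).
    entries: (row, col, count) triplets with 1-based indices, as in an MTX file.
    """
    # Assumption: column 2 holds the gene name; column 1 is the stubbed "unknown" id.
    names = [line.rstrip("\n").split("\t")[1] for line in feature_lines]
    # A row only appears in the sparse matrix when it has a non-zero count.
    seen_rows = {row for row, _col, count in entries if count > 0}
    return names, [names[row - 1] for row in sorted(seen_rows)]


# Toy data (made up): MT-ND3 is listed but has no entries in the matrix.
feature_lines = [
    "unknown\tMT-ND3\tGene Expression",
    "unknown\tMT-CO1\tGene Expression",
]
entries = [(2, 1, 1), (2, 4, 2)]
listed, expressed = expressed_genes(feature_lines, entries)
```

Here `listed` contains both genes while `expressed` contains only MT-CO1, which is why a gene appearing in the features file says nothing about its abundance.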
Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):
This is the expected behaviour currently. The code that writes out the MTX outputs doesn't have access to the data for that column of the features file. Rather than deviate from the 10X-style output we chose to stub out the column with the "unknown" text.
Ah, I apologize for the noise on this one. For some reason I thought the transcript_raw_feature_bc_matrix also contained gene symbols instead of transcript ids, which was my primary concern. I should be more careful and not submit issues when tired. Sorry about that. I guess it would be better if gene_raw_feature_bc_matrix had the Ensembl gene ids instead of "unknown", but I also agree this is better than deviating from the 10x output.
the value for mito_pct is 0 for all barcodes.
The genes will always be listed in the features file regardless of their abundances: the file constitutes an index for the sparse matrix in the MTX file.
Yes, I understand this, but I fear I did not explain the issue properly. For example, in the demo data this is a sample of the counts for mitochondrial genes in gene_raw_feature_bc_matrix:
MT-ATP6 1 . 1 . . . . 1 . 1
MT-CO1 . 1 1 2 . 1 1 . 1 1
MT-CO2 . . 1 . 1 . . . . 1
MT-CO3 1 . 1 1 . . . 1 1 1
MT-CYB . . . 1 . . . . . .
MT-ND1 . . . 2 . . . . . .
MT-ND2 . . . . . . . . 1 1
MT-ND3 . . . . . . . . . .
MT-ND4 . . . . . . . 1 . 1
MT-ND4L . . . . . . . . . .
MT-ND5 . . . . . . . . . .
In spite of this, the file gene.expression.mito-per-cell.tsv shows a mito_pct of 0 for all barcodes, and the UMAP of mitochondrial percentage content in the report (wf-single-cell-report.html) shows all zero values.
I think this is only an issue with the report and perhaps the gene.expression.mito-per-cell.tsv file, since the mito data is correctly included in the matrix file we use for analysis.
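For anyone wanting to cross-check the reported values against the matrix, a rough sketch of the calculation follows. This is not the workflow's implementation: it assumes mitochondrial genes are identified by an "MT-" name prefix (matched case-insensitively so mouse "mt-" names are also caught) and that the percentage is mitochondrial counts over total counts per barcode. The function name `mito_pct_per_barcode` and the toy values are made up for illustration.

```python
from collections import defaultdict


def mito_pct_per_barcode(entries, features, barcodes, prefix="MT-"):
    """Percentage of counts per barcode that come from mitochondrial genes.

    entries: (row, col, count) triplets with 1-based indices, as in an MTX file.
    features: gene names, one per matrix row.
    barcodes: barcodes, one per matrix column.
    """
    total = defaultdict(int)
    mito = defaultdict(int)
    for row, col, count in entries:
        barcode = barcodes[col - 1]
        total[barcode] += count
        # Assumption: mitochondrial genes are recognised by name prefix only.
        if features[row - 1].upper().startswith(prefix):
            mito[barcode] += count
    # Barcodes with no counts at all are reported as 0.0 rather than dividing by zero.
    return {
        bc: (100.0 * mito[bc] / total[bc] if total[bc] else 0.0)
        for bc in barcodes
    }


# Toy data (made up, not taken from the demo dataset).
features = ["MT-CO1", "MT-ATP6", "ACTB"]
barcodes = ["AAAC", "AAAG"]
entries = [(1, 1, 1), (3, 1, 3), (2, 2, 1), (3, 2, 1)]
pct = mito_pct_per_barcode(entries, features, barcodes)
```

With these toy counts, barcode AAAC has 1 of 4 counts from a mitochondrial gene and AAAG has 1 of 2, so any barcode with non-zero mitochondrial counts in the matrix should not end up with a mito_pct of 0.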
Sorry, I clicked send before writing everything I meant to write.
The code for handling and transforming the counts was almost entirely rewritten (twice! A first pass to rationalise memory use, a second for performance). It's very possible we've introduced a bug there. We need to add more tests to the code to catch this stuff!
We'll take a look at this early next week. (Be aware we released a patch v2.0.1 -- this does not contain a fix for this issue).
@cjw85 thanks for letting me know! I will keep an eye on new versions.
v2.0.2 should make its way to GitHub this afternoon and fixes the zeroes in the gene.expression.mito-per-cell.tsv file.
Operating System
Other Linux (please specify below)
Other Linux
Ubuntu 23.10
Workflow Version
v2.0.0
Workflow Execution
Command line (Local)
Other workflow execution
No response
EPI2ME Version
No response
CLI command run
nextflow run epi2me-labs/wf-single-cell \
  -w workspace \
  -r master \
  -profile standard \
  --fastq wf-single-cell-demo/fastq/A/chr17.fq.gz \
  --kit_name 3prime \
  --kit_version v3 \
  --expected_cells 500 \
  --ref_genome_dir ~/10x/refdata-gex/refdata-gex-GRCh38-2020-A \
  --out_dir single-cell-demo-out_latest_single \
  --umap_n_repeats 1
Workflow Execution - CLI Execution Profile
standard (default)
What happened?
First of all, congrats on a great v2 release with so many improvements. The pipeline now runs quickly and successfully on all my datasets. Thanks for the great work. I have found a couple of problems, which I detail here:
Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):
In *.expression.mito-per-cell.tsv the value for mito_pct is 0 for all barcodes. I confirm that mitochondrial genes are found in the features.tsv.gz file:
I have found these problems in my own datasets too, both in human and mouse samples.
Relevant log output
Application activity log entry
No response
Were you able to successfully run the latest version of the workflow with the demo data?
yes
Other demo data information
No response