epi2me-labs / wf-single-cell


Only unknown transcripts #140

Closed HenriettaHolze closed 1 month ago

HenriettaHolze commented 2 months ago

Ask away!

Hi, I would like to look at transcript usage, but all transcripts are "unknown". The file transcript_processed_feature_bc_matrix/features.tsv.gz contains only "unknown" transcripts:

unknown_00000   ENST00000000233 Gene Expression
unknown_00001   ENST00000000412 Gene Expression
unknown_00002   ENST00000000442 Gene Expression
unknown_00003   ENST00000001008 Gene Expression
unknown_00004   ENST00000002125 Gene Expression

The transcriptome.gff.gz looks as follows:

chr3    StringTie       transcript      9789378 9792721 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; gene_name "ARPC4"; xloc "XLOC_000017"; ref_gene_id "ENSG00000241553"; cmp_ref "ENST00000498623"; class_code "o"; tss_id "TSS21";
chr3    StringTie       exon    9789378 9789963 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "1";
chr3    StringTie       exon    9791260 9791487 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "2";
chr3    StringTie       exon    9792209 9792721 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "3";

The pipeline takes a GTF file with transcript information as input.
Does this output mean that none of the transcripts in the GFF file were recovered, or that the pipeline does not match transcripts identified by StringTie to known transcripts? Alternatively, is there an option to skip StringTie and only quantify the known transcripts given in the GTF file?

Cheers,
Henrietta

HenriettaHolze commented 2 months ago

Never mind, I clearly just can't read.
The known transcript IDs are given in the second column of the features.tsv.gz file; I only need to match them to the genes.
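
A minimal sketch of that matching step, for anyone landing here with the same question. It is not part of the workflow. It assumes that features.tsv.gz is tab-separated with the ENST ID in the second column (as in the excerpt above), and that the reference GTF carries `transcript_id` and `gene_name` attributes in the usual Ensembl/GENCODE style. The function names are hypothetical.

```python
import csv
import re

# Matches quoted GTF attributes such as: gene_name "ARPC4";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def strip_version(transcript_id):
    """Drop a trailing version (ENST00000000233.10 -> ENST00000000233).

    Only applied to Ensembl-style IDs, so dotted IDs from other tools
    are left untouched.
    """
    if transcript_id.startswith("ENS"):
        return transcript_id.split(".")[0]
    return transcript_id


def transcript_to_gene(gtf_lines):
    """Build {transcript_id: gene_name} from the lines of a reference GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid:
            # Fall back to gene_id when the GTF has no gene_name attribute.
            gene = attrs.get("gene_name") or attrs.get("gene_id", "")
            mapping[strip_version(tid)] = gene
    return mapping


def annotate_features(feature_lines, tx2gene):
    """Return (transcript_id, gene) pairs in features.tsv row order.

    Assumption: the transcript ID sits in the second tab-separated column.
    Transcripts missing from the GTF are reported with gene "NA".
    """
    pairs = []
    for row in csv.reader(feature_lines, delimiter="\t"):
        if len(row) < 2:
            continue
        tid = strip_version(row[1])
        pairs.append((tid, tx2gene.get(tid, "NA")))
    return pairs
```

Both functions take any iterable of text lines, so the gzipped files can be passed in directly with `gzip.open(path, "rt")`. Because the row order is preserved, the resulting gene list lines up with the rows of the matrix file.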

Different question now, though: all features in transcript_raw_feature_bc_matrix/features.tsv.gz and transcript_processed_feature_bc_matrix/features.tsv.gz have an ENST... ID, but transcriptome.gff.gz contains transcripts without the cmp_ref attribute. Are the novel transcripts not quantified per cell?
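
To see how many assembled transcripts fall into each group, a short sketch along these lines could help. It assumes transcriptome.gff.gz is tab-separated with GTF-style quoted attributes, as in the excerpt above, and it treats a missing `cmp_ref` as "no reference match", which is the reading used in this thread rather than documented workflow behaviour. The function name is hypothetical.

```python
import re

# Matches quoted attributes such as: cmp_ref "ENST00000498623";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def split_by_cmp_ref(gff_lines):
    """Separate transcript records by presence of a cmp_ref attribute.

    Returns (matched, unmatched):
      matched   -- {transcript_id: (cmp_ref, class_code)}
      unmatched -- list of transcript_ids with no cmp_ref attribute
    """
    matched = {}
    unmatched = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only transcript rows carry cmp_ref; exon rows are skipped.
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if "cmp_ref" in attrs:
            matched[tid] = (attrs["cmp_ref"], attrs.get("class_code", ""))
        else:
            unmatched.append(tid)
    return matched, unmatched
```

Pass the lines in with `gzip.open("transcriptome.gff.gz", "rt")`. Note that a `cmp_ref` value does not by itself mean an exact match: the record above has class_code "o", and if these attributes come from gffcompare, only class code "=" denotes a complete intron-chain match.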

Cheers

nrhorner commented 2 months ago

Hi @HenriettaHolze

Yes, at the moment novel isoforms are not quantified in the expression matrices.

Cheers,

Neil.

HenriettaHolze commented 2 months ago

Thanks for the confirmation, @nrhorner. Is there a way I could create an expression matrix of known and novel isoforms?

MustafaElshani commented 1 month ago

I'm really interested in this too; it would be great to have.

nrhorner commented 1 month ago

Hi both. This is something that will be added in a future release.

nrhorner commented 1 month ago

Closing for now. We will let you know when we have released a version with this feature.